
n-waves / multifit

The code to reproduce results from the paper "MultiFiT: Efficient Multi-lingual Language Model Fine-tuning" https://arxiv.org/abs/1909.04761

MIT License

284 stars, 56 forks

Tokenizer #71

Closed: javithe7 closed this issue 4 years ago

javithe7 commented 4 years ago

Hi everyone,

Does anyone know which tokenizer MultiFiT uses (especially for Spanish texts), and which method it uses to vectorize them? I'd like to be able to tokenize and vectorize texts in the same way that MultiFiT does internally.

eisenjulian commented 4 years ago

Hello @javithe7, we use SentencePiece tokenization, which has been added to fastai directly. You can check the documentation at https://docs.fast.ai/text.data.html#TextList
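
For anyone unsure what "tokenize and vectorize" means with a subword tokenizer, here is a minimal toy sketch of the two steps: splitting text into subword pieces, then mapping each piece to an integer id. It is not the SentencePiece algorithm and not MultiFiT's or fastai's code. The vocabulary, the greedy longest-match rule, and the names `tokenize`, `numericalize`, and `VOCAB` are all assumptions made up for illustration. A real SentencePiece model learns its pieces from a training corpus.

```python
# Toy illustration only: NOT the SentencePiece algorithm and NOT MultiFiT's code.
# The vocabulary is hand-written and hypothetical; a real model learns it from data.

WORD_START = "\u2581"  # the marker SentencePiece uses to represent a preceding space
UNK = "<unk>"

# Hypothetical subword vocabulary: piece -> integer id (id = position in the list).
VOCAB = {piece: idx for idx, piece in enumerate([
    UNK,
    WORD_START,
    WORD_START + "hola",
    WORD_START + "mundo",
    WORD_START + "token",
    "iza",
    "dor",
    WORD_START + "el",
])}


def tokenize(text, vocab=VOCAB):
    """Split text into subword pieces by greedy longest-match lookup."""
    # Lowercase, then replace whitespace with the word-start marker.
    marked = WORD_START + WORD_START.join(text.lower().split())
    pieces, i = [], 0
    while i < len(marked):
        # Try the longest candidate first, shrinking until a piece matches.
        for j in range(len(marked), i, -1):
            if marked[i:j] in vocab:
                pieces.append(marked[i:j])
                i = j
                break
        else:
            # No piece starts at this character: emit the unknown token.
            pieces.append(UNK)
            i += 1
    return pieces


def numericalize(pieces, vocab=VOCAB):
    """Map each piece to its integer id (the form a model actually consumes)."""
    return [vocab.get(piece, vocab[UNK]) for piece in pieces]


pieces = tokenize("Hola mundo el tokenizador")
ids = numericalize(pieces)
print(pieces)
print(ids)  # [2, 3, 7, 4, 5, 6]
```

Here "tokenizador" is not in the toy vocabulary as a whole word, so it is split into the three pieces it does contain. The practical point is that matching MultiFiT's preprocessing requires the same trained tokenizer model and vocabulary, since the ids only have meaning relative to that vocabulary.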