n-waves / multifit

Code to reproduce the results from the paper "MultiFiT: Efficient Multi-lingual Language Model Fine-tuning" https://arxiv.org/abs/1909.04761
MIT License

Why is ulmfit/postprocess_wikitext.py necessary? #44

Closed ohmeow closed 5 years ago

ohmeow commented 5 years ago

As building the vocab is now part of the Data Block API's pre-processing, what is the need for this step when preparing the Wikipedia data for training?

In particular, my questions are:

  1. Why build the vocab here and convert OOV tokens to `<unk>` when it can be (and is) done as a pre-processing step via the DataBlock API? (A sketch of what this looks like follows the list.)

  2. What is the purpose of the "-unk" folder? It doesn't seem like it's being used anywhere else (though I may be mistaken), so I'm just wondering why it exists.

  3. What is the reasoning behind the "replace_numbers" function?
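For context, here is a minimal sketch of what this kind of post-processing typically does: cap the vocab by frequency, map OOV tokens to an unknown marker, and normalize numbers. The function names, the `<unk>` marker, and the digit-collapsing rule are all illustrative assumptions, not the repo's actual implementation:

```python
import re
from collections import Counter

UNK = "<unk>"  # assumed OOV marker; the repo's actual token may differ

def build_vocab(token_lists, max_vocab=60_000):
    """Keep only the max_vocab most frequent tokens (illustrative cap)."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return {tok for tok, _ in counts.most_common(max_vocab)}

def replace_oov(tokens, vocab):
    """Map every out-of-vocabulary token to the UNK marker."""
    return [tok if tok in vocab else UNK for tok in tokens]

def replace_numbers(tokens):
    """Illustrative stand-in for the script's replace_numbers:
    collapse digits so e.g. '1999' and '2007' both become '0000'."""
    return [re.sub(r"\d", "0", tok) for tok in tokens]
```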

I'm looking at some of the approaches fastai folks are using to build pre-trained LMs in various languages, and I don't see them implementing the same post-processing (in fact, the approaches all seem to vary a little). Anyhow, I'm just trying to get an understanding of how things should/need to be processed in a fastai v1 world.

Thanks

PiotrCzapla commented 5 years ago

That was done following the way the WikiText-103 dataset was created: smerity removed the numbers and limited the vocab. But that isn't necessary, and most of the time we train on unrestricted Wikipedia. I haven't removed it yet, as it might come in handy later and it doesn't hurt.
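For reference, the fastai v1 data block pipeline already caps the vocab and maps everything else to its own unknown token (`xxunk`) during numericalization, which is why the script is redundant when training on unrestricted Wikipedia. A minimal sketch, assuming fastai v1; the path and sizes are placeholders:

```python
from fastai.text import TextList, TokenizeProcessor, NumericalizeProcessor

# NumericalizeProcessor limits the vocab to the most frequent tokens
# and replaces the rest with xxunk at load time.
processor = [TokenizeProcessor(),
             NumericalizeProcessor(max_vocab=60_000, min_freq=2)]

data_lm = (TextList.from_folder('wiki/es', processor=processor)
           .split_by_rand_pct(0.1)
           .label_for_lm()
           .databunch(bs=64))
```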