n-waves / multifit

Code to reproduce the results from the paper "MultiFiT: Efficient Multi-lingual Language Model Fine-tuning" https://arxiv.org/abs/1909.04761
MIT License

Why is ulmfit/postprocess_wikitext.py necessary? #44

Closed ohmeow closed 5 years ago

ohmeow commented 5 years ago

As building the vocab is now part of the Data Block API's pre-processing, what is the need for this step when preparing the Wikipedia data for training?

In particular, my questions are:

  1. Why build the vocab here and convert OOV tokens to `<unk>` when it can be (and is) done as a pre-processing step via the DataBlock API? (A sketch of what this looks like follows the list.)

  2. What is the purpose of the "-unk" folder? It doesn't seem like it's being used anywhere else (though I may be mistaken), so I'm just wondering why it exists.

  3. What is the reasoning behind the "replace_numbers" function?
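For context, here is a minimal sketch of what this kind of post-processing typically does: cap the vocab by frequency, map OOV tokens to an unknown marker, and normalize numbers. The function names, the `<unk>` marker, and the digit-collapsing rule are all illustrative assumptions, not the repo's actual implementation:

```python
import re
from collections import Counter

UNK = "<unk>"  # assumed OOV marker; the repo's actual token may differ

def build_vocab(token_lists, max_vocab=60_000):
    """Keep only the max_vocab most frequent tokens (illustrative cap)."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return {tok for tok, _ in counts.most_common(max_vocab)}

def replace_oov(tokens, vocab):
    """Map every out-of-vocabulary token to the UNK marker."""
    return [tok if tok in vocab else UNK for tok in tokens]

def replace_numbers(tokens):
    """Illustrative stand-in for the script's replace_numbers:
    collapse digits so e.g. '1999' and '2007' both become '0000'."""
    return [re.sub(r"\d", "0", tok) for tok in tokens]
```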

I'm looking at some of the approaches fastai folks are using to build pre-trained LMs in various languages, and I don't see them implementing the same post-processing (in fact, the approaches all seem to vary a little). Anyhow, I'm just trying to get an understanding of how things should/need to be processed in a fastai v1 world.

Thanks

PiotrCzapla commented 5 years ago

That was done following the way the WikiText-103 dataset was created: smerity removed the numbers and limited the vocab. But that isn't necessary, and most of the time we train on unrestricted Wikipedia. I haven't removed it yet, as it might come in handy later and it doesn't hurt.
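For reference, the fastai v1 data block pipeline already caps the vocab and maps everything else to its own unknown token (`xxunk`) during numericalization, which is why the script is redundant when training on unrestricted Wikipedia. A minimal sketch, assuming fastai v1; the path and sizes are placeholders:

```python
from fastai.text import TextList, TokenizeProcessor, NumericalizeProcessor

# NumericalizeProcessor limits the vocab to the most frequent tokens
# and replaces the rest with xxunk at load time.
processor = [TokenizeProcessor(),
             NumericalizeProcessor(max_vocab=60_000, min_freq=2)]

data_lm = (TextList.from_folder('wiki/es', processor=processor)
           .split_by_rand_pct(0.1)
           .label_for_lm()
           .databunch(bs=64))
```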