maaario / nlp-slovak-universal-language-model


Replicating English universal language model #2

Open maaario opened 5 years ago

maaario commented 5 years ago

Try running all scripts from https://github.com/fastai/fastai/tree/master/courses/dl2/imdb_scripts. The underlying ideas are described in the paper: https://aclweb.org/anthology/P18-1031

Document the issues found and any insights in the wiki.

buj commented 5 years ago

Make sure you have Python 3.6 or greater, as the scripts make use of f-strings (and who knows what else).

Step 0

Do not download the English wiki dataset; it is huge. Either download the Slovak one, or somehow make the script download only a small portion of the English dataset.

bash prepare_wiki.sh

prepare_wiki.sh fails when it tries to run the sub-scripts python wikiextractor/WikiExtractor.py -s --json -o "${EXTR_PATH}" "${DUMP_PATH}" and python merge_wiki.py -i "${EXTR_PATH}" -o "${OUT_PATH}" (Python complains about "no such file"). These need to be run manually:

python wikiextractor/WikiExtractor.py -s --json -o "data/wiki_extr/sk" "data/wiki_dumps/skwiki-latest-pages-articles.xml.bz2"
python merge_wiki.py -i "data/wiki_extr/sk" -o "data/wiki/sk"

Also, merge_wiki.py is supposed to split the data into two piles, training and validation, but it fails to do so in a reasonable way: it picks a threshold N, puts the first N examples into the training set and the rest into the validation set. Thus, when N is large enough, there will be no validation set at all. More reasonable behaviour would be a proportional split, e.g. 90% training and 10% validation, along the lines of the sketch below.
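
For illustration, a minimal sketch of such a proportional split, assuming the articles are already collected in a list of strings (the function and variable names are hypothetical, not taken from merge_wiki.py):

import random

def split_train_valid(articles, valid_fraction=0.1, seed=42):
    # shuffle and split into roughly 90% train / 10% validation
    articles = list(articles)
    random.Random(seed).shuffle(articles)
    n_valid = max(1, int(len(articles) * valid_fraction))
    return articles[n_valid:], articles[:n_valid]

# usage (hypothetical):
# train_articles, valid_articles = split_train_valid(all_articles)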

Step 1

From now onwards, the fastai package is required. But not just any version of it: it must be the old one (0.7.0)! Furthermore, it lists torch<0.4.0 as one of its requirements (in requirements.txt), but you may have newer versions of torchvision and torchtext installed, which require a newer torch. Thus we have to pin these two to older versions as well:

pip install "torchvision<0.3"
pip install "torchtext<0.3.1"
svn checkout https://github.com/fastai/fastai/trunk/old
cd old
pip install .

If you happen to have CUDA, you're a lucky man (in a moment you will see why). Instead of installing the CPU-only version of PyTorch, install a CUDA version (replace cpu with cuXX, where XX denotes the CUDA version: 80, 90, 92 or 100). If you do not use pip but something else, follow the instructions at https://pytorch.org/get-started/previous-versions/.

Also, the fire package is required, and spaCy's multilingual xx model is needed (for the Slovak data):

pip install fire
python -m spacy download xx
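
As a quick sanity check that the multilingual model is loadable, something like the following should work (an optional snippet, not part of the original scripts; it assumes a spaCy 2.x shortcut link named xx, which is what the download command above creates):

import spacy

nlp = spacy.load('xx')  # shortcut link created by `python -m spacy download xx`
doc = nlp('Toto je krátka slovenská veta.')
print([token.text for token in doc])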

Now you're ready to run the create_toks.py script. If you have a shitty machine like me, however, you will want to process only a small portion of the dataset. For example, one can easily modify the script to process only the first N chunks of each dataset (where N is given by the user); a sketch of the idea follows after the command below.

python create_toks.py data/wiki/sk --lang xx
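
A minimal, hypothetical sketch of the "first N chunks" idea: the script reads the CSVs in pandas chunks, so it is enough to stop the loop early. The file name, chunk size and column index below are illustrative assumptions, not verbatim from create_toks.py.

import itertools
import pandas as pd

N_CHUNKS = 10        # hypothetical user-supplied limit
CHUNKSIZE = 24000    # rows per chunk; the fastai scripts use a similar chunked read

reader = pd.read_csv('data/wiki/sk/train.csv', header=None, chunksize=CHUNKSIZE)

texts = []
for df in itertools.islice(reader, N_CHUNKS):   # keep only the first N chunks
    texts.extend(df[1].astype(str).tolist())    # assuming the text sits in column 1

print(f'kept {len(texts)} documents from the first {N_CHUNKS} chunks')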

Step 2

If you did not narrow down the datasets and you have a shitty machine, prepare for some system freezes.

But first, we must modify the scripts so that they actually run. In all .py scripts, whenever there is a np.load(...), replace it with np.load(..., allow_pickle=True); newer versions of numpy default to allow_pickle=False, so loading the pickled object arrays fails otherwise. In tok2id.py there are 2 occurrences of this.
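
For illustration, the change looks like this (the path is just an example, not the exact one used in tok2id.py):

from pathlib import Path
import numpy as np

tmp_path = Path('data/wiki/sk/tmp')  # example location only

# before -- fails on numpy >= 1.16.3, which defaults to allow_pickle=False:
# trn_tok = np.load(tmp_path / 'tok_trn.npy')

# after -- explicitly allow loading pickled object arrays:
trn_tok = np.load(tmp_path / 'tok_trn.npy', allow_pickle=True)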

After that small adjustment, the script should be runnable.

python tok2id.py data/wiki/sk

Step 3a

If you do not have CUDA, you must replace all occurrences of CUDA-specific constructions in the scripts with constructions that do not use CUDA; the code is not CUDA/CPU agnostic. Ctrl-F all occurrences of .cuda() and delete them. There is also one occurrence of cuda.FloatTensor in sampled_sm.py, which should be replaced with plain FloatTensor. Other occurrences of cuda should not pose a problem on CPU-only systems.
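
The edits look roughly like this (the tensors below are illustrative, not lines taken verbatim from the scripts):

import torch

# before: the tensor is created on the GPU
# weights = torch.zeros(5, 3).cuda()
# after: simply drop the .cuda() call
weights = torch.zeros(5, 3)

# before (the sampled_sm.py case):
# bias = torch.cuda.FloatTensor(5).zero_()
# after: use the CPU tensor type instead
bias = torch.FloatTensor(5).zero_()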

Now, the pretrain_lm.py script should run just fine.

python pretrain_lm.py data/wiki/sk -1 --lr 1e-3 --cl 12

Except that (on my machine) it takes 30 s to run one iteration, there are 2000 iterations in one epoch, and there are 12 epochs, which works out to roughly 200 hours (over 8 days) of CPU time.

Step 3b

To finetune the model, one has to have a pretrained model. We couldn't pretrain one due to a lack of computational resources, and the models available for download are for the English language only.

Thus this is a rather disappointing ending. I suggest we either shrink the dataset very significantly, or obtain access to some machine with CUDA...