n-waves / multifit

Code to reproduce the results from the paper "MultiFiT: Efficient Multi-lingual Language Model Fine-tuning" (https://arxiv.org/abs/1909.04761).
MIT License

Different size of CLS unsupervised data between .csv and original .xml files #63

Closed: blazejdolicki closed this issue 4 years ago

blazejdolicki commented 4 years ago

Here, the de-books data used for fine-tuning the LM has size 152523 + 16947 = 169470, which matches the total size of the original data from the .xml file, also 169470. However, when I run `python prepare_cls.py https://storage.googleapis.com/ulmfit/cls`, the downloaded de.unsup.csv file has 29999 items. I checked, and the sizes of the train and test sets match the logs in the link. So for some reason not all of the data is currently available in the .csv files, and as a result the achieved results are worse than the ones from the link (which correspond to the results in the paper). Is there an explanation for that?
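For anyone hitting the same discrepancy, a quick way to reproduce the check is to count the rows of the downloaded CSV against the .xml totals. A minimal sketch, assuming pandas is installed, that the file lands at `de/books/de.unsup.csv`, and that it has no header row (the path and layout are guesses, not confirmed by the repo):

```python
# Count rows in the CSV that prepare_cls.py downloads and compare them
# against the .xml totals quoted above. The local path and the no-header
# assumption are guesses, not confirmed by the repo.
import pandas as pd

unsup = pd.read_csv("de/books/de.unsup.csv", header=None)
print(f"de.unsup.csv rows: {len(unsup)}")  # reportedly 29999

expected = 152523 + 16947  # split sizes quoted in the issue
print(f"expected rows from .xml: {expected}")  # 169470
```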

eisenjulian commented 4 years ago

Hello Błażej, thanks for your interest in the project, and sorry for the delay. I recall that the unsupervised portion was capped when running the original script. The full dataset, which has a size consistent with your calculation, is at https://storage.googleapis.com/ulmfit/cls-full/de/books/unlabeled.csv

Let me know where you found the other URL and we can update it in the repo, or feel free to submit a PR.
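As a sanity check, the full file linked above can be fetched and its row count compared against the .xml total. A minimal sketch, again assuming pandas and a header-less CSV (the layout assumption is mine; only the URL comes from the thread):

```python
# Fetch the full unlabeled set and verify its row count matches
# 152523 + 16947 = 169470. Whether the CSV has a header row is an
# assumption, not confirmed by the repo.
import urllib.request
import pandas as pd

url = "https://storage.googleapis.com/ulmfit/cls-full/de/books/unlabeled.csv"
urllib.request.urlretrieve(url, "unlabeled.csv")

full = pd.read_csv("unlabeled.csv", header=None)
print(f"unlabeled.csv rows: {len(full)}")  # should be 169470
```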