n-waves / multifit

Code to reproduce the results from the paper "MultiFiT: Efficient Multi-lingual Language Model Fine-tuning" (https://arxiv.org/abs/1909.04761).
MIT License

Different size of CLS unsupervised data between .csv and original .xml files #63

Closed: blazejdolicki closed this issue 4 years ago

blazejdolicki commented 4 years ago

Here, the de-books data used for fine-tuning the LM has size 152523 + 16947 = 169470, which matches the total size of the original data from the .xml file, also 169470. However, when I run `python prepare_cls.py https://storage.googleapis.com/ulmfit/cls`, the downloaded de.unsup.csv file has 29999 items. I checked, and the sizes of the train and test sets match the logs in the link. So for some reason not all of the data is currently available in the .csv files, and as a result the achieved results are worse than the ones from the link (which correspond to the results in the paper). Is there an explanation for that?
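For anyone hitting the same discrepancy, a quick way to reproduce the check is to count the rows of the downloaded CSV against the .xml totals. A minimal sketch, assuming pandas is installed, that the file lands at `de/books/de.unsup.csv`, and that it has no header row (the path and layout are guesses, not confirmed by the repo):

```python
# Count rows in the CSV that prepare_cls.py downloads and compare them
# against the .xml totals quoted above. The local path and the no-header
# assumption are guesses, not confirmed by the repo.
import pandas as pd

unsup = pd.read_csv("de/books/de.unsup.csv", header=None)
print(f"de.unsup.csv rows: {len(unsup)}")  # reportedly 29999

expected = 152523 + 16947  # split sizes quoted in the issue
print(f"expected rows from .xml: {expected}")  # 169470
```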

eisenjulian commented 4 years ago

Hello Błażej, thanks for your interest in the project, and sorry for the delay. I recall that the unsupervised portion was capped when running the original script. The full dataset, which has a size consistent with your calculation, is at https://storage.googleapis.com/ulmfit/cls-full/de/books/unlabeled.csv

Let me know where you found the other URL and we can update it in the repo, or feel free to submit a PR.
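As a sanity check, the full file linked above can be fetched and its row count compared against the .xml total. A minimal sketch, again assuming pandas and a header-less CSV (the layout assumption is mine; only the URL comes from the thread):

```python
# Fetch the full unlabeled set and verify its row count matches
# 152523 + 16947 = 169470. Whether the CSV has a header row is an
# assumption, not confirmed by the repo.
import urllib.request
import pandas as pd

url = "https://storage.googleapis.com/ulmfit/cls-full/de/books/unlabeled.csv"
urllib.request.urlretrieve(url, "unlabeled.csv")

full = pd.read_csv("unlabeled.csv", header=None)
print(f"unlabeled.csv rows: {len(full)}")  # should be 169470
```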