Hello Błażej, thanks for your interest in the project, and sorry for the delay. I recall that the unsupervised portion was capped when the original script was run. The full dataset, whose size is consistent with your calculation, is at https://storage.googleapis.com/ulmfit/cls-full/de/books/unlabeled.csv
Let me know where you found the other URL and we can update it in the repo, or feel free to submit a PR.
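To sanity-check the replacement file, here is a minimal sketch of how one might download it and confirm its row count against the 169470 figure from the calculation below. It assumes pandas is available and that the CSV has no header row; both the header assumption and the use of pandas are my own choices, not something specified in the thread.

```python
import pandas as pd

# Sketch: stream the full unlabeled set from the URL above and count its rows.
# Assumption: the CSV has no header row (header=None); drop that argument if it does.
url = "https://storage.googleapis.com/ulmfit/cls-full/de/books/unlabeled.csv"
df = pd.read_csv(url, header=None)
print(len(df))  # should print 169470 if the file matches the calculation below
```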
In the linked logs, the de-books data used for fine-tuning the LM has size 152523 + 16947 = 169470, which matches the total size of the original data in the XML file. However, when I run

python prepare_cls.py https://storage.googleapis.com/ulmfit/cls

the downloaded de.unsup.csv file contains only 29999 items. I checked that the train and test set sizes match the logs in the link, so for some reason not all of the data is currently present in the .csv files, and the results I achieve are therefore worse than those in the link (which match the results in the paper). Is there an explanation for this?
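For reference, a minimal sketch of the count described above, assuming de.unsup.csv sits in the working directory (the exact output path of prepare_cls.py is an assumption; adjust it to wherever the script writes the file on your machine):

```python
import csv

# Sketch: count the rows prepare_cls.py actually downloaded.
# Assumption: "de.unsup.csv" is the path where the script wrote the file.
path = "de.unsup.csv"
with open(path, newline="", encoding="utf-8") as f:
    n_rows = sum(1 for _ in csv.reader(f))
print(n_rows)  # reported 29999 here, vs. the expected 169470
```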