n-waves / multifit

The code to reproduce the results from the paper "MultiFiT: Efficient Multi-lingual Language Model Fine-tuning" https://arxiv.org/abs/1909.04761
MIT License

Problems with reproducing zero-shot learning results #67

Open blazejdolicki opened 4 years ago

blazejdolicki commented 4 years ago

I tried replicating the results for zero-shot learning on CLS, but my numbers don't match those from the paper. Since the script for predicting labels with LASER doesn't seem to be part of the MultiFiT repository, I trained a LASER classifier on the CLS dataset (only en and de books for now) by adjusting the MLDoc script from the LASER repo to CLS. My fork of LASER with these adjustments is [here](https://github.com/blazejdolicki/LASER). For the time being I have only tested on books in German. After some hyperparameter tuning performed on the English training set, my best setup obtains 82.25% accuracy, compared to 84.15% in the MultiFiT paper. My hyperparams are:

n_epochs=200 lr=0.001 wd=0.0 nhid="10 8" drop=0.2 seed=1 bsize=12

and I'm using the last 10% of the test set as validation. When I tried to make them more similar to MultiFiT's (n_epochs=8, wd=0.001, bsize=18), the accuracy dropped to around 60%.
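For concreteness, my training setup is roughly equivalent to the sketch below: a small MLP over precomputed 1024-dim LASER embeddings, using the hyperparameters above. The random tensors and the Tanh/Adam choices are placeholders, not the actual LASER eval script.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(1)  # seed=1

# Placeholders for precomputed LASER sentence embeddings (1024-dim) and labels;
# in the real setup these come from embedding the CLS reviews with LASER.
X_train = torch.randn(2000, 1024)
y_train = torch.randint(0, 2, (2000,))

def build_mlp(n_in=1024, nhid=(10, 8), n_classes=2, drop=0.2):
    # Two hidden layers of 10 and 8 units mirror nhid="10 8"; drop=0.2.
    layers, prev = [], n_in
    for h in nhid:
        layers += [nn.Linear(prev, h), nn.Tanh(), nn.Dropout(drop)]
        prev = h
    layers.append(nn.Linear(prev, n_classes))
    return nn.Sequential(*layers)

model = build_mlp()
opt = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.0)  # lr=0.001, wd=0.0
loss_fn = nn.CrossEntropyLoss()
loader = DataLoader(TensorDataset(X_train, y_train), batch_size=12, shuffle=True)  # bsize=12

for epoch in range(200):  # n_epochs=200
    for xb, yb in loader:
        opt.zero_grad()
        loss_fn(model(xb), yb).backward()
        opt.step()
```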

Afterwards, I used the best (82.25% acc) LASER classifier (trained on the English training set) to predict labels for the German books. Then I copied the test, training and unsupervised sets in the MultiFiT repo from the folder de-books into de-books-laser and replaced the ground-truth labels in the training set with pseudo-labels. I then trained the MultiFiT classifier on those pseudo-labels; while my validation accuracy isn't great, it is at least in the right range, but my test set accuracy is as low as 70% (compared to 89.60% from the paper and here), as you can see in the attached logs: Multifit CLS zero shot terrible results 15.04.2020.txt
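The pseudo-labelling step itself amounts to something like the sketch below; the file paths, the "label" column name, and laser_preds.csv (a dump of the LASER classifier's predictions) are assumptions about my local layout, not anything from the MultiFiT repo.

```python
import shutil
import pandas as pd

# Copy the German splits into a new folder for the zero-shot experiment
# (the de-books -> de-books-laser copy described above).
shutil.copytree("data/cls/de-books", "data/cls/de-books-laser")

# Overwrite the ground-truth training labels with the LASER predictions,
# assuming a "label" column and one prediction per training row.
train = pd.read_csv("data/cls/de-books-laser/de.train.csv")
preds = pd.read_csv("laser_preds.csv")
train["label"] = preds["label"].values
train.to_csv("data/cls/de-books-laser/de.train.csv", index=False)
```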

I did expect some drop due to the issue explained in https://github.com/n-waves/multifit/issues/63, but such a big difference shows that the unsupervised set size can't be the only factor deteriorating the results. Other possible reasons for the drop in performance also come to mind.

My fork of multifit is here; I'm using the ulmfit-original-scripts branch.

I would really appreciate a reply :)

eisenjulian commented 4 years ago

Hey Blazej, I updated the other issue with a solution. Can you let me know if that fixed the issue, or if you still cannot reproduce the results?

blazejdolicki commented 4 years ago

Thanks for your response. Using more data helped to some extent, but after some more digging I realized the real issue. The CLS dataset has three columns: the label, a summary, and the actual review text. Initially, in zero-shot learning I was discarding the summary column, thinking it was irrelevant. All that adding the summary does is increase the amount of data used for finetuning the LM, yet after I included it, to my surprise, the classification test results jumped by ~15%! Without the "summary" column the LM had 60% (val) accuracy in the first epoch (out of 20), while with it the accuracy is 37%. I'm not sure why including summaries, which are usually shorter than the main text, makes such a difference. The LM training time per epoch also went from 18 seconds to 2 minutes and 23 seconds.
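Concretely, the fix amounted to something like the following; the file name and the assumption that the CSV columns are (label, summary, text) with no header row are placeholders for however the CLS files are laid out locally:

```python
import pandas as pd

# Assumed layout: three columns (label, summary, review text), no header row.
df = pd.read_csv("de.train.csv", names=["label", "summary", "text"])

# Keep the summary instead of dropping it: the LM finetuning corpus then
# contains both fields, which is what recovered the ~15% test accuracy.
df["text"] = df["summary"].fillna("") + " " + df["text"].fillna("")
df[["label", "text"]].to_csv("de.train.lm.csv", index=False)
```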

Currently my LASER results are still ~2% lower than those from the paper, and so are the zero-shot MultiFiT results, so it's presumably just a matter of differences between my implementation of CLS on LASER and yours. Do you have access to the script that you used to train LASER on CLS? It would be great to compare hyperparameters and check whether they are responsible for this difference.