unicamp-dl / mMARCO

A multilingual version of MS MARCO passage ranking dataset
Apache License 2.0

Larger 13B model underperforms BASE model, any idea why? #22

Open cramraj8 opened 4 months ago

cramraj8 commented 4 months ago

I tried to evaluate both unicamp-dl/mt5-base-en-msmarco and unicamp-dl/mt5-13b-mmarco-100k, but the performance of the 13B model is lower than that of the base model. Here is a simple comparison of reranking the BM25 top-100 results, measured in nDCG@10. Did you observe a similar trend, or could there be any underlying reasons? @rodrigonogueira4

[Image: per-language nDCG@10 comparison of the two rerankers]
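For concreteness, here is a minimal sketch of the kind of reranking setup being evaluated (not necessarily the exact pipeline used): each BM25 candidate is scored with a monoT5-style prompt, and the relevance score is read from the "yes"/"no" logits. The prompt template and the target tokens are assumptions based on common monoT5/mT5 reranker recipes and should be checked against the model card.

```python
# Minimal reranking sketch; prompt format and yes/no target tokens are assumptions.
import torch
from transformers import AutoTokenizer, MT5ForConditionalGeneration

model_name = "unicamp-dl/mt5-base-en-msmarco"  # or "unicamp-dl/mt5-13b-mmarco-100k"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name).eval()

# Assumed monoT5-style target tokens; verify against the model card.
YES_ID = tokenizer.encode("yes", add_special_tokens=False)[0]
NO_ID = tokenizer.encode("no", add_special_tokens=False)[0]

def rerank(query: str, passages: list[str]) -> list[tuple[str, float]]:
    """Score each BM25 candidate and return passages sorted by P('yes')."""
    scored = []
    for passage in passages:
        prompt = f"Query: {query} Document: {passage} Relevant:"
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            out = model.generate(
                **inputs,
                max_new_tokens=1,  # only the first decoded token is needed
                output_scores=True,
                return_dict_in_generate=True,
            )
        # Logits of the first generated token; softmax over the yes/no pair.
        yes_no = out.scores[0][0, [NO_ID, YES_ID]]
        scored.append((passage, torch.softmax(yes_no, dim=0)[1].item()))
    return sorted(scored, key=lambda x: x[1], reverse=True)
```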

rodrigonogueira4 commented 4 months ago

Hi @vjeronymo2 @lhbonifacio, do you maybe have a hint of what is going on here?

lhbonifacio commented 4 months ago

Hi @cramraj8 From the languages in your results I guess you are using Mr. TyDi, right? I would say the main issues here are the jump from 580M parameters (mT5-base) to 13B and the multilingual setting.

As a hint, we observed similar results when trying to fine-tune mT5 models for 10k steps (the number of steps that gave better results for the English monoT5 version). However, fine-tuning for 10k steps in a multilingual scenario was simply not enough for the model to learn the reranking task, which is why you cannot find any multilingual model fine-tuned for just 10k steps in our Hugging Face hub. You are scaling up the number of parameters, but not scaling the training data along with it, so I would say that's the reason here.

cramraj8 commented 4 months ago

Hi @lhbonifacio, yes, I am evaluating on Mr. TyDi. I am a bit confused here.

If I interpret your reply correctly, the English monoT5 version reaches its best performance with only 10k training steps. mT5-base does not reach its best at 10k, so you had to train for 100k steps to see improvements. However, mT5-13B trained for 100k steps is still not optimal, because the model size has increased from base to 13B and we should therefore train on an even larger amount of data. Is that accurate?

In summary, in the context of multilingual re-ranking, when the model size increases (580M --> 13B), should we also increase the number of training iterations or the training sample size?