google-research / xtreme

XTREME is a benchmark for the evaluation of the cross-lingual generalization ability of pre-trained multilingual models that covers 40 typologically diverse languages and includes nine tasks.
https://sites.research.google/xtreme
Apache License 2.0

XQuAD results reproducibility for mBERT #8

Closed MaksymDel closed 4 years ago

MaksymDel commented 4 years ago

Hi, thanks for the benchmark and the accompanying code!

I am trying to replicate the XQuAD scores from the XTREME paper using this repo's code. I ran the mBERT cased model with default parameters and strictly followed the instructions in the README file.

However, the results for some languages are much lower than the scores from the paper. In particular, for vi and th the gap is roughly two-fold, and there is also a significant drop for hi and el. The results for e.g. en, es, and de, on the other hand, are comparable.

Below I provide a table with the scores I just obtained from running the code, together with the corresponding numbers from the paper. @sebastianruder, could I please ask you to take a look at it?

XQuAD scores from this repo's code vs. the numbers reported in the paper:

| Language | F1 (repro) | EM (repro) | F1 (paper) | EM (paper) |
| --- | --- | --- | --- | --- |
| en | 83.86 | 71.76 | 83.5 | 72.2 |
| es | 73.27 | 53.95 | 75.5 | 56.9 |
| de | 69.47 | 52.35 | 70.6 | 54.0 |
| el | 48.95 | 33.61 | 62.6 | 44.9 |
| ru | 69.83 | 52.10 | 71.3 | 53.3 |
| tr | 46.14 | 32.35 | 55.4 | 40.1 |
| ar | 59.73 | 42.52 | 61.5 | 45.1 |
| vi | 33.11 | 15.21 | 69.5 | 49.6 |
| th | 24.88 | 15.29 | 42.7 | 33.5 |
| zh | 58.65 | 48.99 | 58.0 | 48.3 |
| hi | 38.31 | 22.44 | 59.2 | 46.0 |
sebastianruder commented 4 years ago

Hi Max, thanks for your interest. For training BERT models on the QA tasks, we actually used the original BERT codebase, as that was faster with Google infrastructure (see Appendix B in the paper). I'll check that the same results can be obtained with Transformers and will get back to you.

MaksymDel commented 4 years ago

Thanks, Sebastian!

Interesting to see whether differences in hyperparameters caused such a gap. I can immediately see several choices hardcoded in Google's codebase that differ from what is passed in the Transformers version:

1. linear learning rate decay in Transformers vs. polynomial LR decay in Google's script
2. weight_decay=0.0001 in Transformers vs. weight_decay_rate=0.01 in Google's script
3. Adam epsilon=1e-8 in Transformers vs. 1e-6 in Google's script

So unless you manually changed these values in Google's script, these are some of the notable differences.
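For anyone who wants to poke at these differences, below is a minimal sketch (my own illustration based on the list above, not the configuration used for the paper or hardcoded in this repo) of setting up the Transformers-side optimizer with the BERT-codebase-style values; the function name, step counts, and learning rate are placeholders.

```python
# Sketch only: BERT-codebase-style optimizer settings for a Transformers
# fine-tuning loop, following the differences listed above. Not claimed to
# be the paper's configuration.
import torch
from transformers import get_polynomial_decay_schedule_with_warmup


def build_optimizer(model, lr=3e-5, num_train_steps=10_000, warmup_steps=1_000,
                    decay_power=1.0):
    # Weight decay 0.01 and Adam epsilon 1e-6 (the Google-script values noted
    # above) instead of the Transformers run_squad.py defaults.
    no_decay = ("bias", "LayerNorm.weight")
    grouped_params = [
        {"params": [p for n, p in model.named_parameters()
                    if not any(nd in n for nd in no_decay)],
         "weight_decay": 0.01},
        {"params": [p for n, p in model.named_parameters()
                    if any(nd in n for nd in no_decay)],
         "weight_decay": 0.0},
    ]
    optimizer = torch.optim.AdamW(grouped_params, lr=lr, eps=1e-6)
    # Polynomial LR decay with warmup; the exact decay power used in the
    # Google script would still need to be checked against its optimization code.
    scheduler = get_polynomial_decay_schedule_with_warmup(
        optimizer,
        num_warmup_steps=warmup_steps,
        num_training_steps=num_train_steps,
        lr_end=0.0,
        power=decay_power,
    )
    return optimizer, scheduler
```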

Meanwhile, I would also like to confirm that the issue appears to be limited to mBERT: for XLM-R I got the following avg numbers, 76.7 / 61.0, which is on par with the 76.6 / 60.8 reported in the paper.

sebastianruder commented 4 years ago

Thanks for the note, Max. Yes, these are some of the settings that probably explain the difference in performance. For XLM-R, we went with the implementation (and the default hyper-parameters) in Transformers, so that should work out of the box as expected.

Liangtaiwan commented 4 years ago

@maksym-del, @sebastianruder If you use scripts/train_qa.sh and scripts/predict_qa.sh, you need to remove the --do_lower_case argument yourself. After removing it, I get results almost identical to the performance reported in the paper.

The flag is passed on line 53 of train_qa.sh and line 63 of predict_qa.sh: https://github.com/google-research/xtreme/blob/5d7e46217397797f287a324c8a1d75857e592885/scripts/train_qa.sh#L50-L71 https://github.com/google-research/xtreme/blob/5d7e46217397797f287a324c8a1d75857e592885/scripts/predict_qa.sh#L59-L66
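To make it concrete why this flag is so damaging for the cased checkpoint, here is a small self-contained illustration (my own example, not part of the repo's scripts): with do_lower_case=True the BERT tokenizer lowercases and, by default, strips accents before WordPiece, so the pieces no longer match what bert-base-multilingual-cased was trained on, which plausibly explains why languages such as el and vi suffer most.

```python
# Small illustration of the effect of do_lower_case on the cased mBERT
# tokenizer; the Greek sentence is just an arbitrary example.
from transformers import BertTokenizer

text = "Πού βρίσκεται το Παρίσι;"  # Greek: "Where is Paris?"

cased = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
lowered = BertTokenizer.from_pretrained(
    "bert-base-multilingual-cased", do_lower_case=True
)

# Keeps the surface form that the cased vocabulary expects.
print(cased.tokenize(text))
# Lowercased and accent-stripped before WordPiece, so the resulting pieces
# match the cased vocabulary poorly.
print(lowered.tokenize(text))
```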

Liangtaiwan commented 4 years ago

Here are the results I got on XQuAD

| Language | F1 | EM |
| --- | --- | --- |
| en | 84.05 | 72.18 |
| es | 75.51 | 56.64 |
| de | 73.97 | 58.24 |
| el | 64.72 | 47.73 |
| ru | 70.85 | 54.29 |
| tr | 54.05 | 39.16 |
| ar | 63.42 | 47.39 |
| vi | 69.39 | 50.34 |
| th | 42.05 | 32.94 |
| zh | 58.25 | 48.99 |
| hi | 58.76 | 44.96 |
hit-computer commented 4 years ago

@Liangtaiwan Hi, I can only find test data in the download/xquad folder, and that data does not have labels. How did you get the above results on XQuAD? Thanks :)

Liangtaiwan commented 4 years ago

@hit-computer You can find the labels here: https://github.com/deepmind/xquad
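In case it helps others reproduce the numbers above, here is a hedged sketch (my own, not the repo's evaluation path) of scoring a run_squad-style predictions file against a gold XQuAD file from that repository using the SQuAD v1.1 metric from the evaluate library; the file paths are placeholders.

```python
# Sketch: compute EM/F1 for one XQuAD language given the gold file from
# https://github.com/deepmind/xquad and a {question_id: answer_text}
# predictions JSON produced by a SQuAD-style prediction script.
import json

import evaluate  # pip install evaluate


def score_xquad(gold_path, pred_path):
    with open(gold_path, encoding="utf-8") as f:
        gold = json.load(f)["data"]
    with open(pred_path, encoding="utf-8") as f:
        preds = json.load(f)

    references, predictions = [], []
    for article in gold:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                references.append({
                    "id": qa["id"],
                    "answers": {
                        "text": [a["text"] for a in qa["answers"]],
                        "answer_start": [a["answer_start"] for a in qa["answers"]],
                    },
                })
                predictions.append({
                    "id": qa["id"],
                    "prediction_text": preds.get(qa["id"], ""),
                })

    squad_metric = evaluate.load("squad")
    return squad_metric.compute(predictions=predictions, references=references)


# Example with placeholder paths: score_xquad("xquad.vi.json", "predictions_vi.json")
```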

hit-computer commented 4 years ago

@Liangtaiwan Thank you very much!

sebastianruder commented 4 years ago

Hi @hit-computer, I've answered in the corresponding issue. Please don't post in other unrelated issues but instead tag people in your issue.

melvinjosej commented 4 years ago

Closing this issue. Please re-open if needed.