Hi Max, Thanks for your interest. For training BERT models on the QA tasks, we actually used the original BERT codebase as that was faster with Google infrastructure (see Appendix B in the paper). I'll check that the same results can be obtained with Transformers and will get back to you.
Thanks, Sebastian!
Interesting to see if there were differences in hparams that caused such a difference. I can immediately see several choices that are hardcoded in Google's codebase and differ from what you pass in the Transformers version:
1) linear learning rate decay in Transformers vs. polynomial LR decay in Google's script
2) weight_decay=0.0001 in Transformers vs. weight_decay_rate=0.01 in Google's script
3) adam epsilon=1e-8 in Transformers vs. 1e-6 in Google's script
So unless you manually changed these values in Google's script, these are some of the notable differences.
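For reference, here is a rough sketch (not the repo's actual training code; the function name and hyper-parameter values are just illustrative) of how the optimizer and schedule could be set up on the Transformers side to mirror those hard-coded Google defaults, assuming a Transformers version that provides get_polynomial_decay_schedule_with_warmup:

```python
import torch
from transformers import get_polynomial_decay_schedule_with_warmup


def build_optimizer(model, lr=3e-5, num_train_steps=10000, warmup_ratio=0.1):
    # Google's AdamWeightDecayOptimizer excludes LayerNorm and bias terms
    # from weight decay; replicate that grouping here.
    no_decay = ["bias", "LayerNorm.weight"]
    grouped_params = [
        {"params": [p for n, p in model.named_parameters()
                    if not any(nd in n for nd in no_decay)],
         "weight_decay": 0.01},   # Google's default, vs. 0.0001 passed in this repo
        {"params": [p for n, p in model.named_parameters()
                    if any(nd in n for nd in no_decay)],
         "weight_decay": 0.0},
    ]
    # Google's script uses Adam epsilon 1e-6 rather than the 1e-8 default here.
    optimizer = torch.optim.AdamW(grouped_params, lr=lr, eps=1e-6)
    # Polynomial decay of the learning rate down to zero, as in Google's codebase.
    scheduler = get_polynomial_decay_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(warmup_ratio * num_train_steps),
        num_training_steps=num_train_steps,
        lr_end=0.0,
        power=1.0,
    )
    return optimizer, scheduler
```

Whether each of these settings actually accounts for the gap would of course still need to be verified.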
Meanwhile, I would also like to confirm that the issue is only related to mBERT, since for XLM-R I got the following avg numbers: 76.7 / 61.0, which is on par with 76.6 / 60.8 from the paper.
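(For clarity, these averages are computed as simple macro-averages of F1 / EM over the XQuAD languages, assuming that is also what the paper reports; a minimal sketch with made-up per-language numbers:)

```python
# Macro-average of per-language XQuAD results.
# The numbers below are placeholders, not actual results from this thread.
results = {
    "en": {"exact_match": 72.2, "f1": 84.1},
    "es": {"exact_match": 56.6, "f1": 75.5},
    # ... one entry per XQuAD language ...
}

avg_em = sum(r["exact_match"] for r in results.values()) / len(results)
avg_f1 = sum(r["f1"] for r in results.values()) / len(results)
print(f"avg F1 / EM: {avg_f1:.1f} / {avg_em:.1f}")
```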
Thanks for the note, Max. Yes, these are some of the settings that should probably explain the difference in performance. Yes, for XLM-R we went with the implementation (and the default hyper-parameters) in Transformers, so this should work out-of-the-box as expected.
@maksym-del, @sebastianruder If you use scripts/train_qa.sh and scripts/predict_qa.sh, you should remove the --do_lower_case argument yourself. After removing the argument, I can get results almost the same as the performance in the paper.
See line 53 and line 63: https://github.com/google-research/xtreme/blob/5d7e46217397797f287a324c8a1d75857e592885/scripts/train_qa.sh#L50-L71 https://github.com/google-research/xtreme/blob/5d7e46217397797f287a324c8a1d75857e592885/scripts/predict_qa.sh#L59-L66
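To see why the flag matters with a cased checkpoint, here is a quick illustrative check (not part of the repo's scripts) of how lowercasing changes what the cased multilingual tokenizer produces:

```python
# Illustrative only: simulate the effect of --do_lower_case by lowercasing the
# input before tokenization with the *cased* multilingual BERT tokenizer.
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
text = "Berlin ist die Hauptstadt Deutschlands"

print(tok.tokenize(text))          # matches the casing the model was pretrained on
print(tok.tokenize(text.lower()))  # roughly what the model sees with --do_lower_case
```

The lowercased variant is generally split into different subwords that no longer match how the cased model saw the text during pretraining, which is consistent with the score drop.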
Here are the results I got on XQuAD:
en {"exact_match": 72.18487394957984, "f1": 84.05491660467752}
es {"exact_match": 56.63865546218487, "f1": 75.50683844229154}
de {"exact_match": 58.23529411764706, "f1": 73.97330302393942}
el {"exact_match": 47.73109243697479, "f1": 64.71526367876008}
ru {"exact_match": 54.285714285714285, "f1": 70.85210687094488}
tr {"exact_match": 39.15966386554622, "f1": 54.04959679389641}
ar {"exact_match": 47.39495798319328, "f1": 63.42460795613208}
vi {"exact_match": 50.33613445378151, "f1": 69.39497841433942}
th {"exact_match": 32.94117647058823, "f1": 42.04649738683358}
zh {"exact_match": 48.99159663865546, "f1": 58.25216753368008}
hi {"exact_match": 44.95798319327731, "f1": 58.764676794694026}
@Liangtaiwan Hi, I can only find test data in the download/xquad folder, and it does not have labels. How did you get the above results on XQuAD? Thanks :)
@hit-computer You can find the labels here: https://github.com/deepmind/xquad
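For example, a quick illustrative way to fetch one of the gold files programmatically (assuming the xquad.<lang>.json naming used in that repository):

```python
# Illustrative sketch: download a gold XQuAD file (which includes answers)
# directly from the deepmind/xquad repository.
import json
import urllib.request

lang = "vi"
url = f"https://raw.githubusercontent.com/deepmind/xquad/master/xquad.{lang}.json"
with urllib.request.urlopen(url) as f:
    data = json.load(f)

# Each file follows the SQuAD v1.1 format, including answer spans.
print(data["data"][0]["paragraphs"][0]["qas"][0]["answers"][0])
```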
@Liangtaiwan Thank you very much!
Hi @hit-computer, I've answered in the corresponding issue. Please don't post in other unrelated issues but instead tag people in your issue.
Closing this issue. Please re-open if needed.
Hi, thanks for the benchmark and the accompanying code!
I am trying to replicate the XQuAD scores from the XTREME paper using this repo's code. I run the mBERT cased model with default parameters and strictly follow the instructions in the README file. However, the results for some languages are much lower than the scores from the paper. In particular, for vi and th the gap is two-fold. There is also a significant drop for hi and el. The en, es, de results, on the other hand, are comparable. Below I provide a table with the scores I just obtained from running the code, together with the corresponding numbers from the paper. @sebastianruder, could I please ask you to take a look at it?