huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

run_pl_glue.py (almost equivalent performance with non-english bert models) #8626

Closed timpal0l closed 3 years ago

timpal0l commented 3 years ago

Environment info

Who can help

perhaps @sgugger

## Information

I tested running the GLUE benchmark with a few non-English BERT models: Arabic, Swedish, and Chinese. Models I am using: `asafaya/bert-base-arabic`, `KB/bert-base-swedish-cased`, `bert-base-chinese`.

I receive almost identical results as in [Run PyTorch version](https://github.com/huggingface/transformers/tree/master/examples/text-classification#run-pytorch-version); the scores differ by only a few percentage points per task, and some are even slightly better than with the default `bert-base-cased`. I am not sure this is a bug, but it seems strange that embeddings for languages as far from English as Arabic and Chinese give me very similar results.

The problem arises when using:

* [X] the official example scripts: [run_glue.py](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_glue.py)
* [ ] my own modified scripts: (give details below)

The tasks I am working on is:

* [X] an official GLUE/SQUaD task: GLUE (stsb in this example)
* [ ] my own task or dataset: (give details below)

## To reproduce

I get almost identical results when running a non-English BERT on the GLUE benchmark, in this case on `stsb` using `bert-base-chinese`, `asafaya/bert-base-arabic`, and `KB/bert-base-swedish-cased`.
```
export TASK_NAME=stsb

python run_glue.py \
  --model_name_or_path bert-base-chinese \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --max_seq_length 128 \
  --per_device_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir /tmp/$TASK_NAME/
```

Chinese:

```
11/18/2020 17:10:42 - INFO - __main__ - ***** Eval results stsb *****
11/18/2020 17:10:42 - INFO - __main__ -   eval_loss = 0.8410218954086304
11/18/2020 17:10:42 - INFO - __main__ -   eval_pearson = 0.7922208042884891
11/18/2020 17:10:42 - INFO - __main__ -   eval_spearmanr = 0.7956508384154777
11/18/2020 17:10:42 - INFO - __main__ -   eval_combined_score = 0.7939358213519834
11/18/2020 17:10:42 - INFO - __main__ -   epoch = 3.0
```

Arabic:

```
11/18/2020 17:14:04 - INFO - __main__ - ***** Eval results stsb *****
11/18/2020 17:14:04 - INFO - __main__ -   eval_loss = 0.8082903027534485
11/18/2020 17:14:04 - INFO - __main__ -   eval_pearson = 0.8357733212850804
11/18/2020 17:14:04 - INFO - __main__ -   eval_spearmanr = 0.8386964712863125
11/18/2020 17:14:04 - INFO - __main__ -   eval_combined_score = 0.8372348962856965
11/18/2020 17:14:04 - INFO - __main__ -   epoch = 3.0
```

Swedish:

```
11/18/2020 17:32:26 - INFO - __main__ - ***** Eval results stsb *****
11/18/2020 17:32:26 - INFO - __main__ -   eval_loss = 0.7071832418441772
11/18/2020 17:32:26 - INFO - __main__ -   eval_pearson = 0.8379047445076137
11/18/2020 17:32:26 - INFO - __main__ -   eval_spearmanr = 0.8350383734219187
11/18/2020 17:32:26 - INFO - __main__ -   eval_combined_score = 0.8364715589647662
11/18/2020 17:32:26 - INFO - __main__ -   epoch = 3.0
```

Is this expected behaviour? That is, can the re-adaptation of the embedding matrices work with non-English vocabularies such as Chinese and Arabic because they perhaps contain some Latin characters? With the English model `bert-base-cased` we get Pearson `83.95`, and with the Arabic model `asafaya/bert-base-arabic` Pearson `83.57`.
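(Not part of the original report, but one quick way to test the "Latin characters in the vocab" hypothesis would be to measure what fraction of a tokenizer's vocabulary consists of Latin-script pieces. The snippet below is a minimal sketch over a hypothetical toy vocabulary; with `transformers` installed, the same check could be applied to the real vocab from `AutoTokenizer.from_pretrained("bert-base-chinese").get_vocab()`.)

```python
import re

def latin_fraction(vocab):
    """Fraction of vocabulary pieces made only of Latin letters,
    ignoring the WordPiece '##' continuation prefix and [BRACKETED]
    special tokens."""
    pieces = [t[2:] if t.startswith("##") else t
              for t in vocab
              if not (t.startswith("[") and t.endswith("]"))]
    latin = [p for p in pieces if re.fullmatch(r"[A-Za-z]+", p)]
    return len(latin) / len(pieces)

# Hypothetical toy vocabulary mixing Chinese characters, Latin
# sub-pieces, and BERT-style special tokens.
toy_vocab = ["[CLS]", "[SEP]", "[UNK]", "中", "国", "学",
             "the", "##ing", "bert", "##s"]
print(latin_fraction(toy_vocab))  # → 0.5714285714285714 (4 of 7 pieces)
```

A high fraction for a nominally non-English vocabulary would support the idea that the model ships with usable embeddings for English sub-words.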

Thanks!

Expected behavior

Not sure..

joeddav commented 3 years ago

I can't say with certainty, but I actually think it's entirely feasible that this is a legitimate result. Here's a recent ACL paper showing that a monolingual model can be fine-tuned on another language with competitive performance. The authors do learn a new embedding layer for the new target language in an intermediate pre-training step, so it's not entirely the same setup, but I wouldn't find this result too surprising. It's also likely that these non-English models had exposure to some English that wasn't scrubbed from their pre-training corpora, in which case the model might already have decent embeddings for tokens sourced from English text just from pre-training.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale and closed because it has not had recent activity. Thank you for your contributions.

If you think this still needs to be addressed please comment on this thread.