huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

BERT multilingual for zero-shot classification #487

Closed ramild closed 5 years ago

ramild commented 5 years ago

Hi! I'm interested in solving a classification problem in which I train the model on one language and make the predictions for another one (zero-shot classification).

It is said in the README for the multilingual BERT model (https://github.com/google-research/bert/blob/master/multilingual.md) that:

For tokenization, we use a 110k shared WordPiece vocabulary. The word counts are weighted the same way as the data, so low-resource languages are upweighted by some factor. We intentionally do not use any marker to denote the input language (so that zero-shot training can work).

But after fine-tuning BERT-multilingual-uncased on a dataset in one language, it absolutely doesn't work for texts in other languages. The predictions turn out to be inadequate: I tried multiple pairs (text, the same text translated into another language) and the probability distributions over labels (after applying softmax) were wildly different.

Do you know what the cause of the problem might be? Should I somehow change the tokenization when applying the model to other languages (the BPE embeddings are shared, so I'm not sure about this one)? Or should I use multilingual-cased instead of multilingual-uncased (is it possible that's the source of the problem)?
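For reference, a minimal sketch of what "shared tokenization" looks like in practice, assuming the current transformers `AutoTokenizer` API (this issue predates it): the same multilingual tokenizer is applied to every language, with no language marker, so no per-language change to tokenization should be needed.

```python
from transformers import AutoTokenizer

# Shared 110k WordPiece vocabulary: one tokenizer handles every language,
# and no marker denoting the input language is added.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

print(tokenizer.tokenize("The movie was surprisingly good."))
print(tokenizer.tokenize("Фильм оказался на удивление хорошим."))  # same vocab, no special handling
```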

ramild commented 5 years ago

UPD. I tried it with bert-multilingual-cased, but the results are still bad. A number of very simple (text, translated text) pairs give very different probability distributions (the translated versions almost always fall into one major category).

Specifically, I fine-tune the pre-trained bert-multilingual-cased on a Russian text classification problem and then use the model to make predictions on English text (I tried other languages as well -- nothing works).
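A minimal sketch of the comparison described above, assuming the current transformers API and a hypothetical path (`./mbert-russian-clf`) where the Russian-fine-tuned checkpoint was saved:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical path to the model fine-tuned on the Russian classification data.
checkpoint = "./mbert-russian-clf"
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
model.eval()

texts = {
    "ru": "Обслуживание было ужасным, больше не приду.",
    "en": "The service was terrible, I will not come back.",  # same sentence, translated
}

for lang, text in texts.items():
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)
    # If zero-shot transfer worked, these two distributions should be close;
    # in my experiments they are wildly different.
    print(lang, probs.squeeze().tolist())
```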

thomwolf commented 5 years ago

Hi, my feeling is that this is still an open research problem.

Here is a recent thread discussing the related problem of fine-tuning BERT on English SQuAD and trying to do QA in another language. Maybe you can get a pre-print from the RecitalAI guys if they haven't published it yet.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.