google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

BERT multilingual for zero-shot classification #577

Open ramild opened 5 years ago

ramild commented 5 years ago

Hi! I'm interested in solving a classification problem where I train the model on one language and make predictions for another one (zero-shot classification).

It is said in the README for the multilingual BERT model (https://github.com/google-research/bert/blob/master/multilingual.md) that:

For tokenization, we use a 110k shared WordPiece vocabulary. The word counts are weighted the same way as the data, so low-resource languages are upweighted by some factor. We intentionally do not use any marker to denote the input language (so that zero-shot training can work).

But after fine-tuning BERT-multilingual-uncased on a dataset in one language, it simply doesn't work on texts in other languages. The predictions are inadequate: I tried multiple pairs (a text and the same text translated into another language), and the probability distributions over the labels (after applying softmax) were wildly different.

Then I also tried bert-multilingual-cased, but the results are still bad. A number of very simple (text, translated text) pairs get very different probability distributions (the translated versions almost always fall into one dominant category).

Specifically, I fine-tune the pre-trained bert-multilingual-cased on a Russian text classification problem and then use the model to predict on English texts (I tried German, Spanish and Italian as well; nothing works).
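As a sketch of my setup: a custom DataProcessor for run_classifier.py along these lines, where train_ru.tsv / test_en.tsv and the binary label set are placeholders for my actual data, not the exact code I run:

```python
import os

import tokenization
from run_classifier import DataProcessor, InputExample


class CrossLingualProcessor(DataProcessor):
  """Fine-tunes on Russian rows, then evaluates zero-shot on English rows."""

  def get_train_examples(self, data_dir):
    # Russian training split: one "label<TAB>text" row per line (placeholder name).
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "train_ru.tsv")), "train")

  def get_dev_examples(self, data_dir):
    # English evaluation split in the same format -- this is the zero-shot part.
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "test_en.tsv")), "dev")

  def get_labels(self):
    return ["0", "1"]  # placeholder label set

  def _create_examples(self, lines, set_type):
    examples = []
    for (i, line) in enumerate(lines):
      guid = "%s-%d" % (set_type, i)
      label = tokenization.convert_to_unicode(line[0])
      text_a = tokenization.convert_to_unicode(line[1])
      examples.append(InputExample(guid=guid, text_a=text_a, label=label))
    return examples
```

The processor is registered under a new --task_name entry in the processors dict in run_classifier.py's main(), and the remaining flags (--vocab_file, --bert_config_file, --init_checkpoint, --do_train, --do_predict, ...) point at the multilingual checkpoint as usual.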

Do you know what could be causing the problem? Should I somehow change the tokenization when applying the model to other languages (the WordPiece vocabulary and embeddings are shared, so I'm not sure about this one)?
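For concreteness, this is how I tokenize in both cases. As far as I can tell the shared vocabulary covers every language, so I left the tokenizer unchanged; the path is just where I unzipped multi_cased_L-12_H-768_A-12:

```python
import tokenization  # from this repo

# One shared vocab.txt for all languages; do_lower_case=False for the cased model.
tokenizer = tokenization.FullTokenizer(
    vocab_file="multi_cased_L-12_H-768_A-12/vocab.txt",
    do_lower_case=False)

# Both sentences are split into WordPieces from the same 110k vocabulary,
# and no language marker is added anywhere in the input.
print(tokenizer.tokenize("The weather is nice today."))
print(tokenizer.tokenize("Сегодня хорошая погода."))
```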

dengyuning commented 5 years ago

I've run into the same problem. My WeChat ID is cherryuuuu3; maybe we can discuss it.

wanicca commented 5 years ago

I wonder if BERT's multilingual representations can perform like other multilingual embeddings that are obtained by aligning monolingual embeddings (like MUSE). That is to say, do the synonyms in a parallel sentence pair in different languages have analogous vector representations? Is the multilingual BERT model cross-lingual, or just a multilingual model that can accept different languages as input? I read the multilingual README and didn't find any clue about a cross-lingual setting.
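For what it's worth, a quick way to probe this with the code in this repo seems to be: put a sentence and its translation on two lines of an input file, run extract_features.py on it with the multilingual checkpoint (--layers=-1), and compare mean-pooled last-layer vectors. A rough sketch, where parallel_pair_output.jsonl is just whatever was passed to --output_file:

```python
import json

import numpy as np


def mean_pooled(json_line):
  """Mean-pools the requested layer over all tokens of one output record."""
  record = json.loads(json_line)
  vectors = [f["layers"][0]["values"] for f in record["features"]]
  return np.mean(np.asarray(vectors), axis=0)


# One JSON line per input sentence, in input order (here: English, then Russian).
with open("parallel_pair_output.jsonl") as fh:
  en_vec, ru_vec = [mean_pooled(line) for line in fh]

cosine = np.dot(en_vec, ru_vec) / (np.linalg.norm(en_vec) * np.linalg.norm(ru_vec))
print("cosine similarity (en vs ru):", cosine)
```

Raw cosine similarities from BERT tend to be high across the board, so comparing a translation pair against an unrelated pair is probably more telling than the absolute number.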