google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

How is BERT able to do zero shot transfer on XNLI dataset? #457

Open akshaykgupta opened 5 years ago

akshaykgupta commented 5 years ago

Looking at the README for the multilingual BERT model, the zero-shot results on the XNLI dataset (i.e., fine-tuning only on English MultiNLI and evaluating directly on the other-language XNLI test sets) are quite decent, especially for languages closer to English such as Spanish and German. How is BERT able to map two sentences in different languages to similar embeddings? Based on how the pre-training works, I don't see any reason why it should, as opposed to, say, the LASER model from FAIR, which was trained for exactly that purpose. Would greatly appreciate some insight on this.

suned commented 5 years ago

So without any response, maybe we can start coming up with our own theories 😄

The only transfer mechanism I can identify comes from the shared WordPiece vocabulary. In cases where words with similar meaning in different languages share a common subword fragment, the representation learnt for that fragment in one language should transfer to the others.
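
One way to poke at this is to run the repo's own tokenizer over a few cognate sentences and see which pieces they share. A minimal sketch, assuming `tokenization.py` from this repo is importable and the multilingual cased vocab file sits at the path below (the path and example sentences are just illustrative):

```python
# Minimal sketch: inspect which WordPiece fragments cognate sentences share.
# Assumes tokenization.py from this repo is on the path and the multilingual
# cased checkpoint has been unpacked locally (the vocab path is an assumption).
import tokenization

tokenizer = tokenization.FullTokenizer(
    vocab_file="multi_cased_L-12_H-768_A-12/vocab.txt",
    do_lower_case=False)

sentences = [
    "The international organization",   # English
    "La organización internacional",    # Spanish
    "Die internationale Organisation",  # German
]
for sent in sentences:
    # Pieces that recur across languages point at the same embedding rows.
    print(tokenizer.tokenize(sent))
```

Any piece that shows up in more than one language indexes the same row of the embedding matrix, so masked-LM updates from each language touch shared parameters, which would be the transfer channel described above.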

This is probably not the whole story though, since e.g. the WordPiece vocabulary used for Chinese probably doesn't have much overlap with the other languages 🤔
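
That lack of overlap is easy to check with the same tokenizer. Another rough sketch under the same assumptions as above (the translation pair is illustrative, not from XNLI):

```python
# Rough sketch: Chinese is split into single characters before WordPiece,
# so its pieces rarely coincide with those of European languages.
# Same assumptions as the previous sketch (vocab path, tokenization.py on path).
import tokenization

tokenizer = tokenization.FullTokenizer(
    vocab_file="multi_cased_L-12_H-768_A-12/vocab.txt",
    do_lower_case=False)

en_pieces = set(tokenizer.tokenize("The international organization met today"))
zh_pieces = set(tokenizer.tokenize("国际组织今天开会"))

print("English pieces:", sorted(en_pieces))
print("Chinese pieces:", sorted(zh_pieces))
print("Shared pieces: ", en_pieces & zh_pieces)  # typically empty
```

If the intersection is essentially empty, whatever transfer we see between English and Chinese can't be explained by shared subwords alone.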