akshaykgupta opened this issue 5 years ago
So, without any response, maybe we can start coming up with our own theories 😄
The only transfer mechanism I can identify comes from the shared subword (SentencePiece-style) vocabulary. When words with similar meaning in different languages share a common fragment, the representation learnt for that fragment in one language should transfer to the other.
This is probably not the whole story though, since e.g. the Chinese part of the vocabulary probably doesn't have much overlap with other languages 🤔
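One quick way to sanity-check the overlap idea is to run a few translation pairs through the multilingual tokenizer and see which pieces they share. A minimal sketch, assuming the Hugging Face `transformers` library (not something mentioned in this thread):

```python
# Minimal sketch (assumes the Hugging Face `transformers` library) to inspect
# how much subword overlap the shared multilingual vocabulary gives related words.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

# Rough translations of the same word; shared pieces in the output are
# candidates for the cross-lingual transfer described above.
for word in ["information", "información", "informazione", "信息"]:
    print(f"{word:>12} -> {tokenizer.tokenize(word)}")
```

Running something like this tends to show far more overlap for languages that share a script with English than for Chinese, which fits the caveat above.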
Looking at the README for the multilingual BERT model, I see that the zero-shot results on the XNLI dataset are quite decent, especially for languages closer to English like Spanish and German. How is BERT able to map two sentences in different languages to a similar embedding? Based on how the pre-training takes place, I don't see any reason why it should do that, as opposed to, say, the LASER model from FAIR, which was trained for that specific purpose. Would greatly appreciate some insight on this.
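To make the question concrete, one could embed a sentence and its translation with multilingual BERT and measure how close they end up. A rough sketch, assuming `transformers` and PyTorch (neither named in this thread) and using mean-pooling of the last hidden states, which is just one pooling choice among several:

```python
# Rough sketch (assumes `transformers` and PyTorch): does multilingual BERT
# map a sentence and its translation to nearby points in embedding space?
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

def embed(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state       # (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)        # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean over real tokens

en = embed("The cat is sleeping on the sofa.")
es = embed("El gato está durmiendo en el sofá.")         # Spanish translation
control = embed("I bought a new bicycle yesterday.")     # unrelated sentence

print("en/es (translation):", torch.nn.functional.cosine_similarity(en, es).item())
print("en/unrelated       :", torch.nn.functional.cosine_similarity(en, control).item())
```

If translation pairs come out consistently closer than unrelated pairs, that is at least evidence of some language-neutral signal in the representations, even though nothing in the pre-training objective explicitly optimises for it.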