studio-ousia / luke

LUKE -- Language Understanding with Knowledge-based Embeddings
Apache License 2.0

Entity Mapping Preprocessing #169

Open kimwongyuda opened 1 year ago

kimwongyuda commented 1 year ago

Hi, first of all, thank you for the nice work.

Let's take the below input example.

"Everaldo has played for Guarani and Santa Cruz in the Campeonato Brasileiro, before moving to Mexico where he played for Chiapas and Necaxa." , entity: Guarani .

When training the model on this input, a [MASK] token is used to mask the entity Guarani. The model is then trained to predict [MASK] as Guarani via cross-entropy loss.
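For illustration, here is a minimal sketch (not LUKE's actual code) of masked entity prediction with cross-entropy; the vocabulary size, hidden size, and entity id below are made up:

```python
import torch
import torch.nn as nn

entity_vocab_size = 500_000   # hypothetical entity vocabulary size
hidden_size = 768

# Classifier over the entity vocabulary, applied to the [MASK] entity representation.
entity_classifier = nn.Linear(hidden_size, entity_vocab_size)
loss_fn = nn.CrossEntropyLoss()

masked_entity_hidden = torch.randn(1, hidden_size)  # contextual representation of the [MASK] entity
gold_entity_id = torch.tensor([12345])              # e.g. the id of "Guarani FC" in the vocabulary

logits = entity_classifier(masked_entity_hidden)    # shape: (1, entity_vocab_size)
loss = loss_fn(logits, gold_entity_id)              # cross entropy against the gold entity id
```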

However, when we inspect entity_vocab.json, there is no entry "Guarani". The vocabulary only contains "Guarani language", "Guarani FC", "Tupi–Guarani languages", and "Guarani mythology". In this example, I believe Guarani refers to Guarani FC.
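For reference, this is one way to check which entries exist, assuming entity_vocab.json is a flat JSON object mapping entity titles to ids (the exact format may differ between releases):

```python
import json

# Load the entity vocabulary and list all titles containing "Guarani".
# Assumes a flat {title: id} mapping; adjust if your release uses another format.
with open("entity_vocab.json") as f:
    entity_vocab = json.load(f)

guarani_entries = {title: idx for title, idx in entity_vocab.items() if "Guarani" in title}
print(guarani_entries)
# e.g. {"Guarani language": ..., "Guarani FC": ..., "Tupi–Guarani languages": ..., "Guarani mythology": ...}
```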

So, is the model trained to predict [MASK] as Guarani FC? If so, we need to let the model know that the mention Guarani refers to Guarani FC; in other words, I guess we need to map the mention Guarani to the entity Guarani FC.

Does the preprocessing described in https://github.com/studio-ousia/luke/blob/master/pretraining.md deal with such issues?

Thank you.

ryokan0123 commented 1 year ago

Hi @kimwongyuda,

The pretraining data of LUKE is constructed from Wikipedia text, which is already annotated with ground-truth entities (each hyperlinked mention points to a specific Wikipedia page). So the ambiguity of entity mentions is not an issue in pretraining.

In your example, if "Guarani" is not in the entity vocabulary, the answer entity for [MASK] becomes [UNK], or such entities are simply ignored, depending on the setting.
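To make that concrete, here is a hedged sketch (not the actual LUKE preprocessing; the function and vocabulary below are hypothetical) of how a hyperlink annotation could be resolved to an entity id, with missing entities falling back to [UNK] or being skipped:

```python
UNK_TOKEN = "[UNK]"

def resolve_entity_id(link_target, entity_vocab, ignore_unknown=False):
    """Return the vocab id of a hyperlink target, the [UNK] id, or None to skip it."""
    if link_target in entity_vocab:
        return entity_vocab[link_target]
    if ignore_unknown:
        return None                      # drop this mention from the training signal
    return entity_vocab[UNK_TOKEN]       # fall back to the [UNK] entity

# The mention text is "Guarani", but the hyperlink in the Wikipedia source
# already points to the page "Guarani FC", so no disambiguation is needed.
entity_vocab = {"[UNK]": 1, "Guarani FC": 42}
print(resolve_entity_id("Guarani FC", entity_vocab))         # 42
print(resolve_entity_id("Some Missing Page", entity_vocab))  # 1 -> [UNK]
```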