studio-ousia / luke

LUKE -- Language Understanding with Knowledge-based Embeddings
Apache License 2.0

Entity Mapping Preprocessing #169

Open kimwongyuda opened 1 year ago

kimwongyuda commented 1 year ago

Hi, first of all, thank you for the nice work.

Let's take the below input example.

"Everaldo has played for Guarani and Santa Cruz in the Campeonato Brasileiro, before moving to Mexico where he played for Chiapas and Necaxa." , entity: Guarani .

When training the model on this input, a [MASK] token is used to mask the entity Guarani. The model is then trained to predict [MASK] as Guarani via cross-entropy loss.
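For illustration, here is a minimal sketch (not LUKE's actual code) of masked entity prediction with cross-entropy; the vocabulary size, hidden size, and entity id below are made up:

```python
import torch
import torch.nn as nn

entity_vocab_size = 500_000   # hypothetical entity vocabulary size
hidden_size = 768

# Classifier over the entity vocabulary, applied to the [MASK] entity representation.
entity_classifier = nn.Linear(hidden_size, entity_vocab_size)
loss_fn = nn.CrossEntropyLoss()

masked_entity_hidden = torch.randn(1, hidden_size)  # contextual representation of the [MASK] entity
gold_entity_id = torch.tensor([12345])              # e.g. the id of "Guarani FC" in the vocabulary

logits = entity_classifier(masked_entity_hidden)    # shape: (1, entity_vocab_size)
loss = loss_fn(logits, gold_entity_id)              # cross entropy against the gold entity id
```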

However, when we inspect entity_vocab.json, there is no entry "Guarani". The vocabulary only contains "Guarani language", "Guarani FC", "Tupi–Guarani languages", and "Guarani mythology". In this example, I believe Guarani refers to Guarani FC.
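For reference, this is one way to check which entries exist, assuming entity_vocab.json is a flat JSON object mapping entity titles to ids (the exact format may differ between releases):

```python
import json

# Load the entity vocabulary and list all titles containing "Guarani".
# Assumes a flat {title: id} mapping; adjust if your release uses another format.
with open("entity_vocab.json") as f:
    entity_vocab = json.load(f)

guarani_entries = {title: idx for title, idx in entity_vocab.items() if "Guarani" in title}
print(guarani_entries)
# e.g. {"Guarani language": ..., "Guarani FC": ..., "Tupi–Guarani languages": ..., "Guarani mythology": ...}
```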

So, is the model trained to predict [MASK] as Guarani FC? If so, we need to let the model know that the mention Guarani refers to Guarani FC; in other words, I guess we need to map the mention Guarani to the entity Guarani FC.

Does the preprocessing described in https://github.com/studio-ousia/luke/blob/master/pretraining.md deal with such issues?

Thank you.

ryokan0123 commented 1 year ago

Hi @kimwongyuda,

The pretraining data of LUKE is constructed from Wikipedia text, which is already annotated with ground-truth entities (each hyperlinked mention points to a specific Wikipedia page). So the ambiguity of entity mentions is not an issue in pretraining.

In your example, if "Guarani" is not in the entity vocabulary, the answer entity for [MASK] becomes [UNK], or such entities are simply ignored, depending on the setting.
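To make that concrete, here is a hedged sketch (not the actual LUKE preprocessing; the function and vocabulary below are hypothetical) of how a hyperlink annotation could be resolved to an entity id, with missing entities falling back to [UNK] or being skipped:

```python
UNK_TOKEN = "[UNK]"

def resolve_entity_id(link_target, entity_vocab, ignore_unknown=False):
    """Return the vocab id of a hyperlink target, the [UNK] id, or None to skip it."""
    if link_target in entity_vocab:
        return entity_vocab[link_target]
    if ignore_unknown:
        return None                      # drop this mention from the training signal
    return entity_vocab[UNK_TOKEN]       # fall back to the [UNK] entity

# The mention text is "Guarani", but the hyperlink in the Wikipedia source
# already points to the page "Guarani FC", so no disambiguation is needed.
entity_vocab = {"[UNK]": 1, "Guarani FC": 42}
print(resolve_entity_id("Guarani FC", entity_vocab))         # 42
print(resolve_entity_id("Some Missing Page", entity_vocab))  # 1 -> [UNK]
```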