studio-ousia / luke

LUKE -- Language Understanding with Knowledge-based Embeddings
Apache License 2.0

Questions about entity embedding #104

Closed hedonihilist closed 2 years ago

hedonihilist commented 2 years ago

Hi all,

I have trouble understanding the meaning of entity_ids in the code.

https://github.com/studio-ousia/luke/blob/5023a8a4d534c6ae8cecc7c308f65bd3b078aa32/examples/ner/utils.py#L192

In the NER example code, each entity_id is either 0 or 1. What do 0 and 1 mean?

I am trying to obtain the embeddings of the entities in a text (the entity positions can be resolved by external tools). How should I construct the entity_ids?

ikuyamada commented 2 years ago

In the NER example, the entity embedding matrix is reconstructed here, so index 0 and index 1 correspond to the padding and [MASK] entity tokens, respectively. Entity embeddings should be obtained by inputting [MASK] entity token(s) to the model.
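For concreteness, here is a minimal sketch of how such inputs could be built following the conventions of the NER example; the mention spans and the max_entities / max_mention_length values are illustrative assumptions, not values from the example code:

```python
import torch

# In the NER example's reconstructed entity vocabulary:
#   index 0 = padding, index 1 = [MASK]
PAD_ENTITY_ID = 0
MASK_ENTITY_ID = 1

# Suppose we want representations for two mentions found by an external tool.
max_entities = 4          # illustrative padding length
max_mention_length = 3    # illustrative maximum span length

# One [MASK] entity per mention, padded with the padding id.
entity_ids = [MASK_ENTITY_ID, MASK_ENTITY_ID] + [PAD_ENTITY_ID] * (max_entities - 2)

# For each entity, the word-token positions of its mention, padded with -1.
entity_position_ids = [
    [2, 3, -1],                  # first mention covers word tokens 2-3
    [7, -1, -1],                 # second mention is word token 7
    [-1] * max_mention_length,   # padding entity
    [-1] * max_mention_length,   # padding entity
]

entity_ids = torch.tensor([entity_ids])                    # (1, max_entities)
entity_position_ids = torch.tensor([entity_position_ids]) # (1, max_entities, max_mention_length)
```

The representations of the two mentions would then be read off the model's entity output at indices 0 and 1.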

hedonihilist commented 2 years ago

Thanks for your reply.

If I understand it correctly, the meaning of entity_ids is task-related. In the case of NER, they represent [MASK] or padding; in the case of relation extraction, they represent the head and tail entities. Right?

If I want to get the embeddings of a variable number of entities in a sentence, how can I achieve this? If I input multiple [MASK] entity tokens, how can I tell which is which?

ikuyamada commented 2 years ago

In the NER task, we input multiple [MASK] entities to the model to compute the entity representations for an input text. If you input multiple [MASK] entities, the model can treat them differently as long as their entity_position_ids differ. If you need to input entity type information (e.g., HEAD or TAIL in relation classification), you can create new entity tokens representing the entity types and initialize their embeddings using the token embedding of the [MASK] entity.
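A minimal sketch of that initialization, assuming the NER-style two-entry vocabulary and standing in for the model's actual entity embedding table with a fresh nn.Embedding (the [HEAD]/[TAIL] ids, the embedding size, and the spans below are all hypothetical):

```python
import torch
import torch.nn as nn

# Sketch: extend an entity embedding table with new entity-type tokens
# ([HEAD], [TAIL]) whose rows are initialized from the [MASK] embedding.
entity_emb_size = 256                        # assumption: matches the checkpoint
old_emb = nn.Embedding(2, entity_emb_size)   # 0 = padding, 1 = [MASK] (NER-style vocab)

MASK_ID = 1
new_emb = nn.Embedding(4, entity_emb_size)   # adds 2 = [HEAD], 3 = [TAIL]
with torch.no_grad():
    new_emb.weight[:2] = old_emb.weight            # keep the existing rows
    new_emb.weight[2] = old_emb.weight[MASK_ID]    # init [HEAD] from [MASK]
    new_emb.weight[3] = old_emb.weight[MASK_ID]    # init [TAIL] from [MASK]

# The two entities are then distinguished both by their ids and by their
# entity_position_ids (word-token positions, padded with -1):
entity_ids = torch.tensor([[2, 3]])                # [HEAD], [TAIL]
entity_position_ids = torch.tensor([[[1, 2, -1],   # head spans word tokens 1-2
                                     [6, -1, -1]]])  # tail is word token 6
```

In the actual model the table to extend would be the pretrained entity embedding (in this repository, model.entity_embeddings.entity_embeddings), rather than a freshly created one as in this sketch.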

hedonihilist commented 2 years ago

Thanks!