studio-ousia / luke

LUKE -- Language Understanding with Knowledge-based Embeddings
Apache License 2.0
705 stars 102 forks

Entities that are not in wikipedia #143

Closed emilysilcock closed 2 years ago

emilysilcock commented 2 years ago

Hello,

Thanks for this great repo.

Perhaps this is indicative of some fundamental misunderstanding on my part, but I was wondering if you could give me an idea of what LUKE would do if it came across a mention of an entity that does not exist in Wikipedia. Will it always try to match this to the closest entity or is there some way of saying there are no close entities?

Many thanks

ryokan0123 commented 2 years ago

Hi,

Thanks for your interest! Let me answer your questions.

LUKE can only handle entities that are in LUKE's entity vocabulary. For example, the luke-base model has the top 500K entities from the English Wikipedia in its entity vocabulary, and it cannot handle entities outside that vocabulary.
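A minimal sketch of what "in the entity vocabulary" means in practice: a membership check against the model's entity-title-to-id mapping. The tiny dictionary below is a toy stand-in for illustration only; the real vocabulary (roughly 500K Wikipedia titles for luke-base) ships with the pretrained model.

```python
# Toy stand-in for LUKE's entity vocabulary (title -> entity id).
# The real one is loaded with the pretrained tokenizer/model.
entity_vocab = {"[PAD]": 0, "[UNK]": 1, "[MASK]": 2,
                "Tokyo": 3, "Beyoncé": 4}

def is_in_vocab(entity_title: str) -> bool:
    """True if the model has a dedicated embedding for this entity."""
    return entity_title in entity_vocab

print(is_in_vocab("Tokyo"))            # True
print(is_in_vocab("Some Obscure Band"))  # False
```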

> what LUKE would do if it came across a mention of an entity that does not exist in Wikipedia. Will it always try to match this to the closest entity or is there some way of saying there are no close entities?

You need to detect and disambiguate entities in the input text as a preprocessing step before feeding the input to LUKE. LUKE does not do entity linking or matching for you; it is up to you to do that and to deal with out-of-vocabulary entities.

emilysilcock commented 2 years ago

Thanks for the extremely quick reply. I understand that it couldn't (correctly) match out-of-vocabulary entities, but would it try to do so and come up with an incorrect prediction, or would it just ignore them (i.e., make no prediction)?

Thanks!

ryokan0123 commented 2 years ago

> would it try to do so and come up with an incorrect prediction, or would it just ignore them

It is totally up to the preprocessing step. The model just makes predictions for whatever is given as input.

So, if you decide not to include OOV entities in the input, the LUKE model will simply ignore them. Alternatively, you can convert those OOV entities to [UNK] or [MASK] tokens and include them in the input; in that case, the model will try to make predictions for them.
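The two options above can be sketched as one preprocessing helper. This is a hypothetical illustration, not part of the LUKE codebase: the function either drops OOV entities or maps them to the [UNK] entity token before they reach the model.

```python
def map_oov_entities(entities, entity_vocab, drop=False):
    """Handle entities missing from the vocabulary:
    drop=True  -> leave them out of the input entirely (model ignores them)
    drop=False -> replace them with [UNK] (model still predicts for them)."""
    out = []
    for e in entities:
        if e in entity_vocab:
            out.append(e)
        elif not drop:
            out.append("[UNK]")
    return out

vocab = {"Tokyo", "Beyoncé", "[UNK]", "[MASK]"}
print(map_oov_entities(["Tokyo", "Obscure Entity"], vocab))             # ['Tokyo', '[UNK]']
print(map_oov_entities(["Tokyo", "Obscure Entity"], vocab, drop=True))  # ['Tokyo']
```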

FYI, luke-base and luke-large are trained with [UNK] entity tokens, so they can handle OOV entities to some extent.

emilysilcock commented 2 years ago

Thanks, that's super helpful to know.

I guess it is unclear to me how you could ever know that an entity was OOV prior to disambiguation (in an inference set). Surely if you had some way to know whether it was in the vocabulary or not, the work of entity disambiguation is already 90% done? Maybe this is just an open question!

ryokan0123 commented 2 years ago

> I guess it is unclear to me how you could ever know that an entity was OOV prior to disambiguation (in an inference set).

Actually, the LUKE model assumes that all entities in the input are either correctly disambiguated or converted into [UNK] or [MASK] tokens. For example, in pretraining we use Wikipedia articles, whose sentences contain hyperlinks that can be regarded as unambiguous entity annotations. In relation classification, we simply convert the head and tail entities into [MASK] tokens to make predictions.
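A conceptual sketch of the relation-classification setup described above (not the actual LUKE API): the head and tail mentions keep their surface text in the word sequence, while the entity-side inputs for both spans are replaced by [MASK] entity tokens. The example text and spans are made up for illustration.

```python
# Hypothetical example: head and tail mentions annotated by character span.
text = "Beyoncé lives in Los Angeles."
entity_spans = [(0, 7), (17, 28)]  # spans of the head and tail mentions

# In relation classification, the model never sees the entities' identities:
# every annotated span gets the [MASK] entity token on the entity side.
entities = ["[MASK]"] * len(entity_spans)
print(entities)  # ['[MASK]', '[MASK]']
```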

To apply LUKE in other contexts, we expect that LUKE is used with some entity linking or disambiguation systems such as this.

emilysilcock commented 2 years ago

Okay, my bad, I was asking this about the method in the paper that you linked, not about LUKE in general - I just followed the github link from https://arxiv.org/pdf/1909.00426.pdf. Sorry for the confusion!

So to go back to my original question: is the method that you propose for entity disambiguation able to handle OOV entities (i.e., ones that don't appear in Wikipedia)? Obviously it can't disambiguate them, but does it have the capacity to make no prediction rather than a wrong prediction?

Thanks! And sorry about the misunderstanding

ikuyamada commented 2 years ago

Hi @emilysilcock, thanks for your interest in our entity disambiguation work! Unfortunately, our current entity disambiguation model does not handle OOV entities. The published model is trained only with the candidate entities appearing in the datasets. However, as suggested by @Ryou0634, it may be possible to detect NIL entities by training the model with the [UNK] entity and detecting OOV entities using that token.

emilysilcock commented 2 years ago

Thanks, that's helpful to know!