openai / deeptype

Code for the paper "DeepType: Multilingual Entity Linking by Neural Type System Evolution"
https://arxiv.org/abs/1802.01021

where do you get candidate entities for a mention from? #49

Open teffland opened 5 years ago

teffland commented 5 years ago

I'm curious how you compute the candidate entities for a given mention. The paper says you use a lookup table, but it doesn't say how that lookup table is computed (or maybe I'm missing it...)

Does the lookup table do any normalization of the mention text before lookup?

I ask because I'm wondering how the oracle accuracies are so high -- they are higher than the maximum possible recall from using CrossWikis plus a Wikipedia dump for CoNLL (98% after some text normalization; see Ganea and Hofmann '17)
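For concreteness, here is a minimal sketch of the kind of mention-to-candidates lookup table being asked about, built from (anchor text, entity) pairs harvested from Wikipedia links. The function names and the normalization (whitespace stripping, lowercasing) are assumptions for illustration, not the repo's actual code:

```python
from collections import defaultdict

def build_anchor_table(anchor_entity_pairs):
    """Build a mention -> {entity: link count} table from
    (anchor_text, entity) pairs taken from Wikipedia links.
    The normalization here is an assumption; the paper does not
    spell out what, if anything, is applied."""
    table = defaultdict(lambda: defaultdict(int))
    for anchor, entity in anchor_entity_pairs:
        key = anchor.strip().lower()  # assumed normalization
        table[key][entity] += 1
    return table

def candidates(table, mention):
    """Return candidate entities for a mention, most-linked first."""
    counts = table.get(mention.strip().lower(), {})
    return sorted(counts, key=counts.get, reverse=True)
```

With such a table, candidate generation at test time is just a (normalized) dictionary lookup, which is exactly why mentions never seen as anchors in Wikipedia would get no candidates at all.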

mflis commented 5 years ago

The process of cleaning candidate links is described in the "Cleaning the data" section of the blog post (and probably somewhere in the paper too).

The code performing this operation is in `extraction/fast_link_fixer.py`.

Some concrete examples (from Polish Wikipedia):

- Article https://pl.wikipedia.org/wiki/Abisynia: the mention text "Sudan", which originally links to https://pl.wikipedia.org/wiki/Sudan_(region) (https://www.wikidata.org/wiki/Q209703 in Wikidata), is remapped to https://www.wikidata.org/wiki/Q1049
- Article https://pl.wikipedia.org/wiki/AWK: the mention text "bibliotek", which originally links to https://pl.wikipedia.org/wiki/Biblioteka_programistyczna (https://www.wikidata.org/wiki/Q188860 in Wikidata), is remapped to https://www.wikidata.org/wiki/Q7075
- Article https://pl.wikipedia.org/wiki/Atom: the mention text "duchowe", which originally links to https://pl.wikipedia.org/wiki/Duch_(filozofia) (https://www.wikidata.org/wiki/Q193291 in Wikidata), is remapped to https://www.wikidata.org/wiki/Q168796
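The remapping in these examples can be sketched as a table lookup followed until a canonical entity is reached. The `REMAP` table and `fix_link` helper below are hypothetical illustrations built from the QIDs above; the real logic lives in `extraction/fast_link_fixer.py`:

```python
# Hypothetical remapping table, using the QIDs from the examples above.
REMAP = {
    "Q209703": "Q1049",   # Sudan_(region) example
    "Q188860": "Q7075",   # Biblioteka_programistyczna example
    "Q193291": "Q168796", # Duch_(filozofia) example
}

def fix_link(qid, remap=REMAP):
    """Follow the remap chain until a canonical entity is reached,
    guarding against cycles."""
    seen = set()
    while qid in remap and qid not in seen:
        seen.add(qid)
        qid = remap[qid]
    return qid
```

Applying this during data extraction changes which entity a training anchor points at, which is why mflis stresses that it concerns the training links.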

teffland commented 5 years ago

Thanks, but unless I'm mistaken, that's only for the training data links. I'm confused about how candidates are generated at test time.

There are many examples in the CoNLL testb dataset that contain only the first or last name of an entity, for example, and those strings never appear as mentions in Wikipedia. These types of mentions alone cause lookup-table candidate generation to miss the gold entity more than 2% of the time, which is already below the oracle type accuracy reported in the paper.

So I'm curious if there is further normalization done at test time, or are the gold entities always candidates?
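To make the question concrete, here is a hypothetical partial-name fallback of the kind teffland is asking about (e.g. matching a bare surname against full names in the table). Nothing in the repo is known to implement exactly this; it only illustrates what "further normalization at test time" could look like:

```python
def lookup_with_fallback(table, mention):
    """Look up a mention in a mention -> {entity: count} table.
    If the exact (normalized) mention is absent, fall back to any
    table key containing the mention as a whole word, e.g. the
    surname "Smith" matching the known mention "John Smith".
    This fallback is an assumption, not DeepType's documented
    behavior."""
    key = mention.strip().lower()
    if key in table:
        return table[key]
    hits = {}
    for known, entities in table.items():
        if key in known.split():
            for entity, count in entities.items():
                hits[entity] = hits.get(entity, 0) + count
    return hits
```

Whether DeepType does something like this, or whether the gold entity is always injected into the candidate set, is precisely the open question in this thread.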

mflis commented 5 years ago

Oracle accuracy is calculated using `extraction/evaluate_type_system.py`.

It takes a configurable sample of Wikipedia articles from the XML dump and extracts links from them for evaluation. The main loop performing the extraction is here. Helper code for cleaning and filtering anchors is in `wikidata_linker_utils_src/src/python/wikidata_linker_utils/anchor_filtering.py`; e.g., it filters out bare first names.
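As a rough sketch of what such anchor filtering might look like (the exact rules in `anchor_filtering.py` are not reproduced here; the first-name rule and thresholds are assumptions):

```python
def is_bad_anchor(anchor, first_names):
    """Decide whether an anchor text should be dropped from
    evaluation. Sketched rules (assumed, not copied from the repo):
    drop anchors that are empty, purely numeric, or a bare first
    name, since those are too ambiguous to evaluate fairly."""
    text = anchor.strip().lower()
    if not text or text.isdigit():
        return True
    if text in first_names:  # e.g. "john" alone is too ambiguous
        return True
    return False
```

Filtering out such anchors before scoring would naturally push the measured oracle accuracy up, which is relevant to teffland's question about the reported numbers.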

I don't know the file format of the CoNLL dataset, but this preprocessing looks specific to Wikipedia. I haven't seen code in this repo that performs evaluation of oracle accuracy for datasets other than Wikipedia.

teffland commented 5 years ago

Hmm, then I'm confused about where the numbers in the WKD30, CoNLL, and TAC 2010 columns of Table 1.c in the paper come from... do you have an idea?

muscleSunFlower commented 5 years ago

```python
def load_aucs():
    paths = [
        "/home/jonathanraiman/en_field_auc_w10_e10.json",
        "/home/jonathanraiman/en_field_auc_w10_e10-s1234.json",
        "/home/jonathanraiman/en_field_auc_w5_e5.json",
        "/home/jonathanraiman/en_field_auc_w5_e5-s1234.json"
    ]
```

Where do these files come from?

nguyensinhtu commented 5 years ago

I still don't know how the model generates candidate entities. For example, given "CR7 is the best soccer player in the world", will the model generate {"Ronaldo", "Cristiano Ronaldo"} from "CR7"? Thank you!

ghost commented 4 years ago

I have the same question. The paper says that "if types were given to us by an oracle, we find that it is possible to obtain accuracies of 98.6-99% on two benchmark tasks CoNLL (YAGO) and the TAC KBP 2010 challenge". But how are candidate entities generated for the mentions in CoNLL (YAGO)? Can you please tell us how? @JonathanRaiman