teffland opened this issue 5 years ago
The process of cleaning candidate links is described in the "Cleaning the data" section of the blog post (and probably somewhere in the paper too).
The code performing this operation is in extraction/fast_link_fixer.py. For example, here are three cases:
Article: https://pl.wikipedia.org/wiki/Abisynia
The mention text "Sudan", which originally links to https://pl.wikipedia.org/wiki/Sudan_(region) (https://www.wikidata.org/wiki/Q209703 in Wikidata), is remapped to https://www.wikidata.org/wiki/Q1049.

Article: https://pl.wikipedia.org/wiki/AWK
The mention text "bibliotek", which originally links to https://pl.wikipedia.org/wiki/Biblioteka_programistyczna (https://www.wikidata.org/wiki/Q188860 in Wikidata), is remapped to https://www.wikidata.org/wiki/Q7075.

Article: https://pl.wikipedia.org/wiki/Atom
The mention text "duchowe", which originally links to https://pl.wikipedia.org/wiki/Duch_(filozofia) (https://www.wikidata.org/wiki/Q193291 in Wikidata), is remapped to https://www.wikidata.org/wiki/Q168796.
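For illustration, here is roughly what that remapping amounts to. This is a minimal sketch of my own, not the actual logic in extraction/fast_link_fixer.py; the QID pairs are simply the three examples above.

```python
# Toy remapping of extracted anchor links to more general Wikidata items.
# Not the repo's fast_link_fixer.py; the pairs are copied from the examples above.
REMAP = {
    "Q209703": "Q1049",    # from the Abisynia / "Sudan" example
    "Q188860": "Q7075",    # from the AWK / "bibliotek" example
    "Q193291": "Q168796",  # from the Atom / "duchowe" example
}

def fix_link(anchor_text: str, target_qid: str) -> tuple:
    """Return the anchor together with its (possibly remapped) Wikidata target."""
    return anchor_text, REMAP.get(target_qid, target_qid)

print(fix_link("Sudan", "Q209703"))  # ('Sudan', 'Q1049')
```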
Thanks, but unless I'm mistaken, that's only for the training data links. I'm confused about how candidates are generated at test time.
There are many examples in the CoNLL testb dataset that contain only the first or last name of an entity, for example, and those forms never appear as mentions in Wikipedia. Mentions of this kind alone cause candidate generation with a lookup table to miss the gold entity more than 2% of the time, which is already below the oracle type accuracy reported in the paper.
So I'm curious if there is further normalization done at test time, or are the gold entities always candidates?
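To make the concern concrete, here is roughly what I mean by lookup-table candidate generation. This is a toy sketch, not anything from this repo, and the table and entity IDs are made up.

```python
# Toy lookup-table candidate generation and its recall problem.
def gold_recall(lookup: dict, dataset: list) -> float:
    """Fraction of (mention, gold_entity) pairs whose gold entity is among the candidates."""
    hits = sum(1 for mention, gold in dataset if gold in lookup.get(mention, set()))
    return hits / len(dataset)

# An anchor-text table built from Wikipedia links: surname-only mentions like
# "Clinton" may never occur verbatim as anchors, so the gold entity is missing.
lookup = {"Bill Clinton": {"Bill_Clinton"}, "Hillary Clinton": {"Hillary_Clinton"}}
dataset = [("Bill Clinton", "Bill_Clinton"), ("Clinton", "Bill_Clinton")]
print(gold_recall(lookup, dataset))  # 0.5: the surname-only mention has no candidates
```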
Oracle accuracy is calculated using extraction/evaluate_type_system.py.
It takes a configurable sample of Wikipedia articles from the XML dump and extracts links from them for evaluation. The main loop performing the extraction is here. Helper code for cleaning and filtering anchors is in wikidata_linker_utils_src/src/python/wikidata_linker_utils/anchor_filtering.py; e.g., it filters out first names.
I don't know the file format of the CoNLL dataset, but this preprocessing looks specific to Wikipedia. I haven't seen code in this repo that evaluates oracle accuracy for datasets other than Wikipedia.
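To illustrate the kind of filtering I mean, here is a toy sketch; the actual rules live in anchor_filtering.py, and FIRST_NAMES below is just a stand-in for whatever name list is used.

```python
# Toy anchor filter: reject anchors that are empty or consist of a bare first name.
# The real implementation in anchor_filtering.py has many more rules.
FIRST_NAMES = {"john", "maria", "anna", "peter"}  # stand-in for the actual name list

def acceptable_anchor(anchor_text: str) -> bool:
    text = anchor_text.strip()
    if not text:
        return False
    if text.lower() in FIRST_NAMES:
        return False  # a lone first name is too ambiguous to keep as an anchor
    return True

anchors = ["John", "John von Neumann", "Paris"]
print([a for a in anchors if acceptable_anchor(a)])  # ['John von Neumann', 'Paris']
```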
Hmm, then I'm confused about where the numbers in the WKD30, CoNLL, and TAC 2010 columns of Table 1.c in the paper come from... do you have an idea?
```python
def load_aucs():
    paths = [
        "/home/jonathanraiman/en_field_auc_w10_e10.json",
        "/home/jonathanraiman/en_field_auc_w10_e10-s1234.json",
        "/home/jonathanraiman/en_field_auc_w5_e5.json",
        "/home/jonathanraiman/en_field_auc_w5_e5-s1234.json"
    ]
```
Where do these files come from?
I still don't know how the model generates candidate entities. For example, given "CR7 is the best soccer player in the world", will the model generate {"Ronaldo", "Cristiano Ronaldo"} from "CR7"? Thank you!
I have the same question. It is said that "if types were given to us by an oracle, we find that it is possible to obtain accuracies of 98.6-99% on two benchmark tasks CoNLL (YAGO) and the TAC KBP 2010 challenge". But, how are candidate entities generated for the mentions in CoNLL (YAGO)? Can you please tell us how? @JonathanRaiman
I'm curious how you compute the candidate entities for a given mention. The paper says you use a lookup table, but it does not say how that lookup table is built (or maybe I am missing it...).
Does the lookup table do any normalization of the mention text before lookup?
I ask because I'm wondering how the oracle accuracies are so high: they are higher than the maximum possible recall of using Crosswikis plus a Wikipedia dump for CoNLL (98% after some text normalization; see Ganea and Hofmann '17).
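For reference, this is the sort of normalization I have in mind; a hypothetical sketch, not something taken from this repo.

```python
import re

def normalize(mention: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace before lookup."""
    mention = mention.lower().strip()
    mention = re.sub(r"[^\w\s]", "", mention)
    return re.sub(r"\s+", " ", mention)

def candidates(lookup: dict, mention: str) -> set:
    """Union of the candidates for the raw and normalized mention forms."""
    return lookup.get(mention, set()) | lookup.get(normalize(mention), set())

print(normalize("  U.S. Open "))  # 'us open'
```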