openai / deeptype

Code for the paper "DeepType: Multilingual Entity Linking by Neural Type System Evolution"
https://arxiv.org/abs/1802.01021

KeyError in fast_link_fixer.py #35

Open EzBenson opened 6 years ago

EzBenson commented 6 years ago

When running full_preprocess.sh I get a KeyError; the exact error message is:

Traceback (most recent call last):
  File "extraction/fast_link_fixer.py", line 594, in <module>
    main()
  File "extraction/fast_link_fixer.py", line 456, in main
    initialize_globals(c)
  File "extraction/fast_link_fixer.py", line 101, in initialize_globals
    ASPECT_OF_HIST = wkd(c, "Q17524420")
  File "extraction/fast_link_fixer.py", line 72, in wkd
    return c.name2index[name]
  File "/usr/local/lib/python3.5/dist-packages/wikidata_linker_utils/wikidata_ids.py", line 20, in __getitem__
    value = self.marisa[key]
  File "src/marisa_trie.pyx", line 577, in marisa_trie.BytesTrie.__getitem__
KeyError: 'Q17524420'

More specifically, it occurs when the shell script runs:

python3 extraction/fast_link_fixer.py ${DATA_DIR}wikidata ${DATA_DIR}${LANGUAGE}_trie ${DATA_DIR}${LANGUAGE}_trie_fixed

Would anybody be able to help me with this problem?

heisenbugfix commented 6 years ago

I got the same error for HISTORY ("Q309"). I searched for this id in wikidata_ids.txt in data/wikidata/ and it was not found. That is likely the cause: if an id is not in wikidata_ids.txt, the item was not captured at all. I followed the exact steps given in README.md and am still getting the above error. @JonathanRaiman - please help!!
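For reference, this is how I checked (assuming wikidata_ids.txt really is one Q-id per line, which is how the file looks on my machine):

# Quick check for whether a Q-id survived into the extracted dump.
with open("data/wikidata/wikidata_ids.txt") as f:
    print(any(line.strip() == "Q309" for line in f))  # prints False for me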

JonathanRaiman commented 6 years ago

@heisenbugfix You've hit a rare instance of Wikipedia/Wikidata deprecating or merging articles. In general this happens when an article is deemed spammy or too similar to another article. My suggestion is to look up "Aspect of History" (what Q17524420 originally stood for), find what it used to point to or was used for, and then either pick a good substitute or remove it altogether.
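If you'd rather not dig by hand, one way to find what a merged id now points to is asking the live Wikidata API. This is only a sketch: I'm assuming the wbgetentities response carries a "redirects" entry for merged items and a "missing" entry for deleted ones.

import requests

def resolve_qid(qid):
    # Ask wikidata.org what a possibly-merged Q-id resolves to today.
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={"action": "wbgetentities", "ids": qid,
                "props": "info", "format": "json"},
    ).json()
    entity = resp["entities"][qid]
    if "missing" in entity:
        return None  # deleted outright; you'll need to pick a substitute
    return entity.get("redirects", {}).get("to", qid)

If Q17524420 was merged rather than deleted, resolve_qid("Q17524420") should return the id it was folded into, which can then replace ASPECT_OF_HIST in fast_link_fixer.py.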

jcklie commented 5 years ago

I have the same problem whether I extract the data or use the human-generated type system:

sh extraction/full_preprocess.sh ${DATA_DIR} en ends with a KeyError for

Q2472587 (people)

export LANGUAGE=fr
export DATA_DIR=data/
python3 extraction/project_graph.py ${DATA_DIR}wikidata/ extraction/classifiers/type_classifier.py

crashes with a KeyError for:

Q10855242 (race horse)
Q23038290 (fossil taxon)
enwiki/food

I stopped removing ids after this point, as I assume it is a deep rabbit hole. Is there an easy way around this? I really want to use this research, but sadly cannot.

JonathanRaiman commented 5 years ago

@jcklie I suggest either finding an older Wikidata dump to download (one that predates these invalidations), or writing your own ruleset/script from scratch. I've had this issue when upgrading to newer Wikidata dumps in the past, and usually there were ~3 broken ids, typically associated with merges of infrequently used Q-ids.

jcklie commented 5 years ago

@JonathanRaiman Thank you for your quick response. I have another question: why does it also crash with full_preprocess.sh? I thought this script automagically collects all the data from scratch using the newest Wikipedia and Wikidata.

JonathanRaiman commented 5 years ago

@jcklie The final step of full_preprocess.sh calls fast_link_fixer.py, a script that uses several Q-ids to construct inheritance rules for "fixing" Wikipedia links (e.g. changing the counts so that they get grouped in more semantic ways). This step is optional (the "non-fixed" counts are also compatible with the code). Nonetheless, if errors occur on this step, you can re-run just that script separately from the full_preprocess.sh pipeline and replace/update each missing Q-id with a valid new one.
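Re-running just that step is the same invocation the pipeline uses (taken from the first post above):

python3 extraction/fast_link_fixer.py ${DATA_DIR}wikidata ${DATA_DIR}${LANGUAGE}_trie ${DATA_DIR}${LANGUAGE}_trie_fixed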

For instance, fast_link_fixer.py has on line 99: PEOPLE = wkd(c, "Q2472587"). Suppose Q2472587 has vanished; then I would suggest finding other parent classes for an instance of "people" (e.g. Jewish people, Q7325, has "nation" and "ethnoreligious group" as possible alternative parents).

Concerning Q2472587 specifically, I'm a bit confused because it still shows up here, so I'm not sure what went wrong in the extraction process. If you can post the traceback/error, that might help track down where/why some Q-ids went missing.
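If you end up patching several of these, a small fallback helper keeps the edits contained. A sketch only: wkd_first is not part of the repo, and any substitute ids are ones you pick by hand on Wikidata.

def wkd_first(c, *candidates):
    # Try each candidate Q-id in order and return the index of the
    # first one that survives in the extracted trie.
    for name in candidates:
        try:
            return c.name2index[name]
        except KeyError:
            continue
    raise KeyError("none of {} found in wikidata ids".format(candidates))

# Line 99 could then become, with a hand-picked substitute:
# PEOPLE = wkd_first(c, "Q2472587", "<replacement Q-id>")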

jcklie commented 5 years ago

@JonathanRaiman

I run sh extraction/full_preprocess.sh ${DATA_DIR} en

The end of the output is:

Construct mapping 100% (254777507 lines) |######################|Time:  1:04:32
loaded trie
61455967/254777507 anchor_tags could not be found in wikidata
3264/254777507 anchor_tags links were malformed/too long
Missing anchor_tags sample:
    Anarchism -> Anarchism
    self-governed -> Self-governance
    hierarchies -> Hierarchy
    stateless societies -> Stateless society
    anarcho-capitalism -> anarcho-capitalism
    anarchist legal philosophy -> Anarchist law
    anti-authoritarian interpretations -> Libertarian socialism
    collectivism -> Collectivist anarchism
    individualism -> individualism
    social -> Social anarchism
/home/klie/entity-linking/deeptype/venv/lib64/python3.6/site-packages/wikidata_linker_utils/type_collection.py:351: UserWarning: Node 'Q3679160' under `bad_node` is not a known wikidata id.
  el

...

loading wikidata id -> index
done
Traceback (most recent call last):
  File "extraction/fast_link_fixer.py", line 594, in <module>
    main()
  File "extraction/fast_link_fixer.py", line 456, in main
    initialize_globals(c)
  File "extraction/fast_link_fixer.py", line 99, in initialize_globals
    PEOPLE = wkd(c, "Q2472587")
  File "extraction/fast_link_fixer.py", line 72, in wkd
    return c.name2index[name]
  File "/home/klie/entity-linking/deeptype/venv/lib64/python3.6/site-packages/wikidata_linker_utils/wikidata_ids.py", line 20, in __getitem__
    value = self.marisa[key]
  File "marisa_trie.pyx", line 462, in marisa_trie.BytesTrie.__getitem__ (src/marisa_trie.cpp:8352)
KeyError: 'Q2472587'

dungtn commented 5 years ago

I have the same problem and I tried looking at the constructed trie. It looks like a lot of category links and anchor tags are missing:

32644079/33088413 category links could not be found in wikidata
85/33088413 category links were malformed
Missing links sample:
    'enwiki/Anarchism' -> 'enwiki/Category:Anarchism'
    'enwiki/Anarchism' -> 'enwiki/Category:Anti-capitalism'
    'enwiki/Anarchism' -> 'enwiki/Category:Anti-fascism'
    'enwiki/Anarchism' -> 'enwiki/Category:Far-left politics'
    'enwiki/Anarchism' -> 'enwiki/Category:Libertarian socialism'
    'enwiki/Anarchism' -> 'enwiki/Category:Political culture'
    'enwiki/Anarchism' -> 'enwiki/Category:Political ideologies'
    'enwiki/Anarchism' -> 'enwiki/Category:Social theories'
    'enwiki/Autism' -> 'enwiki/Category:Autism'
    'enwiki/Autism' -> 'enwiki/Category:Articles containing video clips'

162759256/255575773 anchor_tags could not be found in wikidata
3286/255575773 anchor_tags links were malformed/too long
Missing anchor_tags sample:
    Anarchism -> Anarchism
    anti-authoritarian -> anti-authoritarian
    political philosophy -> political philosophy
    self-governed -> Self-governance
    cooperative -> cooperative
    hierarchies -> Hierarchy
    stateless societies -> Stateless society
    free associations -> Free association (communism and anarchism)
    state -> State (polity)
    far-left -> Far-left politics

Most of the entities referenced in fast_link_fixer.py are not in the trie but are still available on Wikidata. It seems fine to me to just ignore them, but I'm not sure whether that will affect the results.
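Concretely, by "ignoring them" I mean something like this sketch, where the hypothetical wkd_optional stands in for the wkd helper in fast_link_fixer.py (any rule built on a None id would then have to be skipped downstream):

import warnings

def wkd_optional(c, name):
    # Return None instead of raising when an id is missing from the
    # extracted trie, so the remaining fixer rules can still be built.
    try:
        return c.name2index[name]
    except KeyError:
        warnings.warn("{} not in the extracted wikidata ids; skipping".format(name))
        return None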

Also, it may be a stupid question, but I couldn't find where to download the same Wiki dump as mentioned in the paper. Can you point me to it?

All the best.

linhlt2689 commented 5 years ago

Hi. Can you tell me how to map a Qxxx id to an entity using the Wikidata dump we download? I can't find any file that maps these. Please help. Thanks.
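From the earlier comments it sounds like data/wikidata/wikidata_ids.txt holds the ids; is building the map like this the intended way (assuming the file has one Q-id per line and the line number is the entity's index)?

# Assumed layout: one Q-id per line, line number = entity index.
name2index = {}
with open("data/wikidata/wikidata_ids.txt") as f:
    for index, line in enumerate(f):
        name2index[line.strip()] = index
print(name2index.get("Q2472587"))  # None if the id is missing from the dump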

muscleSunFlower commented 5 years ago

def load_aucs():
    paths = [
        "/home/jonathanraiman/en_field_auc_w10_e10.json",
        "/home/jonathanraiman/en_field_auc_w10_e10-s1234.json",
        "/home/jonathanraiman/en_field_auc_w5_e5.json",
        "/home/jonathanraiman/en_field_auc_w5_e5-s1234.json"
    ]

Where do these files come from?

Lavine24 commented 4 years ago

Could you please tell us which dump file you used in the paper? We failed at this step, so we cannot proceed to the next one. Thank you very much.

lbozarth commented 4 years ago

I have the same issue with "Q2472587". Did anyone fix it?

ghost commented 4 years ago

> I have the same issue with "Q2472587". Did anyone fix it?

I am having the same problem. Did anyone manage to fix it?

zbeloki commented 3 years ago

Same problem here with Q20871948.