openai / deeptype

Code for the paper "DeepType: Multilingual Entity Linking by Neural Type System Evolution"
https://arxiv.org/abs/1802.01021

KeyError 'enwiki/Human' extraction/classifiers/type_classifier.py #19

Closed ghpu closed 6 years ago

ghpu commented 6 years ago

Issue running 'extraction/classifiers/type_classifier.py', please fix.

'enwiki/Human'
Traceback (most recent call last):
  File "extraction/project_graph.py", line 123, in main
    classification = classifier.classify(collection)
  File "extraction/classifiers/type_classifier.py", line 26, in classify
    HUMAN = wkp(c, "Human")
  File "extraction/classifiers/type_classifier.py", line 14, in wkp
    return c.article2id['enwiki/' + name][0][0]
  File "src/marisa_trie.pyx", line 578, in marisa_trie.BytesTrie.__getitem__ (src/marisa_trie.cpp:10859)
KeyError: 'enwiki/Human'

Should type_classifier.py be updated somehow, like fast_link_fixer.py was?
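For reference, a minimal probe (assuming `c` is the collection loaded in project_graph, whose `article2id` is the marisa_trie.BytesTrie shown in the traceback) confirms the key is simply absent:

def inspect_missing_key(c):
    # article2id is a marisa_trie.BytesTrie (per the traceback), so the key
    # can be probed directly instead of letting wkp raise a KeyError.
    key = 'enwiki/' + 'Human'
    print(key in c.article2id)               # False here, hence the KeyError
    print(c.article2id.keys('enwiki/Huma'))  # whatever similar titles do exist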

ghpu commented 6 years ago

More changes are required: 'Human' can be replaced by 'Q5', but 'Food' cannot be replaced by 'Q2095':

Issue running 'extraction/classifiers/type_classifier.py', please fix.

'enwiki/Q2095'
Traceback (most recent call last):
  File "extraction/project_graph.py", line 123, in main
    classification = classifier.classify(collection)
  File "extraction/classifiers/type_classifier.py", line 32, in classify
    FOOD = wkp(c, "Q2095")
  File "extraction/classifiers/type_classifier.py", line 14, in wkp
    return c.article2id['enwiki/' + name][0][0]
  File "src/marisa_trie.pyx", line 578, in marisa_trie.BytesTrie.__getitem__ (src/marisa_trie.cpp:10859)
KeyError: 'enwiki/Q2095'

JonathanRaiman commented 6 years ago

In the classifiers, two functions help construct indices from string names:

def wkp(c, name):
    # Wikipedia article title -> numeric id (assumes English Wikipedia, hence the 'enwiki/' prefix)
    return c.article2id['enwiki/' + name][0][0]

def wkd(c, name):
    # Wikidata id (e.g. "Q5") -> numeric id
    return c.name2index[name]

The wkp function uses Wikipedia titles to get a numeric id (but assumes the name is from the English Wikipedia), while wkd uses Wikidata's id scheme to get a numeric id, so you would have wkp(c, "Human") == wkd(c, "Q5"). Could you give any other information about this problem (e.g. Python version, which Wikipedia dumps you extracted to run this)?
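As a rough sketch of the difference (assuming `c` is the loaded collection with the `article2id` and `name2index` tries above):

# Both lookups should resolve to the same numeric id for "human".
HUMAN_by_title = wkp(c, "Human")  # English Wikipedia article title
HUMAN_by_qid = wkd(c, "Q5")       # Wikidata id
assert HUMAN_by_title == HUMAN_by_qid

# Passing a Wikidata id to wkp does not work: it prefixes the key with
# 'enwiki/' and looks it up as an article title, which is exactly the
# KeyError reported above.
# wkp(c, "Q2095")  -> KeyError: 'enwiki/Q2095'
# wkd(c, "Q2095")  -> numeric id for "food", if present in name2index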

Also, a random theory: the issue might be bytes vs. str in the case you mentioned, e.g. try:

def wkp(c, name):
    # Encode the key as UTF-8 bytes before the trie lookup
    return c.article2id[('enwiki/' + name).encode("utf-8")][0][0]

in case marisa-trie requires byte keys in your installed version.
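For example, a defensive variant (hypothetical name, assuming only the `article2id` trie shown above) could try the plain str key first and fall back to the encoded one:

def wkp_safe(c, name):
    # Hypothetical fallback: try the plain str key first, then the
    # UTF-8 encoded key, in case the installed marisa-trie wants bytes.
    key = 'enwiki/' + name
    try:
        return c.article2id[key][0][0]
    except KeyError:
        return c.article2id[key.encode("utf-8")][0][0]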

ghpu commented 6 years ago

Unfortunately, I no longer have the Wikipedia dumps; they were downloaded on February 22nd. I have upgraded my system since, and am trying again with the updated code; I will keep you informed if it is now solved.

JonathanRaiman commented 6 years ago

Thanks! Please re-open if the problem persists :)

ghpu commented 6 years ago

Problem solved with the latest version of the code (commit 7271648f), Python 3.6rc5 on Ubuntu Bionic Beaver (18.04), and the latest Wikipedia dumps (as of 23rd March 2018).

Thanks for your time!