openai / deeptype

Code for the paper "DeepType: Multilingual Entity Linking by Neural Type System Evolution"
https://arxiv.org/abs/1802.01021

trie.get('CIA') doesn't work on ja_trie #28

Closed ghost closed 6 years ago

ghost commented 6 years ago

I created ja_trie by running this:

./extraction/full_preprocess.sh ${DATA_DIR} ja

After that, I checked the following:

from os.path import join

import marisa_trie

language_path = "../data/ja_trie/"

# Load the trie produced by full_preprocess.sh
trie = marisa_trie.Trie().load(
    join(language_path, "trie.marisa")
)

assert trie.get('アメリカ') is not None

That works. But if the key contains any Latin letters, the lookup returns nothing:

assert trie.get('CIA') is not None
AssertionError                            Traceback (most recent call last)
<ipython-input-11-41516e200beb> in <module>()
----> 1 assert trie.get('CIA') is not None

AssertionError: 

jawiki definitely contains 'CIA' as anchor text, so why does this happen?

ghost commented 6 years ago

The key 'CIA' has to be lowercased, and Japanese full-width (multi-byte) Latin letters have to be converted to ASCII:

trie.get('cia')  # works
trie.get('CIA')  # does not work (uppercase)
trie.get('ｃｉａ')  # does not work (full-width letters)
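
A minimal sketch of normalizing a query before the lookup, assuming the trie keys were lowercased and width-normalized during preprocessing, as the lookups above suggest. The helper name normalize_key is hypothetical and not part of deeptype, and a plain set stands in for the loaded trie so the snippet runs without the data files:

```python
import unicodedata


def normalize_key(text):
    """Hypothetical helper: fold full-width Latin letters to ASCII, then lowercase."""
    # NFKC maps compatibility characters such as full-width 'Ｃ' to ASCII 'C'.
    return unicodedata.normalize("NFKC", text).lower()


# Stand-in for the keys stored in the trie (assumed to be normalized already).
keys = {"cia", "アメリカ"}

for query in ("CIA", "ＣＩＡ", "ｃｉａ", "アメリカ"):
    # Each query matches once it has been normalized the same way as the keys.
    print(query, normalize_key(query) in keys)
```

With the real trie, the equivalent call would be trie.get(normalize_key('CIA')). Whether deeptype applies exactly this normalization when building the trie is not confirmed here, so treat NFKC plus lowercasing as an approximation.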