KeyError: 'united_states'

mickeysjm / HiExpan

The source code used for automatic taxonomy construction method HiExpan, published in KDD 2018

GNU General Public License v3.0


KeyError: 'united_states' #3

Open hanayashiki opened 5 years ago

hanayashiki commented 5 years ago

Hello, I would like to test HiExpan on the wiki corpus. After the featureExtraction step, I ran

~/HiExpan/src/HiExpan-new$ python3.6 main.py -data wiki

to test it. However, after loading the files in wiki/intermediate, I got:

=== Finish loading data ...... ===
=== Start loading seed supervision ...... ===
Traceback (most recent call last):
  File "main.py", line 120, in <module>
    newNode = TreeNode(parent=rootNode, level=0, eid=ename2eid[children], ename=children,
KeyError: 'united_states'

It seems that united_states is not included in those entities. What could possibly be wrong? Thank you.

hanayashiki commented 5 years ago

After I edited seedLoader.py from

    if corpusName == "wiki":
        userInput = [
            ["ROOT", -1, ["united_states", "china", "canada"]],
            ["united_states", 0, ["california", "illinois", "florida"]],
            ["china", 0, ["shandong", "zhejiang", "sichuan"]],
        ]

to

    if corpusName == "wiki":
        userInput = [
            ["ROOT", -1, ["United States", "China", "Canada"]],
            ["United States", 0, ["California", "Illinois", "Florida"]],
            ["China", 0, ["Shandong", "Zhejiang", "Sichuan"]],
        ]

It seems to be working now. It seems that the phrases are not connected by "_" as described in your paper.

mickeysjm commented 5 years ago

Thanks for pointing this out. The seed entities need to appear in the generated entity2id.txt file. I think the phrases are connected with "_" during the embedding learning and corpus preprocessing stages but are then converted back. Glad to hear you have started running the expansion code. Thanks.
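
For anyone hitting the same error, the requirement above (every seed must appear in entity2id.txt) can be checked before running main.py. The following is only a minimal sketch, not code from the repository: the helper names `load_entity2id` and `missing_seeds` are hypothetical, and the tab-separated "name, then id" line layout is an assumption about entity2id.txt that should be verified against your own file.

```python
def load_entity2id(lines):
    """Parse lines into a {entity name: entity id} dict.

    ASSUMPTION: each line is "name<TAB>id"; check your entity2id.txt.
    """
    ename2eid = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        name, eid = line.rsplit("\t", 1)
        ename2eid[name] = int(eid)
    return ename2eid


def missing_seeds(user_input, ename2eid):
    """Return seed names (parents and children) absent from ename2eid.

    user_input uses the [parent, level, children] rows seen in seedLoader.py.
    """
    missing = []
    for parent, _level, children in user_input:
        for name in [parent] + list(children):
            if name != "ROOT" and name not in ename2eid and name not in missing:
                missing.append(name)
    return missing


# Small in-memory example instead of reading a real file.
sample_lines = ["United States\t0\n", "China\t1\n", "Canada\t2\n"]
ename2eid = load_entity2id(sample_lines)
seeds = [["ROOT", -1, ["united_states", "china", "Canada"]]]
print(missing_seeds(seeds, ename2eid))  # ['united_states', 'china']
```

With a real corpus, the lines would come from something like `open(path, encoding="utf-8")`, where `path` points to the entity2id.txt generated for that corpus. An empty result means every seed should resolve in `ename2eid` without a KeyError.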

hanayashiki commented 5 years ago

I was using the preprocessed corpus downloaded from the links you provided. Maybe the sample inputs in seedLoader.py should be changed to be compatible with it.
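
As a stopgap until the sample seeds are updated, one possible workaround is to map the underscore-style seeds onto whatever spelling the corpus actually uses. This is a rough sketch under stated assumptions, not part of HiExpan: `resolve_seeds` and `normalize` are hypothetical helpers, and it assumes the two spellings differ only in letter case and in "_" versus a space, which is what the manual edit earlier in this thread suggests.

```python
def normalize(name):
    """Lowercase and join words with "_" (assumed to be the only differences)."""
    return name.lower().replace(" ", "_")


def resolve_seeds(user_input, enames):
    """Rewrite seed rows so every name matches an entry in enames.

    user_input: [parent, level, children] rows as in seedLoader.py.
    enames: entity names known to the corpus (e.g. keys of ename2eid).
    Raises KeyError for a seed with no matching entity.
    """
    index = {}
    for ename in enames:
        # Keep the first spelling seen if two names normalize identically.
        index.setdefault(normalize(ename), ename)

    def resolve(name):
        if name == "ROOT":
            return name  # ROOT is a placeholder, not a corpus entity
        key = normalize(name)
        if key not in index:
            raise KeyError(f"seed {name!r} has no matching entity")
        return index[key]

    return [
        [resolve(parent), level, [resolve(child) for child in children]]
        for parent, level, children in user_input
    ]


enames = ["United States", "China", "Canada", "California", "Illinois", "Florida"]
seeds = [
    ["ROOT", -1, ["united_states", "china", "canada"]],
    ["united_states", 0, ["california", "illinois", "florida"]],
]
for row in resolve_seeds(seeds, enames):
    print(row)
```

Raising on an unmatched seed keeps the failure at load time with a clearer message, rather than surfacing later as a bare KeyError inside main.py.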