vered1986 / HypeNET

Integrated path-based and distributional method for hypernymy detection

wikipedia dump file #4

Closed wt123u closed 6 years ago

wt123u commented 6 years ago

Hey Vered, I am very interested in trying your code too, but I don't know the format of the Wikipedia dump file. Is it XML or JSON?

vered1986 commented 6 years ago

Hi @wt123u,

I downloaded the XML dump and converted it to a text corpus. You can do so as suggested here or using the WikiExtractor.

You can follow the detailed guide (from our follow-up project). If something is missing or doesn't work using the instructions from there, let me know!
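In case it helps, here is a minimal sketch of one way to turn the XML dump into a plain-text corpus using gensim's WikiCorpus (file names are placeholders; the linked guide and WikiExtractor are the recommended routes):

```python
# Sketch: convert a Wikipedia XML dump into a plain-text corpus,
# one article per line. Assumes gensim is installed; paths are placeholders.
from gensim.corpora import WikiCorpus

DUMP = 'enwiki-latest-pages-articles.xml.bz2'  # downloaded XML dump (bz2)
OUT = 'wiki_corpus.txt'                        # plain-text output

# dictionary={} skips building a vocabulary, which we don't need here
wiki = WikiCorpus(DUMP, dictionary={})

with open(OUT, 'w', encoding='utf-8') as out_file:
    for tokens in wiki.get_texts():  # one article at a time, as a list of tokens
        # older gensim versions yield bytes, newer ones yield str
        words = [t.decode('utf-8') if isinstance(t, bytes) else t for t in tokens]
        out_file.write(' '.join(words) + '\n')
```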

wt123u commented 6 years ago

Hey Vered, I am very glad to see your response. I'm going to try it.

wt123u commented 6 years ago

Hi @vered1986, I want to know how to create the dataset. The paper says that distant supervision was used; can you provide some details? I need to extract a domain-specific knowledge base from YAGO.

vered1986 commented 6 years ago

Hi @wt123u,

The dataset is available here. If you want to create a different dataset in the same manner, you can download the triplet files from the various resources (e.g. YAGO) and select the pairs of entities that are connected by relevant properties. We selected these properties from our previous work and used those stated in the paper, and we used other properties as negative examples. Finally, to make sure that the dataset examples have paths in the corpus (which you may or may not want to do), you can use this script. I hope this helps.
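If it helps, here is a rough sketch of that selection step over a tab-separated triplet file (subject, property, object). The property names and file names below are placeholders, not the actual lists we used:

```python
# Sketch of distant supervision: label entity pairs by the property connecting them.
# Property names, file names, and the output format are placeholders.
import csv

POSITIVE_PROPS = {'subClassOf', 'rdf:type'}   # properties treated as hypernymy (placeholder)
NEGATIVE_PROPS = {'isLocatedIn', 'created'}   # properties treated as negatives (placeholder)

def build_pairs(triplet_file):
    """Read tab-separated (subject, property, object) triples and label the pairs."""
    pairs = []
    with open(triplet_file, encoding='utf-8') as f:
        for row in csv.reader(f, delimiter='\t'):
            if len(row) != 3:
                continue  # skip malformed lines
            subj, prop, obj = row
            if prop in POSITIVE_PROPS:
                pairs.append((subj, obj, 'True'))
            elif prop in NEGATIVE_PROPS:
                pairs.append((subj, obj, 'False'))
    return pairs

if __name__ == '__main__':
    for x, y, label in build_pairs('yago_triples.tsv'):
        print('\t'.join((x, y, label)))
```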

wt123u commented 6 years ago

Hi @vered1986, Is the dataset contained in dataset_rnd/ or dataset_lex/ the original one or the filtered one? I don't understand the lexical split of the dataset.

wt123u commented 6 years ago

There is some difference between the paths you provide and the ones generated by parsed_wikipedia.py: yours look like 'X/NOUN/nsubj/>_from/ADP/ROOT/^_another/DET/det/V_Y/NOUN/pobj/<' while the generated ones look like 'X/PROPN/compound>park/PROPN/ROOT<Y/PROPN/compound'. The position of the direction marker is different.

vered1986 commented 6 years ago

Filtered. Lexical split means that the train, test, and validation sets consist of distinct vocabularies (e.g. if "cat" is in the training set, it can't be in the validation or test sets).
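To make that concrete, here is a tiny check one could run over (x, y, label) pairs; this is an illustration only, not code from the repository:

```python
# Illustration of a lexical split: no term may appear in more than one of
# the train / validation / test sets.

def vocab(pairs):
    """Collect every term (both x and y) from a list of (x, y, label) pairs."""
    words = set()
    for x, y, _ in pairs:
        words.update((x, y))
    return words

def is_lexical_split(train, val, test):
    v_train, v_val, v_test = vocab(train), vocab(val), vocab(test)
    return not (v_train & v_val) and not (v_train & v_test) and not (v_val & v_test)

train = [('cat', 'animal', 'True')]
val = [('dog', 'mammal', 'True')]
test = [('cat', 'feline', 'True')]        # "cat" already appears in train
print(is_lexical_split(train, val, test))  # False
```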

Right, sorry about that: the first format (with the direction as a fourth component after the slash) is the updated one. For consistency I suggest that you use the LexNET code, which is more up to date, unless you want to reproduce the results from the HypeNET paper (in which case you can't use the LexNET processed corpus and pretrained models).
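In the updated format, edges are separated by underscores and each edge is word/POS/dependency/direction, so a path can be read like this (illustration only, assuming the words themselves contain no underscores):

```python
def parse_path(path):
    """Split an updated-format dependency path into (word, POS, dependency, direction) edges."""
    return [tuple(edge.split('/')) for edge in path.split('_')]

path = 'X/NOUN/nsubj/>_from/ADP/ROOT/^_another/DET/det/V_Y/NOUN/pobj/<'
for edge in parse_path(path):
    print(edge)
# ('X', 'NOUN', 'nsubj', '>')
# ('from', 'ADP', 'ROOT', '^')
# ('another', 'DET', 'det', 'V')
# ('Y', 'NOUN', 'pobj', '<')
```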

wt123u commented 6 years ago

Hi @vered1986, Oh, I see, thank you so much. Do you mean that the corpus provided as resource V1 or V2 was processed by LexNET? My task is to identify hypernymy in specific domains; I think using HypeNET will be better, do you agree? I tried to reproduce the results from the paper, but my results were much worse, as in the attached picture: rnd_path means the dataset is dataset_rnd and the method is path_based.

[screenshot: result_experience]

vered1986 commented 6 years ago

You may use the LexNET code; you should get comparable (although not identical) numbers. There are not that many changes, and it may be that I forgot to add some of the updates to the HypeNET repository.

These numbers are way too low for the path-based method; clearly it's only getting some pairs right by chance. Based on the other open issue, I would guess your corpus file is not built correctly. I'm closing this issue for now until we solve the other one.