vered1986 / HypeNET

Integrated path-based and distributional method for hypernymy detection

TypeError: String or Integer object expected for key, unicode found #5

Closed wt123u closed 6 years ago

wt123u commented 6 years ago

Hi @vered1986, when running line 55 of create_resource_from_corpus.sh, which converts the textual triplets to triplets of IDs, there is an error in create_resource_from_corpus_2.py at line 45: TypeError: String or Integer object expected for key, unicode found.

vered1986 commented 6 years ago

Line 45 in that file is a blank line. Did you change this script? Are you using python 2 or 3? You should be using 2.

wt123u commented 6 years ago

Sorry, it seems the script was changed. The x_id and y_id could not be found in the term_to_id.db file, so I solved the problem by adding a check for missing keys.
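
Roughly, the check I added looks like this sketch (Python 2; I'm assuming the db is opened with bsddb, which is where the error message comes from, and the file names are placeholders rather than the script's real arguments):

```python
# Sketch: convert textual pairs to ID pairs, skipping terms missing from term_to_id.db.
import bsddb
import codecs

term_to_id = bsddb.btopen('term_to_id.db', 'r')

def lookup(term):
    # bsddb keys must be byte strings: passing unicode raises
    # "String or Integer object expected for key, unicode found".
    key = term.encode('utf-8') if isinstance(term, unicode) else term
    return term_to_id[key] if term_to_id.has_key(key) else None

with codecs.open('triplets.txt', 'r', 'utf-8') as f_in:
    for line in f_in:
        x, y, path = line.strip().split('\t')
        x_id, y_id = lookup(x), lookup(y)
        if x_id is None or y_id is None:
            continue  # this pair's terms never made it into the db
        # ... write (x_id, y_id, path) as the real script does ...
```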

vered1986 commented 6 years ago

This shouldn't happen, since create_resource_from_corpus_1.py takes care of saving all the terms. Have you looked into the db file? Can you try to open it in the python shell and read from it, e.g. looking for specific terms that must be there (e.g. cat) or counting the number of terms?
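
For example, something like this in a Python 2 shell (assuming term_to_id.db is a Berkeley DB file, which the error message suggests):

```python
import bsddb

db = bsddb.btopen('term_to_id.db', 'r')
print len(db)             # total number of terms stored
print db.has_key('cat')   # a frequent noun that should be there
if db.has_key('cat'):
    print db['cat']       # its ID
```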

wt123u commented 6 years ago

Because the wiki dump file was too large and generating the corpus takes a long time, I terminated it and parsed only 138 MB of data. I used the small data just to make sure the program runs without errors. Looking at the generated files, there are special characters in the terms file (e.g. &). To compare with the results in the paper, I used resource V1 as the corpus. Maybe the problem is there.

wt123u commented 6 years ago

I downloaded the latest wiki dump file, processed it with WikiExtractor to obtain plain text, and used that plain text as the input to the program. Is it necessary to create a vocabulary file like the one used in LexNET? How long did it take you to process the corpora?

vered1986 commented 6 years ago

It is necessary. You can use the GloVe vocabulary, from here. In general, you should follow the instructions here.
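
If you only have the GloVe vectors file rather than a separate vocabulary file, you can extract the vocabulary from it, e.g. (file names are just examples):

```python
# Each line of the vectors file starts with the word, followed by its vector.
import codecs

with codecs.open('glove.6B.50d.txt', 'r', 'utf-8') as f_in, \
        codecs.open('glove_vocab.txt', 'w', 'utf-8') as f_out:
    for line in f_in:
        f_out.write(line.split(' ', 1)[0] + '\n')
```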

It takes a while, I don't remember exactly. A matter of hours up to a day, I think. If you have a server with multiple cores, change the script to parallelize the process more than it does now.
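
For example, something like this sketch: split the plain-text corpus into chunks and run the existing parsing script on each chunk on its own core. The chunk naming and the parse_wikipedia.py arguments below are assumptions, not the script's exact interface.

```python
import glob
import subprocess
from multiprocessing import Pool

def parse_chunk(chunk_file):
    # Each worker invokes the existing parsing script on one chunk of the corpus.
    out_file = chunk_file + '.parsed'
    subprocess.call(['python', 'parse_wikipedia.py', chunk_file, 'vocab.txt', out_file])
    return out_file

if __name__ == '__main__':
    chunks = sorted(glob.glob('corpus_chunk_*'))
    pool = Pool(processes=8)  # match the number of available cores
    pool.map(parse_chunk, chunks)
```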

wt123u commented 6 years ago

It took about 3 days to process the corpus, and the parsed paths file is very large, even though I used a server with multiple cores. I also saw that the same term pair gets different paths generated from the same sentence; is something wrong with that?
Another thing I don't understand: in LexNET/corpus/parse_wikipedia.py line 67, when collecting the terms, token.pos is checked not only against 'NN' but also 'VERB' and 'ADJ'. Is there a special reason for this?

wt123u commented 6 years ago

I counted how many (x, y) pairs can be converted to (x_id, y_id), where (x, y) comes from dataset_lex; x_s is the number of x terms found in term_to_id.db, and likewise for y_s and x_y. The counts are as follows: x_s: 18530, y_s: 26872, x_y: 17107, dataset_key: 28295.
So 11188 pairs are lost when converting the dataset's terms to IDs. Is this normal? I used HypeNET. The empty-path counts are as follows (train_integrated.py on dataset_rnd): [screenshot: train_integrated_rnd]
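
The counting was done roughly like this (Python 2; the dataset file name and format below are placeholders):

```python
import bsddb
import codecs

term_to_id = bsddb.btopen('term_to_id.db', 'r')

def in_db(term):
    return term_to_id.has_key(term.encode('utf-8'))

x_s = y_s = x_y = total = 0
with codecs.open('dataset.tsv', 'r', 'utf-8') as f_in:
    for line in f_in:
        x, y = line.strip().split('\t')[:2]
        total += 1
        x_found, y_found = in_db(x), in_db(y)
        if x_found: x_s += 1
        if y_found: y_s += 1
        if x_found and y_found: x_y += 1

print 'x_s: %d  y_s: %d  x_y: %d  dataset: %d' % (x_s, y_s, x_y, total)
```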

vered1986 commented 6 years ago

It took about 3 days to process the corpus, and the parsed paths file is very large, even though I used a server with multiple cores.

Which version of spacy are you using? I find the > 2 versions to be much slower than the old ones. You can try using spacy 1.6, which is the version on which this code was tested.

I also saw that the same term pair gets different paths generated from the same sentence; is something wrong with that?

Probably not, but I need to see an example to be sure - there are paths with and without satellites. If the same path is generated more than once for the same term-pair and the same sentence, that's a problem.

Another thing I don't understand: in LexNET/corpus/parse_wikipedia.py line 67, when collecting the terms, token.pos is checked not only against 'NN' but also 'VERB' and 'ADJ'. Is there a special reason for this?

I wanted to be able to use it for all content words (nouns, verbs, adjectives), as some of the datasets I tested contain other POS than nouns. If you only need nouns, change the script, and it will run a bit faster.
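
For illustration, this is the kind of change I mean, written against the spaCy 1.x API (the actual variable names and POS checks in the script differ):

```python
import spacy

nlp = spacy.load('en')
doc = nlp(u'The domestic cat is a small carnivorous mammal.')

# LexNET keeps all content words; restrict to nouns to speed things up.
content_pos = ('NOUN',)        # instead of ('NOUN', 'VERB', 'ADJ')
terms = [t.lemma_ for t in doc if t.pos_ in content_pos]
print terms
```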

I counted how many (x, y) pairs can be converted to (x_id, y_id), where (x, y) comes from dataset_lex; x_s is the number of x terms found in term_to_id.db, and likewise for y_s and x_y. The counts are as follows: x_s: 18530, y_s: 26872, x_y: 17107, dataset_key: 28295. So 11188 pairs are lost when converting the dataset's terms to IDs. Is this normal? I used HypeNET. The empty-path counts are as follows (train_integrated.py on dataset_rnd):

As a general comment, the HypeNET corpus was processed for all noun phrases in Wikipedia, and I've made sure that all the term-pairs in the dataset also appear together in Wikipedia (i.e. have paths). In LexNET I've done it differently: I've parsed Wikipedia for all the nouns, verbs and adjectives in the GloVe vocabulary. This way it can be used for various different datasets, but notice that if you use it for HypeNET, it won't contain many of the terms, especially the MWEs. If your goal is to reproduce the results on the HypeNET dataset, you should provide the HypeNET dataset vocabulary to the version of parse_wikipedia.py in LexNET. I suggest that you provide it in addition to the general vocabulary file so that you wouldn't have to parse it again for every dataset.
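
For example, a combined vocabulary file could be built along these lines (a sketch only; the file names and the dataset format are placeholders):

```python
import codecs

vocab = set()
with codecs.open('glove_vocab.txt', 'r', 'utf-8') as f_in:
    vocab.update(line.strip() for line in f_in)

# Each dataset line is assumed to be "x<TAB>y<TAB>label".
for dataset_file in ['train.tsv', 'val.tsv', 'test.tsv']:
    with codecs.open(dataset_file, 'r', 'utf-8') as f_in:
        for line in f_in:
            x, y = line.strip().split('\t')[:2]
            vocab.update([x, y])

with codecs.open('combined_vocab.txt', 'w', 'utf-8') as f_out:
    for term in sorted(vocab):
        f_out.write(term + '\n')
```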

wt123u commented 6 years ago

I saw an issue reporting that much larger path files are generated with the new versions of spacy, so I used spacy 0.99. It's difficult for me to process such a large corpus; it is about 800 GB after parse_wikipedia.py. So I used the provided resources V1 and V2 to run experiments, but the results were unsatisfactory.

About the same term pair generating different paths: there are paths with satellites. I also discovered that there are differences between HypeNET and LexNET; LexNET is more comprehensive than HypeNET.

You mean that many term pairs in the datasets are not in the general vocabulary file, so if I use resource V1 or V2 to run experiments the results are not comparable. Because the processed corpus is so large, the memory fills up and my process keeps getting killed by the server when running the script.
In other words, the problem was caused by the corpus, so I need to generate the right corpus. I will add the dataset vocabulary to the general vocabulary file and try again. Thank you so much.

wt123u commented 6 years ago

It seems something is wrong in LexNET/corpus/parse_wikipedia.py at lines 129 and 137. If hx == [] it means that x is the root, because heads(x) returns the path from the root down to x.head. So I made the changes shown in the screenshot: [screenshot: inkedlexnet_parse_wikipedia]
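
To make the root case concrete, this is a minimal sketch of what I mean by heads(x), not the repository's actual implementation:

```python
import spacy

def heads(token):
    """Ancestors of token, from the sentence root down to token.head."""
    chain = []
    while token.head is not token:  # in spaCy, the root is its own head
        token = token.head
        chain.append(token)
    return chain[::-1]

nlp = spacy.load('en')
doc = nlp(u'The cat chased the mouse.')
root = [t for t in doc if t.head is t][0]
print heads(root)    # [] -- so hx == [] does mean x is the root
print heads(doc[1])  # the chain from the root down to the head of "cat"
```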

vered1986 commented 6 years ago

Correct, LexNET's resource V1 and V2 are processed for the GloVe vocabulary. They don't contain most of HypeNET's vocabulary, so they are expected to perform poorly on the HypeNET dataset.

I didn't understand the new issue with heads(x). If you think there is a bug, please open a separate issue with a minimal example that shows the problem and I'll fix it. Thanks.