vered1986 / HypeNET

Integrated path-based and distributional method for hypernymy detection

parse_wikipedia.py produces a very large file with a newer version of spacy #2

Closed vered1986 closed 7 years ago

vered1986 commented 7 years ago

The original corpus in the paper was processed with spaCy version 0.99. Using a newer spaCy version produces a much larger triplet file (over 11 TB, compared to ~900 GB for the original). For now, the possible solutions are:

  1. Use spaCy version 0.99, installed with:

     pip install spacy==0.99
     python -m spacy.en.download all --force

  2. Limit parse_wikipedia.py to a specific vocabulary, as in LexNET (a sketch follows below).
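For option 2, here is a minimal sketch of the kind of vocabulary filter involved. It is illustrative only, not the actual LexNET code: the function names and the one-term-per-line vocabulary file format are assumptions.

```python
# Illustrative vocabulary filter, not the actual LexNET implementation.
# Assumes a plain-text vocabulary file with one term per line, and that the
# extraction step yields (x, path, y) string triples.

def load_vocabulary(vocab_file):
    """Read the allowed terms into a set for fast membership tests."""
    with open(vocab_file, encoding='utf-8') as f:
        return {line.strip().lower() for line in f if line.strip()}

def filter_triples(triples, vocabulary):
    """Keep only the triples whose x and y terms are both in the vocabulary."""
    for x, path, y in triples:
        if x.lower() in vocabulary and y.lower() in vocabulary:
            yield x, path, y
```

Restricting the pairs this way bounds the output size by the vocabulary rather than by everything the parser extracts from Wikipedia.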

I'm working on figuring out what changed in the newer spaCy version, and on writing a memory-efficient version of parse_wikipedia.py, in case the older spaCy version is the buggy one and the number of paths should in fact be much larger.

Thanks @christos-c for finding this bug!

vered1986 commented 7 years ago

Resolved in v2.

The newer version of spaCy extracts many more noun chunks, and in addition the old version of HypeNET had a minor bug: it extracted the noun chunks of the entire paragraph for each sentence, rather than per sentence. Together, these produce many duplicate triples. This has been resolved in v2.
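For illustration, a minimal sketch of the difference, written against a recent spaCy API rather than the 0.99 one from the paper (the pipeline name and the example text are placeholders):

```python
import spacy

# Any English pipeline with a parser; en_core_web_sm is just an example.
nlp = spacy.load("en_core_web_sm")
paragraph_text = "Cats are small mammals. Dogs are loyal animals."
doc = nlp(paragraph_text)

# Buggy pattern: the noun chunks of the *whole* paragraph are collected again
# for every sentence, so the same pairs get emitted once per sentence.
buggy = [(sent.text, [nc.text for nc in doc.noun_chunks]) for sent in doc.sents]

# Fixed pattern: only the noun chunks that fall inside the current sentence.
fixed = [(sent.text, [nc.text for nc in sent.noun_chunks]) for sent in doc.sents]
```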

gossebouma commented 7 years ago

I was using the parse_wikipedia.py version included in LexNET and noticed that, for some (x, y) pairs that occur in only a single sentence in my test corpus, multiple identical triples are sometimes added. It seems to be related to (the treatment of) coordination. Simply filtering duplicate triples per sentence solves the problem and leads to an approximately 15-20% reduction in the triples file.
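A sketch of that per-sentence filter (not the actual code I used; the helper name is made up) can be as simple as a set over the sentence's triples:

```python
def dedupe_triples_per_sentence(sentence_triples):
    """Drop identical (x, path, y) triples extracted more than once from the
    same sentence, e.g. via coordination."""
    seen = set()
    for triple in sentence_triples:
        if triple not in seen:
            seen.add(triple)
            yield triple
```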

vered1986 commented 7 years ago

Thanks for the input!

I think I know how to fix it, but it might take me a while until I have time to check it, since I'm away from the office. If you have checked the fix and it worked well, would you mind creating a pull request, or writing the changes you made to the code here so I can push them? Thank you very much!