mickeysjm / HiExpan

The source code for the automatic taxonomy construction method HiExpan, published in KDD 2018
GNU General Public License v3.0

better positions for extracting skip gram feature? #6

Open JieyuZ2 opened 5 years ago

JieyuZ2 commented 5 years ago

Hi Jiaming,

In the code for extracting skip-gram features, https://github.com/mickeystroller/HiExpan/blob/master/src/featureExtraction/extractSkipGramFeature.py, the possible skip-gram positions are set to [(-1, 1), (-2, 1), (-3, 1), (-1, 3), (-2, 2), (-1, 2)] (line 30). However, I found that when the center word is the first word of a sentence, the position effectively becomes (0, 1) instead of (-1, 1), since there is no word before the center word. So maybe we should add positions like (0, 1) and (0, 2). Otherwise, some entities will have the "a _ problem" feature but not the "_ problem" feature, which may hurt if "_ problem" becomes an important feature later. Thanks!
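To make the mismatch concrete, here is a minimal sketch of the behavior I mean. It is not the repository's code: the function name `skipgrams`, the `_` placeholder, and the clamping of the left window at the sentence start are all assumptions made for illustration.

```python
# Hypothetical sketch, not the code in extractSkipGramFeature.py.
PLACEHOLDER = "_"  # assumed stand-in for the entity slot

# The six positions quoted above, plus the two proposed additions.
BASE = [(-1, 1), (-2, 1), (-3, 1), (-1, 3), (-2, 2), (-1, 2)]
EXTENDED = BASE + [(0, 1), (0, 2)]


def skipgrams(tokens, start, end, positions):
    """Return the set of skip-gram strings around the span tokens[start:end + 1].

    Each position is (left, right): `left` is a non-positive offset giving
    how many tokens of left context to take, `right` is how many tokens of
    right context to take.
    """
    feats = set()
    for left, right in positions:
        # Assumed behavior: the left window is cut off at the sentence start,
        # so a sentence-initial entity gets an empty left context.
        left_ctx = tokens[max(0, start + left):start]
        right_ctx = tokens[end + 1:end + 1 + right]
        feats.add(" ".join(left_ctx + [PLACEHOLDER] + right_ctx))
    return feats


mid = "it is a hard problem".split()   # entity "hard" at index 3
first = "hard problem ahead".split()   # entity "hard" at index 0

print(skipgrams(first, 0, 0, BASE))    # contains "_ problem"
print(skipgrams(mid, 3, 3, BASE))      # contains "a _ problem", not "_ problem"
print(skipgrams(mid, 3, 3, EXTENDED))  # now also contains "_ problem"
```

Under these assumptions, the sentence-initial mention picks up "_ problem" only because its left context is empty, while the mid-sentence mention gets it only once (0, 1) is added to the position list.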

Best, Jieyu

mickeysjm commented 5 years ago

Thanks for this comment. I initially chose these six skipgrams in order to roughly align with the existing literature. You can definitely change to other positions, and I think your proposal is very reasonable. You could run a comparative analysis, and I look forward to seeing some empirical results. Thanks.