slanglab / phrasemachine

Quickly extract multi-word phrases from a corpus
http://slanglab.cs.umass.edu/phrasemachine/
MIT License
incorrect counts (for a longish text) #11

Closed jeremybmerrill closed 7 years ago

jeremybmerrill commented 7 years ago

Hi!

I'm giving this a try for keyphrase extraction, at @AbeHandler's recommendation. The results are really promising on the sentences I've tested it with; as an erstwhile linguistics major, I really like the application of linguistics to NLP tasks.

However, I'm getting some sort of funny results applying the get_phrases method to multi-sentence texts. Consider this (contrived) example:

>>> phrasemachine.get_phrases("Social security is a law. Gravity is one too.")
{'counts': Counter({'law. gravity': 2, 'social security': 2}), 'num_tokens': 20}

I'm a bit puzzled by this. Is it possible that it's finding phrases in the text once per sentence in the text?

>>> phrasemachine.get_phrases("Social security is a law. Gravity is one too. Cheeseburgers are tasty.")
{'counts': Counter({'law. gravity': 3, 'one too.': 3, 'social security': 3, 'one too. cheeseburgers': 3, 'too. cheeseburgers': 3}), 'num_tokens': 39}

The expected result would have a count of only 1 for 'social security' and, ideally, no phrases that span the end of a sentence.

AbeHandler commented 7 years ago

Hi Jeremy,

thanks for bringing this up and thanks for trying out the software.

tl;dr: $ pip install -U phrasemachine should fix it. If not, please let me know.

Long boring version:

This seems to have been a problem with the pip packaging. I added a unit test to check for your example and was not able to replicate the bug locally. (For record keeping, to run the new test, do $ cd phrasemachine; git pull; pytest tests/unittests.py.) Because I did not see the problem locally, I pushed a new version of the package to pip, which seems to fix it. So as far as I can tell, the cause of the bug was that pip had an old version of the file.
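For illustration only (this is not the test in tests/unittests.py), a check for the two symptoms reported above could be sketched roughly as below. The helper name check_phrase_counts is hypothetical, and the sketch assumes each returned phrase appears verbatim in the lowercased input text, which may not hold for every tokenization.

```python
import re
from collections import Counter


def check_phrase_counts(text, result):
    """Return a list of problems found in a get_phrases-style result dict.

    Hypothetical helper: assumes phrases appear verbatim in text.lower().
    """
    problems = []
    lowered = text.lower()
    for phrase, count in result["counts"].items():
        # Sentence-final punctuation followed by more words means the
        # phrase crosses a sentence boundary.
        if re.search(r"[.!?]\s+\S", phrase):
            problems.append(f"{phrase!r} spans a sentence boundary")
        # The reported count should not exceed the number of literal
        # (non-overlapping) occurrences in the text.
        occurrences = lowered.count(phrase)
        if count > occurrences:
            problems.append(
                f"{phrase!r} counted {count} times but occurs {occurrences} times"
            )
    return problems


# The output reported in this issue, pasted in as a literal dict.
reported = {
    "counts": Counter({"law. gravity": 2, "social security": 2}),
    "num_tokens": 20,
}
text = "Social security is a law. Gravity is one too."
for problem in check_phrase_counts(text, reported):
    print(problem)
```

Run against the output pasted in the original report, this flags both the doubled counts and the phrase that crosses the sentence boundary; an empty list would mean neither symptom is present.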

Thanks again for pointing this out.

jeremybmerrill commented 7 years ago

@AbeHandler cool, works perfect now. Thanks!