slanglab / phrasemachine

Quickly extract multi-word phrases from a corpus
http://slanglab.cs.umass.edu/phrasemachine/
MIT License

Unexpected behavior with custom regex + pronouns from Ark #21

Open AbeHandler opened 5 years ago

AbeHandler commented 5 years ago

This is an odd corner case, but worth noting and perhaps fixing. If you use the library with the ARK tagger, pronouns are tagged "O".

phrasemachine also uses "O" (i.e. other) internally to mark tokens whose tags are not in the coarsemap, so a custom regex that involves the pronoun tag also matches those unrelated tokens.

import phrasemachine

tokens = "I drive a red car".split()
postags = "P V D A N".split()

# Custom regex intended to match pronouns (ARK tag "O")
pronouns = "O"
phrasemachine.get_phrases(tokens=tokens, postags=postags, regex=pronouns, minlen=1)["counts"]

Counter({'drive': 1})
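
The collision can be reproduced without the library. The following is a minimal sketch, not phrasemachine's actual code: `COARSEMAP` is a hypothetical, abbreviated mapping and `coarsen` is an illustrative helper. The only behavior taken from the description above is that tags missing from the map become "O".

```python
import re

# Hypothetical, abbreviated coarse map (an assumption for illustration;
# the real phrasemachine map covers many more tags).
COARSEMAP = {"A": "A", "N": "N", "P": "P", "D": "D"}

def coarsen(postags, other="O"):
    """Map each tag to a one-character coarse class; unknown tags become `other`."""
    return "".join(COARSEMAP.get(tag, other) for tag in postags)

tokens = "I drive a red car".split()
postags = "P V D A N".split()

# "V" is not in the map, so it is rewritten to "O" before the regex runs.
coarse = coarsen(postags)

# A user regex meant for pronouns now matches the verb position instead.
matches = [tokens[m.start():m.end()] for m in re.finditer("O", coarse)]
print(coarse, matches)
```

Under these assumptions the coarse string is `PODAN`, and the only match is the token `drive`, which mirrors the counter shown above.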

I think an easy fix is to change "O" to another, rarer character internally. Another option (better?) is just not to fix this.
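
A rough sketch of the first option, under stated assumptions: `COARSEMAP`, `coarsen`, and the sentinel `OTHER` are hypothetical names, and "_" is only assumed to be unused as a tag in the ARK and PTB tagsets. This is not how phrasemachine is implemented; it only illustrates why a rarer internal marker avoids the clash.

```python
import re

# Hypothetical, abbreviated coarse map (assumption for illustration only).
COARSEMAP = {"A": "A", "N": "N", "P": "P", "D": "D"}

# Assumed sentinel for "other"; chosen because it is not expected to be
# a tag in either tagset.
OTHER = "_"

def coarsen(postags):
    """Map each tag to its coarse class; unknown tags become the sentinel."""
    return "".join(COARSEMAP.get(tag, OTHER) for tag in postags)

postags = "P V D A N".split()
coarse = coarsen(postags)

# The user's "O" regex no longer matches tokens that were merely unmapped.
spurious = re.findall("O", coarse)
print(coarse, spurious)
```

With the sentinel in place, the unmapped verb is marked "_" and the "O" regex finds nothing in this sentence. Matching real pronouns would still require the "O" tag to be mapped to some coarse class, which this sketch does not attempt.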

brendano commented 5 years ago

Yes, that's always been the case. I thought I wrote a mapping system for phrasemachine that mapped both the ARK and PTB tagsets to a standardized coarse tagset, which solves this problem. If it's not in the python version, maybe it's in the R version or our earlier research versions?

brendano commented 5 years ago

Oh sorry, I'm misunderstanding the question; never mind.