slanglab / phrasemachine

Quickly extract multi-word phrases from a corpus
http://slanglab.cs.umass.edu/phrasemachine/
MIT License

ARK-style twitter tags and direct universal tags #5

Open brendano opened 7 years ago

brendano commented 7 years ago

make it work for Twitter. don't bother wrapping the ARK tagger; just support calling get_phrases(pos=..., tokens=...) on pre-tagged input.

just take the bare one-character tags (Gimpel et al. 2011), so there's no need for the Coarse* conversion layer that the old openfst/foma/pyfst version had. and while we're at it, why not accept the all-caps Petrov tags directly too? hopefully none of this creates naming conflicts between the tag systems.
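a quick way to check the naming-conflict question is to intersect the two tag inventories. the sketch below is not phrasemachine code. the tag lists are written out from memory of Gimpel et al. (2011) and Petrov et al. (2012), and should be verified against the papers. the names ARK_TAGS, UNIVERSAL_TAGS and tagset_conflicts are made up for illustration.

```python
# Tag inventories as recalled from Gimpel et al. (2011) and Petrov et al. (2012).
# ASSUMPTION: these lists are complete and exact; verify before relying on them.
ARK_TAGS = {
    "N", "O", "^", "S", "Z", "V", "L", "M", "A", "R", "!", "D", "P",
    "&", "T", "X", "Y", "#", "@", "~", "U", "E", "$", ",", "G",
}
UNIVERSAL_TAGS = {
    "NOUN", "VERB", "ADJ", "ADV", "PRON", "DET", "ADP", "NUM", "CONJ",
    "PRT", ".", "X",
}


def tagset_conflicts(a, b):
    """Return the tag names that appear in both tagsets, sorted."""
    return sorted(set(a) & set(b))


# Hypothetical helper, not part of phrasemachine.
conflicts = tagset_conflicts(ARK_TAGS, UNIVERSAL_TAGS)
print(conflicts)
```

with the inventories as written here, the only shared name is X: the existential/predeterminer tag on the ARK side and the catch-all "other" tag on the universal side. if neither ends up mapped to a class the phrase grammar uses, the overlap may be harmless, but that depends on how the coarsening is defined.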

backburner: see what the NLTK tagset conversion systems look like now (@nschneid submitted something a while back)

brendano commented 7 years ago

more TODO: coarsen_POS_tags.R also needs to be updated. current Coarse* inputs won't do anything, since these codepaths never normalize to Coarse*. only the old pre-openfst preprocessor did that, and we've ditched it.

more TODO: write bilingual tests for POS coarsening

brendano commented 6 years ago

need to look into: NLTK has some tagset conversion methods now, see http://www.nltk.org/_modules/nltk/tag/mapping.html