Open brendano opened 7 years ago
more TODO: coarsen_POS_tags.R also needs to be updated. current Coarse* inputs won't do anything, since these code paths never normalize to Coarse*. only the old pre-openfst prepreprocessor did that, and we've ditched it.
more TODO: write bilingual tests for POS coarsening
need to look into: NLTK has some tagset conversion methods now http://www.nltk.org/_modules/nltk/tag/mapping.html
make it work for twitter. don't bother wrapping the ARK tagger; just support calling it as
get_phrases(pos=..., tokens=...)
just take the bare one-character tags (Gimpel et al 2011), so there's no need for the Coarse* conversion layer the old openfst/foma/pyfst version had. and while we're at it, why not use the all-caps Petrov tags directly too? hopefully there are no tag system naming conflicts with all this?
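on the naming-conflict question, a quick check is to intersect the two tag inventories. rough sketch below. the tag lists are typed in from memory of Gimpel et al 2011 and the Petrov universal tagset, so treat them as assumptions and verify against the papers. `tag_conflicts` is a made-up helper name, not something in this repo.

```python
# Tag inventories as I understand them (assumption: verify against the papers).
# Gimpel et al. 2011 one-character Twitter tags.
GIMPEL_TAGS = {
    "N", "O", "^", "S", "Z", "V", "A", "R", "!", "D", "P", "&", "T",
    "X", "#", "@", "~", "U", "E", "$", ",", "G", "L", "M", "Y",
}

# Petrov et al. universal tags (all-caps, plus "." for punctuation).
PETROV_TAGS = {
    "NOUN", "VERB", "ADJ", "ADV", "PRON", "DET",
    "ADP", "NUM", "CONJ", "PRT", ".", "X",
}


def tag_conflicts(tagset_a, tagset_b):
    """Return the sorted list of tag names that appear in both tagsets.

    Hypothetical helper: any name returned here is ambiguous if both
    tagsets are accepted as input without saying which one is in use.
    """
    return sorted(set(tagset_a) & set(tagset_b))


if __name__ == "__main__":
    print(tag_conflicts(GIMPEL_TAGS, PETROV_TAGS))
```

with these lists the only overlap is `X`, which means "existential there / predeterminers" in the Gimpel set but "other" in the Petrov set. if that holds up, accepting both tagsets through the same `pos=` argument needs some way to disambiguate `X`.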
backburner: see what the nltk tagset conversion systems are now (@nschneid submitted something a while back)