Include some pre-packaged NLP tools

amir-zeldes commented 6 years ago

E.g., make a built-in tokenizer available directly, rather than only through an external REST API.

lgessler commented 6 years ago

NLTK has several tokenizers that we could let users choose from with a line in the config.
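To make the config idea concrete, here is a minimal sketch. The `[nlp]` section, the `tokenizer` key, and the short names in `TOKENIZERS` are hypothetical and not part of GitDox's current config; the class names on the right are existing tokenizers in `nltk.tokenize`. NLTK is imported lazily, so it is only required once a tokenizer is actually instantiated.

```python
import configparser

# Hypothetical short names mapped to tokenizer classes in nltk.tokenize
TOKENIZERS = {
    "whitespace": "WhitespaceTokenizer",
    "wordpunct": "WordPunctTokenizer",
    "treebank": "TreebankWordTokenizer",
}


def tokenizer_class_name(config_text):
    """Read the (assumed) [nlp] tokenizer setting and return the NLTK class name."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    choice = parser.get("nlp", "tokenizer", fallback="whitespace")
    if choice not in TOKENIZERS:
        raise ValueError("Unknown tokenizer in config: %s" % choice)
    return TOKENIZERS[choice]


def load_tokenizer(config_text):
    """Instantiate the configured tokenizer (requires NLTK to be installed)."""
    import nltk.tokenize as nltk_tokenize  # lazy import: only needed here
    return getattr(nltk_tokenize, tokenizer_class_name(config_text))()
```

With a config containing `tokenizer = wordpunct` under `[nlp]`, `load_tokenizer(...).tokenize(text)` would then run NLTK's `WordPunctTokenizer` on the text.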

amir-zeldes commented 6 years ago

One issue with NLTK is that its tokenizers do not preserve XML: if users need to be able to transform data to spreadsheet mode, we need a tokenizer that produces TT-SGML (or we need to offer other ways of transforming to spreadsheets). The TreeTagger tokenizer does this, but it is written in Perl (this is what GU GitDox currently uses via a service call). I recently ported this tokenizer to Python, though:

https://github.com/amir-zeldes/HebPipe/blob/master/lib/whitespace_tokenize.py

This could be a candidate for a generic tokenizer that preserves XML, outputs TT format, and lets you plug in different abbreviation files so that language-specific abbreviations are not split.
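For illustration only, here is a rough sketch of what "XML-preserving, TT-style output, pluggable abbreviations" could look like. This is not the code from the linked `whitespace_tokenize.py` and is far simpler than the TreeTagger tokenizer; the function name, the punctuation sets, and passing abbreviations as a Python set (rather than loading them from a file) are all assumptions.

```python
import re

# Anything that looks like an XML/SGML tag is kept intact on its own line
TAG = re.compile(r"(<[^<>]+>)")

# Punctuation split off the start and end of a word (assumed sets)
OPENERS = "([{\"'"
CLOSERS = ")]}\"'.,;:!?"
EDGES = re.compile(
    "^([%s]*)(.*?)([%s]*)$" % (re.escape(OPENERS), re.escape(CLOSERS))
)


def tokenize(text, abbreviations=frozenset()):
    """Return one token or tag per line, leaving listed abbreviations unsplit."""
    lines = []
    for chunk in TAG.split(text):
        if TAG.fullmatch(chunk):
            lines.append(chunk)  # tag: pass through untouched
            continue
        for word in chunk.split():
            if word in abbreviations:
                lines.append(word)  # e.g. "Dr." stays in one piece
                continue
            lead, core, trail = EDGES.match(word).groups()
            lines.extend(lead)  # each leading punctuation mark is a token
            if core:
                lines.append(core)
            lines.extend(trail)  # each trailing punctuation mark is a token
    return "\n".join(lines)
```

Calling `tokenize('<p>Dr. Smith arrived, e.g. today.</p>', {"Dr.", "e.g."})` puts `<p>` and `</p>` on their own lines, keeps `Dr.` and `e.g.` whole, and splits the comma and final period into separate tokens. Swapping in a different abbreviation set is how the language-specific behavior would be configured in this sketch.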

gucorpling / gitdox

Include some pre-packaged NLP tools #95