Open amir-zeldes opened 6 years ago
NLTK has several tokenizers that we could let users choose from with a line in the config.
One issue with NLTK is that it's not XML-preserving: if users need to be able to transform data to spreadsheet mode, we need a tokenizer that produces TT-SGML (or we offer different ways of transforming to spreadsheets). The TreeTagger tokenizer does this, but it is written in Perl (this is what GU GitDox currently uses via a service call). I recently ported this tokenizer to Python here:
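
A rough sketch of what the config-driven choice could look like. The `[tokenizer]` section, the `name` key, and the `TOKENIZERS` registry are all hypothetical names for illustration, not anything GitDox has today; the three NLTK classes listed are ones that need no extra model downloads.

```python
import configparser
import importlib

# Hypothetical registry: config value -> "module:ClassName".
# The NLTK classes are only imported when a tokenizer is actually requested.
TOKENIZERS = {
    "whitespace": "nltk.tokenize:WhitespaceTokenizer",
    "wordpunct": "nltk.tokenize:WordPunctTokenizer",
    "treebank": "nltk.tokenize:TreebankWordTokenizer",
}


def tokenizer_name(config_text, default="whitespace"):
    """Read the tokenizer choice from an INI-style config string."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # Fall back to the default if the section or the key is missing.
    name = parser.get("tokenizer", "name", fallback=default)
    if name not in TOKENIZERS:
        raise ValueError("unknown tokenizer: %s" % name)
    return name


def load_tokenizer(name):
    """Instantiate the chosen tokenizer (requires NLTK to be installed)."""
    module_name, class_name = TOKENIZERS[name].split(":")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)()
```

With this, a config containing `[tokenizer]` and `name = wordpunct` would select `WordPunctTokenizer`, and `load_tokenizer(...).tokenize(text)` would return the token list.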
https://github.com/amir-zeldes/HebPipe/blob/master/lib/whitespace_tokenize.py
This could be a candidate for a generic tokenizer that preserves XML, outputs TT format, and lets you plug in different abbreviation files so that language-specific abbreviations are not split.
E.g., make a built-in tokenizer that can be called directly, rather than through an external REST API.
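
To make the idea concrete, here is a heavily simplified sketch of an XML-preserving tokenizer with one token or tag per line. It is not the code from the linked port and not TreeTagger's actual algorithm; the function names, the punctuation sets, and passing abbreviations as a plain set (rather than loading a file) are assumptions for illustration.

```python
import re

# A tag is anything between angle brackets; the capture group makes
# re.split keep the tags in its output.
TAG_RE = re.compile(r"(<[^<>]+>)")
LEADING = "([{\"'"
TRAILING = ")]}\"'.,;:!?"


def split_word(word, abbreviations):
    """Peel punctuation off a whitespace-delimited word, keeping abbreviations whole."""
    if word in abbreviations:
        return [word]
    prefix, suffix = [], []
    while len(word) > 1 and word[0] in LEADING:
        prefix.append(word[0])
        word = word[1:]
    while len(word) > 1 and word[-1] in TRAILING and word not in abbreviations:
        suffix.insert(0, word[-1])
        word = word[:-1]
    return prefix + [word] + suffix


def tokenize(text, abbreviations=frozenset()):
    """Return one token or XML tag per line, leaving tags untouched."""
    lines = []
    for chunk in TAG_RE.split(text):
        if TAG_RE.fullmatch(chunk):
            lines.append(chunk)  # tags pass through unchanged
        else:
            for word in chunk.split():
                lines.extend(split_word(word, abbreviations))
    return "\n".join(lines)
```

For example, `tokenize('<p id="1">Dr. Smith arrived (late).</p>', {"Dr."})` keeps both tags on their own lines, leaves `Dr.` intact, and splits `(late).` into `(`, `late`, `)`, `.`. Swapping in a different abbreviation set is what would make it language-specific.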