gucorpling / gitdox

Repository for GitDOX, a GitHub Data-storage Online XML editor
Apache License 2.0

Include some pre-packaged NLP tools #95

Open amir-zeldes opened 6 years ago

amir-zeldes commented 6 years ago

e.g. provide a built-in tokenizer that can be called directly rather than through an external REST API

lgessler commented 6 years ago

NLTK has several tokenizers; we could let users choose one with a line in the config.
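A rough sketch of what that could look like, assuming an INI-style config with a hypothetical `[nlp]` section and `tokenizer` key (GitDOX's real config layout may differ). The short names in `NLTK_TOKENIZERS` are made up for illustration; the class names they map to are existing classes in `nltk.tokenize`. NLTK is imported lazily so it stays an optional dependency.

```python
import configparser
import importlib

# Hypothetical short names mapped to tokenizer classes in nltk.tokenize.
NLTK_TOKENIZERS = {
    "whitespace": "WhitespaceTokenizer",
    "wordpunct": "WordPunctTokenizer",
    "treebank": "TreebankWordTokenizer",
}


def tokenizer_name_from_config(config_text):
    """Read the tokenizer choice from INI text; section and key names are assumptions."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return parser.get("nlp", "tokenizer", fallback="whitespace")


def load_tokenizer(name):
    """Instantiate the chosen NLTK tokenizer, importing NLTK only when needed."""
    if name not in NLTK_TOKENIZERS:
        raise ValueError("Unknown tokenizer: %s" % name)
    module = importlib.import_module("nltk.tokenize")
    return getattr(module, NLTK_TOKENIZERS[name])()
```

With NLTK installed, `load_tokenizer(tokenizer_name_from_config(text)).tokenize("some text")` would return a list of token strings.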

amir-zeldes commented 6 years ago

One issue with NLTK is that it does not preserve XML: if users need to be able to transform data to spreadsheet mode, we need a tokenizer that produces TT-SGML (or we need to offer other ways of transforming to spreadsheets). The TreeTagger tokenizer does this, but it is written in Perl (this is what GU GitDox currently uses via a service call). However, I recently ported this tokenizer to Python here:

https://github.com/amir-zeldes/HebPipe/blob/master/lib/whitespace_tokenize.py

This could be a candidate for a generic tokenizer: it preserves XML, outputs TT format, and lets you plug in different abbreviation files, so that language-specific abbreviations are not split.
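To illustrate the idea only (this is not the code in `whitespace_tokenize.py`, just a minimal sketch under simplifying assumptions): XML tags are passed through untouched, text is split on whitespace, punctuation is peeled off word edges, and anything found in an abbreviation set is left whole. The output has one token or tag per line, as in TT-SGML. The function names and the `PUNCT` inventory are hypothetical, and the tag pattern assumes no `<` or `>` inside attribute values.

```python
import re

# Naive tag pattern; assumes no angle brackets inside attribute values.
TAG_RE = re.compile(r"(<[^<>]+>)")
# Hypothetical punctuation inventory to split off word edges.
PUNCT = "\"'()[]{}.,;:!?"


def split_chunk(chunk, abbreviations):
    """Split edge punctuation off one whitespace-delimited chunk."""
    if chunk in abbreviations:
        return [chunk]
    prefix, suffix = [], []
    while chunk and chunk[0] in PUNCT:
        prefix.append(chunk[0])
        chunk = chunk[1:]
    # Stop peeling as soon as what remains is a known abbreviation.
    while chunk and chunk not in abbreviations and chunk[-1] in PUNCT:
        suffix.insert(0, chunk[-1])
        chunk = chunk[:-1]
    return prefix + ([chunk] if chunk else []) + suffix


def tokenize_to_tt(text, abbreviations=frozenset()):
    """Return one token or XML tag per line, keeping tags intact."""
    lines = []
    for part in TAG_RE.split(text):
        if TAG_RE.fullmatch(part):
            lines.append(part)  # tag: keep as a single line
        else:
            for chunk in part.split():
                lines.extend(split_chunk(chunk, abbreviations))
    return "\n".join(lines)
```

For example, `tokenize_to_tt('<s type="decl">Dr. Smith went home.</s>', {"Dr."})` yields the lines `<s type="decl">`, `Dr.`, `Smith`, `went`, `home`, `.`, `</s>`. The abbreviation set could be loaded from a per-language file with one abbreviation per line.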