tarsqi / ttk

Tarsqi Toolkit
Apache License 2.0

Split preprocessor #12

Closed marcverhagen closed 7 years ago

marcverhagen commented 8 years ago

Maybe split the preprocessor into separate components: TOKENIZER, TAGGER, LEMMATIZER and CHUNKER. In the future, input to TTK may already have tokens and tags, but probably no chunks. A split would also be useful for the i2b2 data, which come tokenized but not tagged or chunked.

marcverhagen commented 7 years ago

Done in https://github.com/tarsqi/ttk/commit/937cba99a99ac884ece303a392464abd61fd8275 by defining wrappers for TOKENIZER, TAGGER and CHUNKER. There is no use case yet for a separate lemmatizer, so for now lemmatization is included in the tagger. The old PREPROCESSOR component is still around and is still the default. I would like to test this a bit more, so I will keep the issue open for a while.
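
For readers following along, here is a minimal sketch of what separately selectable preprocessing components could look like. It is not the code from the commit above: the `Token` class, the toy `tokenizer`, `tagger` and `chunker` functions, the `COMPONENTS` table and `run_pipeline` are all hypothetical names made up for illustration, and the tagging and chunking rules are deliberately trivial. Only the component names TOKENIZER, TAGGER and CHUNKER, and the fact that lemmatization lives in the tagger, come from the thread.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Union


@dataclass
class Token:
    """Hypothetical token record; each stage fills in its own field."""
    text: str
    tag: Optional[str] = None
    lemma: Optional[str] = None
    chunk: Optional[str] = None


DETERMINERS = {"the", "a", "an"}
NOUN_GROUP_TAGS = {"DT", "NN"}


def tokenizer(text: str) -> List[Token]:
    # Toy tokenizer: split on whitespace only.
    return [Token(word) for word in text.split()]


def tagger(tokens: List[Token]) -> List[Token]:
    # Toy tagger. It also sets the lemma, mirroring the decision in the
    # thread to keep lemmatization inside the tagger for now.
    for token in tokens:
        lowered = token.text.lower()
        if lowered in DETERMINERS:
            token.tag = "DT"
        elif lowered.endswith("ed"):
            token.tag = "VBD"
        else:
            token.tag = "NN"
        token.lemma = lowered
    return tokens


def chunker(tokens: List[Token]) -> List[Token]:
    # Toy chunker: consecutive DT/NN tokens form one noun chunk (BIO labels).
    inside = False
    for token in tokens:
        if token.tag in NOUN_GROUP_TAGS:
            token.chunk = "I-NP" if inside else "B-NP"
            inside = True
        else:
            token.chunk = "O"
            inside = False
    return tokens


# Hypothetical registry mapping component names to their wrappers.
COMPONENTS: Dict[str, Callable] = {
    "TOKENIZER": tokenizer,
    "TAGGER": tagger,
    "CHUNKER": chunker,
}


def run_pipeline(data: Union[str, List[Token]], names: List[str]) -> List[Token]:
    # Apply only the requested components, in the order given.
    for name in names:
        data = COMPONENTS[name](data)
    return data
```

With this layout, raw text would go through `run_pipeline(text, ["TOKENIZER", "TAGGER", "CHUNKER"])`, while input that is already tokenized (as with the i2b2 data) could skip the first stage and call `run_pipeline(tokens, ["TAGGER", "CHUNKER"])`.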