nlp-uoregon / trankit

Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing
Apache License 2.0

Prevent from splitting on hyphen when doing tokenization for POS? #46

Open olegpolivin opened 2 years ago

olegpolivin commented 2 years ago

Dear community,

Is it possible to prevent Trankit from splitting words on hyphens for POS tagging? For example, it splits "out-of-print materials" into "out", "-", "of", "-", "print", "materials", and then tags each item separately. A hyphenated word as a whole often has a single POS, but once Trankit splits it, every piece gets its own tag. I then have to find a way to join the pieces back together and choose a single POS, which makes the code quite cumbersome.

Is it possible to just prevent Trankit from doing that?
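
For reference, the "combine them back" workaround described above could look roughly like the sketch below. It is only an illustration, not part of Trankit: `merge_hyphenated` is a hypothetical helper, and it assumes each token is a dict with `text`, `upos`, and `span` (start and end character offsets) keys. It merges pieces only when the hyphen touches both neighbours, and it collects the per-piece tags in `upos_parts` rather than guessing a single POS.

```python
from typing import Dict, List


def merge_hyphenated(tokens: List[Dict]) -> List[Dict]:
    """Re-join tokens that were split on a hyphen with no surrounding spaces.

    Assumed (hypothetical) token shape:
        {"text": str, "upos": str, "span": (start, end)}
    Each returned token has "text", "span", and "upos_parts", the list of
    tags of the pieces it was built from.
    """
    merged: List[Dict] = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        prev = merged[-1] if merged else None
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None

        # A hyphen is "glued" when it directly touches the token before it
        # and the token after it (no whitespace on either side).
        glued = (
            tok["text"] == "-"
            and prev is not None
            and nxt is not None
            and prev["text"] != "-"
            and nxt["text"] != "-"
            and prev["span"][1] == tok["span"][0]
            and tok["span"][1] == nxt["span"][0]
        )

        if glued:
            # Fold the hyphen and the following piece into the previous token.
            prev["text"] += "-" + nxt["text"]
            prev["span"] = (prev["span"][0], nxt["span"][1])
            prev["upos_parts"].append(nxt["upos"])
            i += 2
        else:
            merged.append(
                {
                    "text": tok["text"],
                    "span": tuple(tok["span"]),
                    "upos_parts": [tok["upos"]],
                }
            )
            i += 1
    return merged
```

Choosing one tag from `upos_parts` is still left to the caller, which is exactly the cumbersome part; the helper only restores the original word boundaries.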