nytud / hunlp-GATE

Lang_Hungarian - a GATE plugin containing Hungarian NLP tools as GATE processing resources
GNU General Public License v3.0
8 stars 6 forks source link

Fix ml tokenizer #18

Closed DavidNemeskey closed 7 years ago

DavidNemeskey commented 7 years ago

The ML tokenizer output incorrect results (the last character of each token was separated) if the text started with an empty line (possibly other whitespace configurations could induce the error as well). This PR fixes this bug.

I chose to fix the issue by not trimming whitespaces from the original text. An equally valid solution would be to keep trimming, but make sure that the trimmed text throughout the code. My reasons for the first option were that

DavidNemeskey commented 7 years ago

I will merge this issue at the end of the day if I receive no input until then.

sassbalint commented 7 years ago

I guess, it's OK.