The ML tokenizer output incorrect results (the last character of each token was separated) if the text started with an empty line (possibly other whitespace configurations could induce the error as well). This PR fixes this bug.
I chose to fix the issue by not trimming whitespace from the original text. An equally valid solution would be to keep trimming, but ensure that the trimmed text is used consistently throughout the code. My reasons for the first option were that:
- the QunToken tokenizer also preserves whitespace at the beginning of the text;
- trimming would be unsafe until components later in the chain are modified to not access the original document, only the tokens (see #12).
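To illustrate the failure mode (a minimal sketch; the example text and variable names are hypothetical, not taken from the actual code): if token character offsets are computed on a trimmed copy of the text but later applied to the original document, every span is shifted by the length of the stripped prefix, which is how the token boundaries end up cutting into the wrong characters.

```python
# Hypothetical illustration of the offset mismatch caused by trimming.
original = "\nHello world"          # document starts with an empty line
trimmed = original.lstrip()          # offsets get computed on this copy

# Suppose the tokenizer reports the first token at [0, 5) in the trimmed text.
start, end = 0, 5
print(repr(trimmed[start:end]))      # 'Hello'  -- correct on the trimmed copy
print(repr(original[start:end]))     # '\nHell' -- wrong when applied to the original

# The spans only line up again if they are shifted by the stripped prefix:
shift = len(original) - len(trimmed)
print(repr(original[start + shift:end + shift]))   # 'Hello'
```

Keeping the original text untrimmed avoids the need for this kind of offset bookkeeping in every downstream component.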