The ML tokenizer output incorrect results (the last character of each token was separated) if the text started with an empty line (possibly other whitespace configurations could induce the error as well). This PR fixes this bug.
I chose to fix the issue by not trimming whitespace from the original text. An equally valid solution would be to keep trimming, but ensure that the trimmed text is used consistently throughout the code. My reasons for the first option were that:
- the QunToken tokenizer also preserves whitespace at the beginning of the text;
- trimming would be unsafe until components later in the chain are modified to not access the original document, only the tokens (see #12).
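To illustrate the failure mode (a minimal sketch; the example text and variable names are hypothetical, not taken from the actual code): if token character offsets are computed on a trimmed copy of the text but later applied to the original document, every span is shifted by the length of the stripped prefix, which is how the token boundaries end up cutting into the wrong characters.

```python
# Hypothetical illustration of the offset mismatch caused by trimming.
original = "\nHello world"          # document starts with an empty line
trimmed = original.lstrip()          # offsets get computed on this copy

# Suppose the tokenizer reports the first token at [0, 5) in the trimmed text.
start, end = 0, 5
print(repr(trimmed[start:end]))      # 'Hello'  -- correct on the trimmed copy
print(repr(original[start:end]))     # '\nHell' -- wrong when applied to the original

# The spans only line up again if they are shifted by the stripped prefix:
shift = len(original) - len(trimmed)
print(repr(original[start + shift:end + shift]))   # 'Hello'
```

Keeping the original text untrimmed avoids the need for this kind of offset bookkeeping in every downstream component.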