Open maxtrem opened 5 years ago
Good catch, the tokenizer does not consider '\u2028' to be a newline character. We do not recognize '\u2029' either; we should fix both.
We might even consider adding new escape sequences to SpacesAfter: even though the CoNLL-U documentation states that only LF is used as a line separator, some tools might split on \u202[89] regardless. But maybe not... I will think about it.
Thank you for your reply! Yes, we actually used the UUParser and it does split on '\u2028' and crashes. So escaping would definitely help in that regard.
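To illustrate how a downstream tool can trip over this: in Python, `str.splitlines()` treats U+2028 and U+2029 as line boundaries, whereas splitting on `\n` does not. The sketch below is not UUParser's actual code (whether it reads files this way is an assumption); it only shows how one CoNLL-U token line containing U+2028 can fall apart into several malformed lines.

```python
# One CoNLL-U token line whose FORM and LEMMA contain U+2028 (LINE SEPARATOR).
conllu_line = "17\tout.\u2028What\tout.\u2028what\tPRON\tWP\tPronType=Int\t_\t_\t_\t_\n"

# Splitting on LF only, as the CoNLL-U format prescribes, keeps the line intact.
lf_lines = [line for line in conllu_line.split("\n") if line]

# str.splitlines() also breaks on U+2028 and U+2029, yielding fragments
# that no longer have ten tab-separated columns.
unicode_lines = conllu_line.splitlines()

print(len(lf_lines), [len(line.split("\t")) for line in lf_lines])
print(len(unicode_lines), [len(line.split("\t")) for line in unicode_lines])
```

With LF-only splitting the token keeps all ten columns; with `splitlines()` it becomes three fragments, none of which is a valid CoNLL-U line.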
Thanks for the feedback, escaping it is then :-)
We used the tagger and tokenizer of UDPipe. In some of our files we had the newline character '\u2028', which wasn't recognized as one. This led to errors in other programs in our pipeline, but also to tokenization problems in UDPipe itself. For example:
17 out. What out. what PRON WP PronType=Int _ _ _ _
Here '\u2028' is placed directly after the end of the sentence.
So it would be really cool if you could add this character to the list of newline characters.
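Until the tokenizer handles these characters, one possible workaround is to normalize them before the text reaches UDPipe. This is a minimal sketch, assuming it is acceptable to treat U+2028 and U+2029 as ordinary LF in the input data; `normalize_line_separators` is a hypothetical helper, not part of UDPipe.

```python
# Hypothetical preprocessing helper (not a UDPipe API): maps the Unicode
# line and paragraph separators to LF so they are seen as newlines.
_SEPARATOR_MAP = {0x2028: "\n", 0x2029: "\n"}


def normalize_line_separators(text: str) -> str:
    """Replace U+2028 (LINE SEPARATOR) and U+2029 (PARAGRAPH SEPARATOR) with LF."""
    return text.translate(_SEPARATOR_MAP)


raw = "Watch out.\u2028What now?\u2029Next paragraph."
cleaned = normalize_line_separators(raw)
print(cleaned.split("\n"))
```

The trade-off is that the original characters are lost, so the text cannot be reconstructed exactly from the output; escaping them in SpacesAfter, as discussed above, would avoid that.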