'\u2028' not recognized in SpacesAfter

ufal / udpipe

UDPipe: Trainable pipeline for tokenizing, tagging, lemmatizing and parsing Universal Treebanks and other CoNLL-U files

Mozilla Public License 2.0


'\u2028' not recognized in SpacesAfter #103

Open maxtrem opened 5 years ago

maxtrem commented 5 years ago

We used the tagger and tokenizer of UDPipe. Some of our files contained the line separator character '\u2028', which UDPipe did not recognize as a newline. This led to errors in other programs in our pipeline, and also to tokenization problems in UDPipe itself. For example:

17 out. What out. what PRON WP PronType=Int _ _ _ _

Here '\u2028' is placed directly after the end of the sentence.

So it would be really cool if you could add this character to the list of newline characters.

foxik commented 5 years ago

Good catch, the tokenizer does not consider '\u2028' to be a newline character. Furthermore, we do not recognize '\u2029' either; we should fix both.
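For anyone hitting this in the meantime, one possible workaround is to normalize both characters to LF before the text reaches the tokenizer. This is only a minimal sketch: the function name `normalize_line_separators` and the sample sentence are made up for illustration, and the snippet does not use or reflect any UDPipe API.

```python
import re

# U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR.
_UNICODE_SEPARATORS = re.compile("[\u2028\u2029]")


def normalize_line_separators(text: str) -> str:
    """Replace U+2028/U+2029 with LF so they behave like ordinary newlines.

    Hypothetical helper for preprocessing; not part of UDPipe.
    """
    return _UNICODE_SEPARATORS.sub("\n", text)


# Made-up input with a line separator right after the sentence-final period.
raw = "He went out.\u2028What happened next?"
cleaned = normalize_line_separators(raw)
```

Note that this changes the input text, so character offsets stay the same but the original separator is no longer recoverable from the output.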

We might even consider adding new escape sequences to SpacesAfter. Even though the CoNLL-U documentation states that only LF is used as a line separator, some tools might split on \u202[89] regardless. But maybe not... I will think about it.
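To make the escaping idea concrete, here is a rough sketch of what it could look like. The first table lists the escapes that, as far as I understand the UDPipe manual, are already used in SpacesAfter (`\s`, `\t`, `\r`, `\n`, `\p`, `\\`). The second table is purely hypothetical: the thread does not say which escape sequences would be chosen for U+2028 and U+2029, so `\u2028` and `\u2029` as literal text are an assumption made only for this example.

```python
# Escapes that, to my understanding, UDPipe already documents for SpacesAfter.
_DOCUMENTED_ESCAPES = {
    " ": "\\s",
    "\t": "\\t",
    "\r": "\\r",
    "\n": "\\n",
    "|": "\\p",
    "\\": "\\\\",
}

# HYPOTHETICAL escapes for the two Unicode separators (an assumption for
# illustration; the actual choice is not stated in this thread).
_HYPOTHETICAL_ESCAPES = {
    "\u2028": "\\u2028",
    "\u2029": "\\u2029",
}


def escape_spaces(spaces: str) -> str:
    """Escape a whitespace run so it can be stored in the MISC column."""
    table = {**_DOCUMENTED_ESCAPES, **_HYPOTHETICAL_ESCAPES}
    return "".join(table.get(ch, ch) for ch in spaces)


# The escaped value contains only ASCII, so a tool that splits lines on
# U+2028 would no longer break the CoNLL-U line apart.
misc = "SpacesAfter=" + escape_spaces("\u2028")
```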

maxtrem commented 5 years ago

Thank you for your reply! Yes, we actually used the UUParser and it does split on '\u2028' and crashes. So escaping would definitely help in that regard.
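As an illustration of how a tool can end up splitting on this character without meaning to: in Python, `str.splitlines()` treats U+2028 and U+2029 as line boundaries, while `str.split("\n")` does not. Whether UUParser reads its input this way is an assumption on my part and has not been checked; the sample text is made up.

```python
# Made-up text: one LF-terminated line that contains U+2028 in the middle.
text = "first line\u2028still the first line\nsecond line"

# Splitting strictly on LF, as the CoNLL-U format expects, yields two lines.
lf_lines = text.split("\n")

# str.splitlines() also breaks on U+2028/U+2029, so it yields three lines.
all_lines = text.splitlines()
```

A reader that uses the second form would see a CoNLL-U token line cut in two, with too few columns in each part.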

foxik commented 5 years ago

Thanks for the feedback, escaping it is then :-)