ufal / morphodita

MorphoDiTa: Morphologic Dictionary and Tagger
Mozilla Public License 2.0

Tokenizing URLs redux #5

Closed (dlukes closed 8 years ago)

dlukes commented 8 years ago

URLs now allow non-ASCII characters (as discussed in #4 and fixed in f4d691a, thanks!), but a different problem has appeared -- the http:// prefix is split into separate tokens (as of current master, 5bb38a9):

$ echo 'Na adrese http://www.karaoketexty.cz/plíhal je dostupný...' | ./run_tokenizer --tokenizer czech --output vertical
Na
adrese
http
:
/
/
www.karaoketexty.cz/plíhal
je
dostupný
.
.
.

Perhaps this is in the process of being addressed, in which case don't mind me :)
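
For illustration, the expected behaviour (the scheme stays attached to the rest of the URL, and non-ASCII characters in the path are kept) can be sketched with a small regex tokenizer in Python. This is only a rough sketch under stated assumptions, not MorphoDiTa's actual tokenizer: the pattern `TOKEN_RE` and the function `tokenize` are hypothetical names made up for this example, and the URL rule is deliberately simplistic.

```python
import re

# Hypothetical pattern for illustration only; it is NOT MorphoDiTa's grammar.
# Alternatives are tried left to right, so a URL with a scheme wins over
# the generic word and punctuation rules.
TOKEN_RE = re.compile(
    r"""
    (?:https?|ftp)://[^\s<>]*[^\s<>.,;:!?)]   # URL; must not end in punctuation
    | \w+                                     # word (Unicode-aware in Python 3)
    | [^\w\s]                                 # any other single non-space character
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return the tokens of `text`, keeping URLs with a scheme in one piece."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    sentence = "Na adrese http://www.karaoketexty.cz/plíhal je dostupný..."
    # Print one token per line, similar to the vertical output format.
    print("\n".join(tokenize(sentence)))
```

With this sketch, `http://www.karaoketexty.cz/plíhal` comes out as a single token, while the trailing `...` is still split into three separate `.` tokens.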

foxik commented 8 years ago

Confirming this is really a bug :-) I will post a fix soon.