Tokenizer does not return lowercase tokens when lowercase = True

proycon / python-ucto

This is a Python binding to the tokenizer Ucto. Tokenisation is one of the first steps in almost any Natural Language Processing task, yet it is not always as trivial a task as it appears to be. This binding makes the power of the ucto tokeniser available to Python. Ucto itself is a regular-expression-based, extensible, and advanced tokeniser written in C++ (http://ilk.uvt.nl/ucto).


Tokenizer does not return lowercase tokens when lowercase = True #8

Closed (martijnbentum closed this issue 4 years ago)

martijnbentum commented 4 years ago

When I create the tokenizer with lowercase=True, the output still contains uppercase tokens.

import ucto

t = ucto.Tokenizer("tokconfig-nld", lowercase=True, sentencedetection=False, paragraphdetection=False)
ucto: textcat configured from: /vol/customopt/lamachine.stable/share/ucto/textcat.cfg

z = x.article_set.all()[0]

t.process(z.text)

[str(token) for token in t]

["'", 'oor', 'onze', 'redacteur', 'mr.', 'F.', 'KUITENBROUWER', 'AMSTERDAM',
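The symptom above can be checked, and worked around on an affected version, by post-processing the token strings in plain Python. This is only a minimal sketch: `find_uppercase_tokens` and `lowercase_tokens` are hypothetical helper names, not part of python-ucto, and the sample list is just the visible part of the truncated output reported above.

```python
def find_uppercase_tokens(tokens):
    """Return the string form of every token that still contains an uppercase character."""
    return [str(token) for token in tokens if any(ch.isupper() for ch in str(token))]


def lowercase_tokens(tokens):
    """Return the string form of every token, lowercased in Python (workaround)."""
    return [str(token).lower() for token in tokens]


# Visible part of the output from the report above (the original list is truncated).
reported = ["'", 'oor', 'onze', 'redacteur', 'mr.', 'F.', 'KUITENBROUWER', 'AMSTERDAM']

# Tokens that should not appear when lowercase=True is honoured.
print(find_uppercase_tokens(reported))

# Workaround: lowercase the token strings after tokenisation.
print(lowercase_tokens(reported))
```

Both helpers only rely on `str(token)`, as used in the report, so they should accept the tokens yielded by iterating over the tokenizer as well as plain strings.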

proycon commented 4 years ago

That indeed looks like a clear bug. I'll investigate!

proycon commented 4 years ago

This should now be fixed in v0.5.2!

martijnbentum commented 4 years ago

thanks