This is a Python binding to the tokeniser Ucto. Tokenisation is one of the first steps in almost any Natural Language Processing task, yet it is not always as trivial as it appears. This binding makes the power of the Ucto tokeniser available to Python. Ucto itself is a regular-expression based, extensible, and advanced tokeniser written in C++ (http://ilk.uvt.nl/ucto).
Tokenizer does not return lowercase tokens when lowercase = True #8
When I construct the tokenizer with lowercase=True, the output still contains uppercase tokens.
t = ucto.Tokenizer("tokconfig-nld",lowercase = True,sentencedetection=False,paragraphdetection=False)
ucto: textcat configured from: /vol/customopt/lamachine.stable/share/ucto/textcat.cfg
z = x.article_set.all()[0]
t.process(z.text)
[str(token) for token in t]
["'", 'oor', 'onze', 'redacteur', 'mr.', 'F.', 'KUITENBROUWER', 'AMSTERDAM',
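Until the cause is clear, one possible workaround is to lowercase the token text on the Python side. The sketch below is only an illustration: `lowercase_tokens` is a hypothetical helper, not part of the binding, and it relies solely on `str(token)` returning the token text, as in the session above. The sample list is stand-in data copied from the reported output, so the snippet runs without Ucto installed.

```python
def lowercase_tokens(tokens):
    """Return the text of each token, lowercased in Python.

    Hypothetical helper (not part of the ucto binding). It only assumes
    that str(token) yields the token text, as in the report above.
    """
    return [str(token).lower() for token in tokens]


# Stand-in data taken from the reported output; with the binding,
# the tokenizer object itself would be passed instead of this list.
sample = ["'", 'oor', 'onze', 'redacteur', 'mr.', 'F.', 'KUITENBROUWER', 'AMSTERDAM']
print(lowercase_tokens(sample))
```

In the session above this would be called as `lowercase_tokens(t)` after `t.process(z.text)`. It sidesteps the lowercase option rather than explaining why it has no effect.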