This is a Python binding to the tokeniser Ucto. Tokenisation is one of the first steps in almost any Natural Language Processing task, yet it is not always as trivial as it appears to be. This binding makes the power of the Ucto tokeniser available to Python. Ucto itself is a regular-expression-based, extensible, and advanced tokeniser written in C++ (http://ilk.uvt.nl/ucto).
Hi there, I noticed that `isendofquote` seems to be broken. It looks like a typo on this line:
https://github.com/proycon/python-ucto/blob/65a7f03a92f60fa28e330a5fb735d75230cdbec4/ucto_wrapper.pyx#L29
which should rather be `ENDOFQUOTE`.