andreasbaumann opened 7 years ago
The pattern matching with strusPattern would be more efficient. The word tokenizer extracts sequences of alphabetic characters of any language represented in Unicode (that's the idea; currently it covers a large number of languages). What is or is not a word cannot be decided at the tokenization level for most languages. NLP is the solution here, but it does not yet exist as an integrated part of strus.
It depends too much on grammar and context. Admittedly, I contradict myself here, because a sentence delimiter tokenizer exists, and the same argument applies to it.
The word tokenizer extracts sequences of alphabetic characters of any language. Renaming the tokenizer could be an option, but I do not know a better name.
It separates `don't` into the tokens `don` and `t`. Of course, I can work around this with the `regex` tokenizer...
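To make the behavior concrete, here is a minimal sketch (in Python, not strus's actual C++ implementation) of a tokenizer that, like the word tokenizer described above, extracts maximal runs of Unicode letters, plus a hypothetical pattern illustrating the regex-based workaround. The function names and the workaround pattern are assumptions for illustration, not strus APIs.

```python
import re

def word_tokenize(text):
    """Extract maximal runs of Unicode letters, mimicking the word
    tokenizer's behavior: apostrophes split tokens apart."""
    # [^\W\d_] matches Unicode letters only (excludes digits and underscore)
    return re.findall(r"[^\W\d_]+", text)

def regex_tokenize(text, pattern=r"[^\W\d_]+(?:'[^\W\d_]+)*"):
    """Hypothetical workaround analogous to the regex tokenizer:
    a custom pattern that keeps internal apostrophes inside a token."""
    return re.findall(pattern, text)

print(word_tokenize("don't"))   # ['don', 't']
print(regex_tokenize("don't"))  # ["don't"]
```

This shows why the split happens at the tokenization level: the apostrophe is simply not an alphabetic character, so the tokenizer has no grammatical basis to keep `don't` together.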