patrickfrey / strusAnalyzer

Library for document analysis (segmentation, tokenization, normalization, aggregation) with the goal of producing a set of items that can be inserted into a strus storage. Some functions for analyzing tokens or phrases of a strus query are also provided.
http://www.project-strus.net
Mozilla Public License 2.0

tokenizer 'word' is not logical #47

Open andreasbaumann opened 7 years ago

andreasbaumann commented 7 years ago

It separates `don't` into the tokens `don` and `t`. Of course, I can work around this with the regex tokenizer...
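
For illustration, a minimal sketch in Python of the kind of regex workaround meant here. This is an assumption about the intended pattern, not the actual configuration of strusAnalyzer's regex tokenizer: keep runs of Unicode letters together and allow apostrophes only inside a word.

```python
import re

# Hypothetical pattern for the workaround: [^\W\d_] matches a Unicode
# letter in Python's re module; an apostrophe is accepted only *between*
# letter runs, so "don't" stays one token but a trailing "'" does not.
WORD = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")

def tokenize(text):
    return WORD.findall(text)

print(tokenize("Don't split contractions."))
# ["Don't", 'split', 'contractions']
```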

patrickfrey commented 7 years ago

The pattern matching with strusPattern would be more efficient. The word tokenizer extracts sequences of alphabetic characters of any language represented in Unicode (that is the idea; currently it covers just a lot of languages). What is a word and what is not cannot be decided at the tokenization level for most languages. NLP is the solution here, but it does not yet exist as an integrated part of strus.
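
A rough Python sketch of the behavior described above (not strusAnalyzer's code, just grouping characters by Unicode `str.isalpha()`): every non-letter character, including the apostrophe, ends the current token, which is why `don't` comes out as `don` and `t`.

```python
from itertools import groupby

def word_tokenize(text):
    # Group consecutive characters by whether they are Unicode letters;
    # any non-letter (including "'") acts as a token boundary.
    return ["".join(g) for is_alpha, g in groupby(text, str.isalpha) if is_alpha]

print(word_tokenize("don't"))       # ['don', 't']
print(word_tokenize("naïve café"))  # ['naïve', 'café'] (non-ASCII letters kept)
```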

It depends too much on grammar and context; for example, an apostrophe joins a contraction in English (`don't`), while in French it separates an elided article from the noun (`l'avion`). I am contradicting myself here, because there also exists a sentence delimiter tokenizer to which the same objection applies.

patrickfrey commented 7 years ago

The word tokenizer extracts sequences of alphabetic characters of any language. Renaming the tokenizer could be an option, but I do not know a better name.