patrickfrey / strusAnalyzer

Library for document analysis (segmentation, tokenization, normalization, aggregation) with the goal of producing a set of items that can be inserted into a strus storage. Some functions for analyzing tokens or phrases of a strus query are also provided.
http://www.project-strus.net
Mozilla Public License 2.0

tokenizer 'word' is not logical #47

Open andreasbaumann opened 7 years ago

andreasbaumann commented 7 years ago

It separates `don't` into the tokens `don` and `t`. Of course, I can work around this with the regex tokenizer...
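
For illustration, a minimal sketch in Python of the kind of regex workaround meant here. This is an assumption about the intended pattern, not the actual configuration of strusAnalyzer's regex tokenizer: keep runs of Unicode letters together and allow apostrophes only inside a word.

```python
import re

# Hypothetical pattern for the workaround: [^\W\d_] matches a Unicode
# letter in Python's re module; an apostrophe is accepted only *between*
# letter runs, so "don't" stays one token but a trailing "'" does not.
WORD = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")

def tokenize(text):
    return WORD.findall(text)

print(tokenize("Don't split contractions."))
# ["Don't", 'split', 'contractions']
```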

patrickfrey commented 7 years ago

The pattern matching with strusPattern would be more efficient. The word tokenizer extracts sequences of alphabetic characters of any language represented in Unicode (that is the idea; currently it covers just a lot of languages). What is a word and what is not cannot be decided at the tokenization level for most languages. NLP is the solution here, but it does not yet exist as an integrated part of strus.
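
A rough Python sketch of the behavior described above (not strusAnalyzer's code, just grouping characters by Unicode `str.isalpha()`): every non-letter character, including the apostrophe, ends the current token, which is why `don't` comes out as `don` and `t`.

```python
from itertools import groupby

def word_tokenize(text):
    # Group consecutive characters by whether they are Unicode letters;
    # any non-letter (including "'") acts as a token boundary.
    return ["".join(g) for is_alpha, g in groupby(text, str.isalpha) if is_alpha]

print(word_tokenize("don't"))       # ['don', 't']
print(word_tokenize("naïve café"))  # ['naïve', 'café'] (non-ASCII letters kept)
```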

It depends too much on grammar and context; for example, an apostrophe joins a contraction in English (`don't`), while in French it separates an elided article from the noun (`l'avion`). I am contradicting myself here, because there also exists a sentence delimiter tokenizer to which the same objection applies.

patrickfrey commented 7 years ago

The word tokenizer extracts sequences of alphabetic characters of any language. Renaming the tokenizer could be an option, but I do not know a better name.