Closed: fishfree closed this issue 1 year ago
A token is a more generic concept than a "word", and using the token as the smallest unit of the corpus is a fundamental design decision of the ANNIS query language, AQL. Adding full-text search is out of scope for this project. If you want to tokenize a Chinese corpus, you could e.g. use the spaCy NLP pipeline for Chinese, which includes tokenization: https://spacy.io/models/zh
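As a minimal sketch of that suggestion (assuming spaCy 3.x is installed): a blank Chinese pipeline segments at the character level by default, while the pretrained models or a word segmenter such as pkuseg/jieba can produce word-level tokens instead.

```python
import spacy

# spacy.blank("zh") uses spaCy's Chinese defaults, which segment
# character-by-character; pretrained models like zh_core_web_sm
# would yield word-level tokens instead.
nlp = spacy.blank("zh")
doc = nlp("我爱自然语言处理")
tokens = [t.text for t in doc]
print(tokens)
```

Each resulting token could then be imported into ANNIS as a token annotation, so AQL queries operate on characters (or words, depending on the segmenter chosen).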
@thomaskrause Yes, we can import tokenized text. However, it is very unnatural for native CJK speakers to read space-separated text.
In Chinese, Japanese, and Korean there are no spaces between words, so it must be possible to search by characters rather than by tokens.