korpling / ANNIS

ANNIS is an open source, versatile web browser-based search and visualization architecture for complex multilevel linguistic corpora with diverse types of annotation.
http://corpus-tools.org/annis/
Apache License 2.0

Support search by characters. #824

Closed (fishfree closed this issue 1 year ago)

fishfree commented 1 year ago

Chinese, Japanese, and Korean (CJK) text has no spaces between words, so it is necessary to search by characters rather than by tokens.

thomaskrause commented 1 year ago

"Token" is a more generic concept than "word", and using the token as the smallest unit of the corpus is a fundamental design decision of AQL, the ANNIS query language. Adding full-text search is out of scope for this project. If you want to tokenize a Chinese corpus, you could, for example, use the spaCy NLP pipeline for Chinese, which includes tokenization: https://spacy.io/models/zh
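To illustrate how a token can be something other than a word, here is a minimal sketch in plain Python that treats every non-whitespace character as its own token and then looks up a character sequence over those tokens. The helper names `char_tokens` and `find_sequence` are hypothetical; they are not part of ANNIS, AQL, or spaCy, and this only demonstrates the idea, not how a corpus would actually be imported.

```python
from typing import List, Tuple

# A token here is (start_offset, end_offset, text).
Token = Tuple[int, int, str]


def char_tokens(text: str) -> List[Token]:
    """Return one token per non-whitespace character, with character offsets."""
    return [(i, i + 1, ch) for i, ch in enumerate(text) if not ch.isspace()]


def find_sequence(tokens: List[Token], query: str) -> List[int]:
    """Return the start offsets where the characters of `query` occur
    as consecutive tokens (whitespace in the original text is ignored)."""
    wanted = [ch for ch in query if not ch.isspace()]
    if not wanted:
        return []
    chars = [tok[2] for tok in tokens]
    n = len(wanted)
    return [
        tokens[i][0]
        for i in range(len(chars) - n + 1)
        if chars[i:i + n] == wanted
    ]


if __name__ == "__main__":
    tokens = char_tokens("我喜欢语料库")
    print(find_sequence(tokens, "语料"))  # → [3]
```

With one token per character, a search for a character sequence becomes a search for a sequence of adjacent tokens, which is the kind of unit a token-based query language operates on. Whether this granularity suits a given corpus is a separate design question.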

fishfree commented 1 year ago

@thomaskrause Yes, we can import tokenized text. However, reading space-separated text feels very unnatural to native speakers of CJK languages.