kaykay-dv / pocketsearch

A simple full-text search library for Python using SQLite and its FTS5 extension
https://pocketsearch.readthedocs.io/en/latest/
MIT License

Custom tokenizers #47

Open kaykay-dv opened 1 year ago

kaykay-dv commented 1 year ago

The FTS5 engine ships with several built-in tokenizers, e.g. unicode61 or a tokenizer that implements a Porter stemmer. It would be great to support fully customizable tokenizers in pocketsearch. The problem is that a custom FTS5 tokenizer has to be implemented as a custom sqlite3 extension written in plain C. Ideally, pocketsearch would offer a Python-only option for implementing a tokenizer. Quick research revealed that this is not really possible: the options are limited to what the built-in tokenizers expose as configuration (e.g. separator characters, token characters, etc.). Even recognizing emoticons as tokens is hard to implement with these options, and doing so leads to unwanted side effects.
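To illustrate what the configuration options can and cannot do, here is a minimal sketch using only Python's standard `sqlite3` module, not the pocketsearch API. It assumes the SQLite build behind your Python includes FTS5. The helper name `matches`, the table name `docs`, and the sample texts are made up for this example.

```python
import sqlite3


def matches(tokenize_spec, document, query):
    """Return True if `query` matches `document` under the given FTS5 tokenizer spec.

    Hypothetical helper for illustration only. `tokenize_spec` is pasted into
    the SQL statement, so it must not contain double quotes.
    """
    conn = sqlite3.connect(":memory:")
    try:
        # FTS5 takes the tokenizer name and its options as one quoted string.
        conn.execute(
            f'CREATE VIRTUAL TABLE docs USING fts5(body, tokenize = "{tokenize_spec}")'
        )
        conn.execute("INSERT INTO docs(body) VALUES (?)", (document,))
        row = conn.execute(
            "SELECT count(*) FROM docs WHERE docs MATCH ?", (query,)
        ).fetchone()
        return row[0] > 0
    finally:
        conn.close()


# Built-in Porter stemmer: "running" and "run" are reduced to the same stem.
stem_plain = matches("unicode61", "running shoes", "run")
stem_porter = matches("porter unicode61", "running shoes", "run")

# Treating ':', '-' and ')' as token characters makes ":-)" searchable ...
EMOTICON_SPEC = "unicode61 tokenchars ':-)'"
emoticon_found = matches(EMOTICON_SPEC, "great :-) day", '":-)"')

# ... but the same setting glues an emoticon to a neighbouring word,
# so the word alone is no longer found.
word_plain = matches("unicode61", "That was nice:-)", "nice")
word_glued = matches(EMOTICON_SPEC, "That was nice:-)", "nice")

print(stem_plain, stem_porter, emoticon_found, word_plain, word_glued)
```

The last two calls show the kind of side effect meant above: once `:`, `-` and `)` count as token characters, `nice:-)` is indexed as a single token, so a search for `nice` misses the document. Anything beyond such character-level settings would need a tokenizer registered through the C API.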