quickwit-oss / tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust
MIT License

Add shingle token filter or token n-grams #1204

Open fmassot opened 2 years ago

fmassot commented 2 years ago

I thought this was present in tantivy, but for now there is only an NgramTokenizer, which tokenizes words into character n-grams.

Lucene offers a ShingleFilter which creates shingles, or token n-grams: it builds combinations of adjacent tokens rather than of letters.
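
To illustrate the difference, here is a rough standalone sketch in plain Rust (deliberately not using tantivy's TokenFilter API, since that integration is exactly what this issue asks for) of what character n-grams versus token shingles would produce:

```rust
/// Character n-grams of a single word, roughly what an NgramTokenizer-style
/// tokenizer produces (e.g. "hello" -> "he", "el", "ll", "lo" for n = 2).
fn char_ngrams(word: &str, n: usize) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    if chars.len() < n {
        return vec![];
    }
    (0..=chars.len() - n)
        .map(|i| chars[i..i + n].iter().collect())
        .collect()
}

/// Token shingles (token n-grams), roughly what a ShingleFilter would produce:
/// combinations of adjacent *tokens*, joined with a separator.
fn shingles(tokens: &[&str], n: usize) -> Vec<String> {
    if tokens.len() < n {
        return vec![];
    }
    (0..=tokens.len() - n)
        .map(|i| tokens[i..i + n].join(" "))
        .collect()
}

fn main() {
    // Character n-grams operate within a word.
    println!("{:?}", char_ngrams("search", 3));
    // -> ["sea", "ear", "arc", "rch"]

    // Token shingles operate across words.
    let tokens = ["please", "divide", "this", "sentence"];
    println!("{:?}", shingles(&tokens, 2));
    // -> ["please divide", "divide this", "this sentence"]
}
```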

For example, this dataset publishes token n-grams, and it would be interesting to index it with tantivy instead of relying on a SQL dump.

mocobeta commented 2 years ago

Hi, I was also trying to implement a shingle filter. I left a PR, but it's incomplete; I tried to explain where I got stuck in the description.

fulmicoton commented 2 years ago

@fmassot I am not sure I understand how the shingle filter could help with the n-gram dataset.

fmassot commented 2 years ago

@fulmicoton ah yes, that was not clear. My idea was to process article contents directly, not the n-gram dataset, which only exists in that form because of legal constraints.