quickwit-oss / tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust

Ngram + Stemmer combination #2303

Open ctron opened 7 months ago

ctron commented 7 months ago

Using the ngram tokenizer in combination with the stemmer seems to produce weird results. Consider the following setup:

Using tantivy: 0.21.0

```rust
let ngram = NgramTokenizer::all_ngrams(3, 8).unwrap();
let mut text = TextAnalyzer::builder(ngram)
  .filter(RemoveLongFilter::limit(40))
  .filter(LowerCaser)
  .filter(Stemmer::new(Language::English))
  .build();
```

Putting in the text September October turns this into:

List of tokens:

```
sep sept sept septem septemb septemb ept ept eptem eptemb eptemb eptemb pte ptem ptemb ptemb ptember ptember tem temb temb tember tember tember o emb emb ember ember ember o ember oc mbe mber mber mber o mber oc mber oct ber ber ber o ber oc ber oct ber octo er er o er oc er oct er octo er octob r o r oc r oct r octo r octob r octob oc oct octo octob octob octob oct octo octob octob octob cto ctob ctobe ctober tob tobe tober obe ober ber
```

I would somehow expect this to be split into September and October first, with the processing then applied to the individual tokens.
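For reference, a minimal sketch of how the token list above can be dumped, assuming the tantivy 0.21 API (where token_stream takes &mut self):

```rust
use tantivy::tokenizer::{
    Language, LowerCaser, NgramTokenizer, RemoveLongFilter, Stemmer, TextAnalyzer, TokenStream,
};

fn main() {
    // Same analyzer as above: ngrams first, then the word-level filters.
    let ngram = NgramTokenizer::all_ngrams(3, 8).unwrap();
    let mut analyzer = TextAnalyzer::builder(ngram)
        .filter(RemoveLongFilter::limit(40))
        .filter(LowerCaser)
        .filter(Stemmer::new(Language::English))
        .build();

    // Print every token the analyzer emits for the sample text.
    let mut stream = analyzer.token_stream("September October");
    while stream.advance() {
        println!("{}", stream.token().text);
    }
}
```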

PSeitz commented 7 months ago

Do you have a reference for an ngram tokenizer that ends the ngram on whitespace?

RemoveLongFilter::limit(40) doesn't make sense here; an ngram token will never reach that length. .filter(Stemmer::new(Language::English)) will give unexpected results.

ctron commented 7 months ago

> Do you have a reference for an ngram tokenizer that ends the ngram on whitespace?

The example above?

> .filter(Stemmer::new(Language::English)) will give unexpected results

Yea, I noticed that :D

Maybe the approach is wrong? Maybe I need something like:

SimpleTokenizer -> RemoveLongFilter -> (Ngram).chain(Stemmer)

I am just not sure how to get there.

PSeitz commented 7 months ago

I meant a reference for an ngram tokenizer that does the tokenization into September, October that you suggested.

> I am just not sure how to get there.

I'm not sure TextAnalyzer can do that currently. You could write your own Tokenizer.

ctron commented 7 months ago

> I meant a reference for an ngram tokenizer that does the tokenization into September, October that you suggested.

That's the SimpleTokenizer one. It gives me:

```
september
october
```
PSeitz commented 7 months ago

SimpleTokenizer is not an ngram tokenizer.

ctron commented 7 months ago

No, it is not. I am sorry, but then I don't understand your question.

ctron commented 7 months ago

So I guess I can come close to that by somehow reversing the API:

```rust
let ngram = NgramTokenizer::all_ngrams(3, 8).unwrap();
let mut text = TextAnalyzer::builder(
    Stemmer::new(Language::English)
        .transform(LowerCaser.transform(RemoveLongFilter::limit(40).transform(SimpleTokenizer::default())))
        .chain(LowerCaser.transform(RemoveLongFilter::limit(40).transform(SimpleTokenizer::default())))
        .chain(LowerCaser.transform(ngram)),
)
.build();
```

That still gives me things like ber octo, but mostly works.

adamreichold commented 7 months ago

Trying to paraphrase what PSeitz is trying to say: This is the expected behaviour of what is generally called an ngram tokenizer, i.e. it will not care about whitespace. The question about a reference was asking for some other system, like Lucene/Elasticsearch, which provides such an ngram tokenizer, because having a reference would a) tell us that such an API is part of the state of the art and b) give us hints on how to add it here if we wanted to.

But indeed, in this case what you are looking for is to build your own Tokenizer, probably by wrapping a TextAnalyzer based on SimpleTokenizer, Stemmer, etc. and then applying NgramTokenizer to the tokens resulting from that, so that you end up with the n-grams of "septemb" and "octob" instead of those of "september october".
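A rough sketch of that approach, assuming the tantivy 0.21 API (the ngrams_of_stemmed_words helper is made up for illustration, and it only collects token texts; a proper custom Tokenizer would also have to produce sensible positions and byte offsets against the original text):

```rust
use tantivy::tokenizer::{
    Language, LowerCaser, NgramTokenizer, SimpleTokenizer, Stemmer, TextAnalyzer, TokenStream,
    Tokenizer,
};

/// Illustrative helper: tokenize into words, lowercase and stem them,
/// then emit the ngrams of each stemmed word, so ngrams never cross a
/// word boundary.
fn ngrams_of_stemmed_words(text: &str) -> Vec<String> {
    // Word-level pipeline: split on whitespace/punctuation, lowercase, stem.
    let mut word_analyzer = TextAnalyzer::builder(SimpleTokenizer::default())
        .filter(LowerCaser)
        .filter(Stemmer::new(Language::English))
        .build();
    let mut ngram = NgramTokenizer::all_ngrams(3, 8).unwrap();

    let mut out = Vec::new();
    let mut words = word_analyzer.token_stream(text);
    while words.advance() {
        // e.g. "September" -> "septemb", "October" -> "octob"
        let stemmed = words.token().text.clone();
        // Run the ngram tokenizer on each stemmed word individually.
        let mut grams = ngram.token_stream(&stemmed);
        while grams.advance() {
            out.push(grams.token().text.clone());
        }
    }
    out
}
```

For "September October" this yields only the ngrams of "septemb" and of "octob", so fragments that span the word boundary, such as "ber octo", no longer appear.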

ctron commented 7 months ago

> Trying to paraphrase what PSeitz is trying to say: This is the expected behaviour of what is generally called an ngram tokenizer, i.e. it will not care about whitespace. The question about a reference was asking for some other system, like Lucene/Elasticsearch, which provides such an ngram tokenizer, because having a reference would a) tell us that such an API is part of the state of the art and b) give us hints on how to add it here if we wanted to.

I think there is: https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/ngram/NGramTokenFilter.html

The idea we have in mind is to search for sub-words, e.g. having a text containing SomeHttpConnection and searching for http to find it.

I think it makes sense to combine multiple tokenizers (chaining): first split the text into words (like the Simple one does), but then also run ngrams and stemmers individually on those words.

It feels like all of those components are there, but they are sometimes implemented as Tokenizers, sometimes as TokenFilters, and then there is the TextAnalyzer. It doesn't seem to be possible to compose the desired behavior, as the APIs don't work well together.

Ideally I would want to create some pipeline, like mentioned above.

adamreichold commented 7 months ago

> The idea we have in mind is to search for sub-words, e.g. having a text containing SomeHttpConnection and searching for http to find it.

In that case, you might want to look at SplitCompoundWords, which will split based on a user-supplied dictionary. This can be more efficient compared to the more brute-force approach of using n-grams, but its success depends entirely on the quality of the dictionary.

(In this particular example, you might actually want to build a TokenFilter that splits camel case identifiers but I am not sure whether this encompasses your whole use case.)
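A full TokenFilter implementation is more involved, but the core splitting logic such a filter would apply could look roughly like this (a standalone sketch, not an existing tantivy API; runs of uppercase letters such as "HTTPConnection" would need extra handling):

```rust
/// Illustrative helper (not part of tantivy): split a camel-case
/// identifier into its sub-words, e.g.
/// "SomeHttpConnection" -> ["Some", "Http", "Connection"].
/// A real TokenFilter would emit one Token per sub-word and adjust
/// the byte offsets accordingly.
fn split_camel_case(text: &str) -> Vec<&str> {
    let mut parts = Vec::new();
    let mut start = 0;
    // Start a new part at every uppercase letter after the first character.
    for (idx, ch) in text.char_indices().skip(1) {
        if ch.is_uppercase() {
            parts.push(&text[start..idx]);
            start = idx;
        }
    }
    parts.push(&text[start..]);
    parts
}
```

With something like that in place, indexing the lowercased sub-words would let a search for http match SomeHttpConnection without resorting to ngrams.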

ctron commented 7 months ago

> In that case, you might want to look at SplitCompoundWords, which will split based on a user-supplied dictionary. This can be more efficient compared to the more brute-force approach of using n-grams, but its success depends entirely on the quality of the dictionary.

Unfortunately, we don't have a dictionary, so that doesn't really work well.

> (In this particular example, you might actually want to build a TokenFilter that splits camel case identifiers but I am not sure whether this encompasses your whole use case.)

I guess that would actually be one way to deal with this. I think it would be great to have more tooling around composing tokenizers and filters. I raised a PR to chain two (or more) tokenizers: https://github.com/quickwit-oss/tantivy/pull/2304 … I believe that's generic enough.