quickwit-oss / tantivy-py

Python bindings for Tantivy

Support tokenizer registration #25

Open ghost opened 3 years ago

ghost commented 3 years ago

Currently, the tokenizer is hard-coded to default. It would be better to support configurable tokenizers, e.g. for Chinese (tantivy-jieba and cang-jie), Japanese (lindera and tantivy-tokenizer-tiny-segmenter), and Korean (lindera + lindera-ko-dic-builder).

https://github.com/tantivy-search/tantivy-py/blob/4ecf7119ea2fc5b3660f38d91a37dfb9e71ece7d/src/schemabuilder.rs#L85
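For reference, tantivy itself already selects the tokenizer per field by name; a minimal Rust sketch (assuming tantivy's TextFieldIndexing/TextOptions builder API) of the field declaration that tantivy-py would need to expose instead of hard-coding "default":

    use tantivy::schema::{IndexRecordOption, TextFieldIndexing, TextOptions};

    // The tokenizer is only referenced by name here; the tokenizer itself is
    // registered on the index separately.
    fn text_options_with_tokenizer(tokenizer_name: &str) -> TextOptions {
        TextOptions::default().set_indexing_options(
            TextFieldIndexing::default()
                .set_tokenizer(tokenizer_name) // e.g. "jieba" or "lang_ja"
                .set_index_option(IndexRecordOption::WithFreqsAndPositions),
        )
    }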

ghost commented 3 years ago

@fulmicoton

Also note that tantivy-py does not come with a Japanese tokenizer. Tantivy has a good and maintained tokenizer called Lindera. If you know Rust, you may have to compile your own version of tantivy-py.

I am trying to add LinderaTokenizer to https://github.com/tantivy-search/tantivy-py/blob/master/src/schemabuilder.rs#L85, but I couldn't figure out where

    index
        .tokenizers()
        .register("lang_ja", LinderaTokenizer::new("decompose", ""));

should go. Do you have any idea?

fulmicoton commented 3 years ago

Anywhere as long as it happens before you index your documents.

Also make sure you declare in the schema that you want to use the tokenizer named "lang_ja" for your Japanese fields.
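For example, a minimal end-to-end sketch in Rust; the LinderaTokenizer::new("decompose", "") call is copied from the snippet above, and the exact signatures depend on the tantivy and lindera-tantivy versions in use:

    use lindera_tantivy::tokenizer::LinderaTokenizer;
    use tantivy::schema::{IndexRecordOption, Schema, TextFieldIndexing, TextOptions};
    use tantivy::{doc, Index};

    fn main() -> tantivy::Result<()> {
        // 1. Declare in the schema that "title" is tokenized by "lang_ja".
        let mut schema_builder = Schema::builder();
        let title = schema_builder.add_text_field(
            "title",
            TextOptions::default()
                .set_indexing_options(
                    TextFieldIndexing::default()
                        .set_tokenizer("lang_ja")
                        .set_index_option(IndexRecordOption::WithFreqsAndPositions),
                )
                .set_stored(),
        );
        let index = Index::create_in_ram(schema_builder.build());

        // 2. Register the tokenizer under that name before indexing anything.
        index
            .tokenizers()
            .register("lang_ja", LinderaTokenizer::new("decompose", ""));

        let mut index_writer = index.writer(50_000_000)?;
        index_writer.add_document(doc!(title => "成田国際空港"))?;
        index_writer.commit()?;
        Ok(())
    }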

fulmicoton commented 3 years ago

Note that these tokenizers typically require shipping a dictionary that is several MB in size, so one will not be shipped by default. Ideally that should live in a separate Python package, and registration of the tokenizer should be done by the user, as suggested by @acc557

zhangchunlin commented 2 years ago

What's the progress on adding support for configurable tokenizers like tantivy-jieba? This is badly needed for non-ASCII text indexing.

fulmicoton commented 2 years ago

I don't have time to work on this but any help is welcome.

zhangchunlin commented 2 years ago

Could you provide some directions/suggestions we could try? I am willing to work on this. Thank you~

adamreichold commented 1 year ago

I think a useful approach might be to add optional features to this crate which can be enabled when building it from source using Maturin to include additional tokenizers. Not sure how to best integrate this with pip's optional dependency support though...
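For example, a hypothetical sketch (the "lindera" feature name and the helper function are illustrative, not existing code):

    // Gate the optional tokenizer behind a Cargo feature so the default wheel
    // stays small and the dictionary is only compiled in when requested.
    #[cfg(feature = "lindera")]
    fn register_lindera(index: &tantivy::Index) {
        use lindera_tantivy::tokenizer::LinderaTokenizer;
        index
            .tokenizers()
            .register("lang_ja", LinderaTokenizer::new("decompose", ""));
    }

A source build would then enable it with something like maturin build --features lindera.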

cjrh commented 11 months ago

I will look at this within the next two weeks or so.

cjrh commented 7 months ago

For (my own) future reference, the upstream tantivy docs for custom tokenizers are here.

cjrh commented 7 months ago

I've started working on this in a branch here (currently incomplete): https://github.com/cjrh/tantivy-py/tree/custom-tokenizer-support

I think it will be possible to add support via features as suggested. We could also consider making builds that include support, just to make it a bit easier for users who might not have or want a rust toolchain. But we'll have to be careful about combinatorial explosion of the builds. Perhaps we'll limit the platforms for the "big" build for example.

cjrh commented 7 months ago

I've done a bit more work and put up my draft PR #200. I will try to add tantivy-jieba in a similar way, behind a feature flag, in the next batch of work I get around to.

The user will have to build the tantivy-py wheel with the additional build-args="--features=lindera" setting. (The tests demonstrate this.)

I've added a small Python test that shows the "user API" of enabling Lindera. We could decide that, if the build is a Lindera build, manually registering the lindera tokenizer (as done below) should not be necessary:

def test_basic():
    sb = SchemaBuilder()
    # Declare that "title" is indexed with the tokenizer named "lang_ja".
    sb.add_text_field("title", stored=True, tokenizer_name="lang_ja")
    schema = sb.build()
    index = Index(schema)
    # Registration must happen before any documents are indexed.
    index.register_lindera_tokenizer()
    writer = index.writer(50_000_000)
    doc = Document()
    doc.add_text("title", "成田国際空港")
    writer.add_document(doc)
    writer.commit()
    index.reload()

What is the user's expectation here: should something like register_lindera_tokenizer() need to be called explicitly?

Also, there are things that seem like settings in the configuration of the tokenizer itself (what's "mode"?). ~And finally, the examples in the README at https://github.com/lindera-morphology/lindera-tantivy show use of TextOptions, which means we probably need support for that in tantivy-py?~ (already done)