toshi-search / Toshi

A full-text search engine in Rust
MIT License

Is there any way to use a custom tokenizer? #791

Open · dzcpy opened this issue 3 years ago

dzcpy commented 3 years ago

Is your feature request related to a problem? Please describe.
One of the features tantivy provides is support for custom tokenizers (https://github.com/tantivy-search/tantivy#features), for example tantivy-jieba; see the registration sketch below this template. Is it possible for Toshi to support this feature?

Does another search engine have this functionality? Can you describe its function?

Do you have a specific use case you are trying to solve?

Additional context
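
For context: at the tantivy level, a custom tokenizer is registered on the index's TokenizerManager under a name that schema fields can then refer to. A minimal sketch, assuming tantivy-jieba's `JiebaTokenizer` and an in-memory index; the tokenizer name `"jieba"` is just a label chosen here:

```rust
use tantivy::schema::Schema;
use tantivy::Index;

// Register tantivy-jieba's tokenizer under a name that text fields in
// the schema can refer to; registration must happen before indexing.
fn open_index_with_jieba(schema: Schema) -> Index {
    let index = Index::create_in_ram(schema);
    index
        .tokenizers()
        .register("jieba", tantivy_jieba::JiebaTokenizer {});
    index
}
```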

hntd187 commented 3 years ago

That would require you to build Toshi with that support, right? I suppose we could start conditionally including tokenizers and cut releases that include them. Do you think that would solve your use case?
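
A minimal sketch of that conditional inclusion, assuming a Cargo feature named `cang_jie` that gates an optional dependency; the helper name and wiring here are illustrative, not Toshi's actual build setup:

```rust
use tantivy::Index;

// Compiled only when built with `cargo build --features cang_jie`;
// registers the Chinese tokenizer on the index's TokenizerManager.
#[cfg(feature = "cang_jie")]
pub fn register_optional_tokenizers(index: &Index) {
    use cang_jie::{CangJieTokenizer, TokenizerOption, CANG_JIE};
    use jieba_rs::Jieba;
    use std::sync::Arc;

    index.tokenizers().register(
        CANG_JIE,
        CangJieTokenizer {
            worker: Arc::new(Jieba::new()),
            option: TokenizerOption::Default { hmm: false },
        },
    );
}

// No-op stub when the feature is off, so callers need no cfg checks.
#[cfg(not(feature = "cang_jie"))]
pub fn register_optional_tokenizers(_index: &Index) {}
```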

dzcpy commented 3 years ago

@hntd187 Yes, that would be very helpful. Thanks for your awesome work; this project seems very promising.

hntd187 commented 3 years ago

What tokenizers specifically would you like to see included? I know the one you linked hasn't been updated in some time and is two tantivy versions behind, so I don't know whether it still works.

hntd187 commented 3 years ago

In https://github.com/toshi-search/Toshi/blob/master/toshi-server/src/lib.rs#L55 I added the ability to conditionally register the cang_jie tokenizer if you build Toshi with it. If you want, we can add more tokenizers; I'll probably come up with some more general traits to make this implementation easier to extend in the future.
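
For reference, the schema side of using such a tokenizer: a text field names the registered tokenizer, and tantivy looks it up at indexing time. A sketch assuming cang_jie's exported `CANG_JIE` name constant; the field name is illustrative:

```rust
use cang_jie::CANG_JIE;
use tantivy::schema::{IndexRecordOption, Schema, TextFieldIndexing, TextOptions};

// Build a schema whose "body" field is tokenized by the tokenizer
// registered under the CANG_JIE name.
fn chinese_text_schema() -> Schema {
    let mut builder = Schema::builder();
    let indexing = TextFieldIndexing::default()
        .set_tokenizer(CANG_JIE)
        .set_index_option(IndexRecordOption::WithFreqsAndPositions);
    let options = TextOptions::default()
        .set_indexing_options(indexing)
        .set_stored();
    builder.add_text_field("body", options);
    builder.build()
}
```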

dzcpy commented 3 years ago

@hntd187 Thanks very much, I think that's pretty much what I need. In the future people might want other tokenizers, such as Japanese and Korean ones, but for me only Chinese is needed.