Arabic language support

Hadryan commented 2 years ago

All varieties of Arabic combined are spoken by perhaps as many as 422 million speakers (native and non-native) in the Arab world, making it the fifth most spoken language in the world. (Wikipedia)

fulmicoton commented 2 years ago

Of course we'd love to have Arabic supported in tantivy! (I am not sure what the point of the Wikipedia quote was.) tantivy's approach here is to make it possible to implement tokenizers as external crates.

This is a way for us to keep the project as light as possible, and also to avoid making uneducated choices on the user's behalf about which tokenizer is best.

@Hadryan if you (or anyone) would like to contribute a tokenizer for Arabic and Farsi, you can have a look at what such a crate looks like for Japanese https://github.com/lindera-morphology/lindera-tantivy.

You can also join us on Discord if you need help. I'd be happy to add a link from tantivy's README to your tokenizer, as is done for other languages.

I am not closing this ticket right away, mostly to see if people want to comment and offer to collaborate on building such a tokenizer.

mustafa0x commented 1 year ago

Does this document indicate that Arabic is now supported?

https://docs.rs/tantivy/latest/tantivy/tokenizer/enum.Language.html#variant.Arabic

fulmicoton commented 1 year ago

That's for stemming. It is listed there because the underlying stemming library supports Arabic, but we don't have a dedicated tokenizer for it.

mustafa0x commented 1 year ago

Silly question, but isn't tokenizing just splitting by space (in most cases, at least)? I tested with some Arabic and everything seemed to work out of the box.

PSeitz commented 1 year ago

Yes, splitting by whitespace will cover most cases. Cases it does not cover include, e.g., compound words and stemming.
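
To make the point above concrete, here is a minimal sketch of whitespace tokenization applied to Arabic text. It uses only the Rust standard library and does not call tantivy; the function name `whitespace_tokens` and the sample sentence are assumptions made for illustration. It shows that the split itself works, but that a word with an attached article or conjunction comes out as a different token from the bare word, which is the kind of case that stemming or a language-aware tokenizer would have to handle.

```rust
/// Hypothetical helper (not part of tantivy): split on Unicode whitespace
/// and trim non-alphanumeric characters (e.g. punctuation) from token edges.
fn whitespace_tokens(text: &str) -> Vec<&str> {
    text.split_whitespace()
        .map(|t| t.trim_matches(|c: char| !c.is_alphanumeric()))
        .filter(|t| !t.is_empty())
        .collect()
}

fn main() {
    // Sample sentence (an assumption for illustration). It contains
    // "الكتاب" (the book) and "والكتاب" (and the book).
    let text = "قرأت الكتاب والكتاب جديد.";
    let tokens = whitespace_tokens(text);

    for token in &tokens {
        println!("{}", token);
    }

    // The bare word "كتاب" (book) never appears as its own token, because
    // the article and the conjunction are written attached to the word.
    // An exact-match lookup for it therefore finds nothing.
    let query = "كتاب";
    println!("exact match for query: {}", tokens.contains(&query));
}
```

With plain whitespace splitting, a query for the bare word would miss both attached forms, so whether this is good enough depends on how much recall the application needs.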

quickwit-oss / tantivy

Arabic language support #1417