meilisearch / charabia

Library used by Meilisearch to tokenize queries and documents
MIT License

Chinese segmentation not correct #226

Open sivdead opened 1 year ago

sivdead commented 1 year ago

I noticed that this program uses jieba.cut to segment Chinese text, but it doesn't always work well. For example, for the Chinese phrase 永永远远是龙的传人, jieba.cut produces 永永远远/是/龙的传人, while jieba.cut_for_search produces 永远/远远/永永远远/是/传人/龙的传人, which I think is better for index search.
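
The difference between the two modes can be reproduced directly with the jieba-rs crate (the Rust port of jieba that charabia uses). A minimal sketch, assuming the crate's default bundled dictionary; the exact segmentation can vary across dictionary versions:

```rust
// Compare jieba-rs segmentation modes on the sentence from this report.
use jieba_rs::Jieba;

fn main() {
    let jieba = Jieba::new();
    let text = "永永远远是龙的传人";

    // Exact mode: one non-overlapping segmentation.
    // Expected (per this report): ["永永远远", "是", "龙的传人"]
    let exact = jieba.cut(text, false);
    println!("cut:            {:?}", exact);

    // Search mode: additionally emits shorter sub-words of long tokens,
    // producing overlapping tokens that improve recall.
    // Expected (per this report):
    // ["永远", "远远", "永永远远", "是", "传人", "龙的传人"]
    let search = jieba.cut_for_search(text, false);
    println!("cut_for_search: {:?}", search);
}
```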

sivdead commented 1 year ago

I can make a PR to solve this if you think it should be fixed.

ManyTheFish commented 1 year ago

Hello @sivdead, you're right: using cut_for_search would increase the recall of Meilisearch by splitting words in several overlapping ways. However, Meilisearch relies on word positions for queries, Jieba.cut_for_search doesn't give any clue about the position of each token, and charabia does not support shifting tokens. In order to support this kind of position-shifting behavior, the charabia output would have to be changed into a tree shape. For instance, 永永远远是龙的传人 would be shaped as:

永永远远 ──┬─► 是 ─┬─► 龙的传人
永远 ─────┤       └─► 传人
远远 ─────┘
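
For illustration, one hypothetical way to flatten such a tree is to let overlapping sub-words reuse the position of the long token they overlap (similar in spirit to Lucene's zero position increment). The PositionedToken struct and its fields below are invented for this sketch and are not charabia's actual output format:

```rust
// Hypothetical position-shifted token stream for 永永远远是龙的传人,
// annotated by hand. NOT charabia's current API; names are assumptions.
struct PositionedToken<'a> {
    lemma: &'a str,
    position: usize, // word position used for phrase/proximity ranking
}

fn main() {
    let tokens = vec![
        PositionedToken { lemma: "永永远远", position: 0 },
        PositionedToken { lemma: "永远", position: 0 },   // overlaps position 0
        PositionedToken { lemma: "远远", position: 0 },   // overlaps position 0
        PositionedToken { lemma: "是", position: 1 },
        PositionedToken { lemma: "龙的传人", position: 2 },
        PositionedToken { lemma: "传人", position: 2 },   // overlaps position 2
    ];
    for t in &tokens {
        println!("{} -> position {}", t.lemma, t.position);
    }
}
```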

That is not possible without a significant amount of work, but I have to admit it would enhance search recall significantly.

Thank you for your report, and sorry for the slow answer,

See you!