manticoresoftware / manticoresearch

Easy to use open source fast database for search | Good alternative to Elasticsearch now | Drop-in replacement for E in the ELK soon
https://manticoresearch.com
GNU General Public License v3.0
8.68k stars 483 forks

Jieba integration #931

Open oabu opened 1 year ago

oabu commented 1 year ago

ICU is not a good choice for Chinese text. In addition, being able to customize the dictionary is very important for Chinese word segmentation, because word usage differs completely between industries.
Take Jieba as an example: it has a mode called search mode, which is designed specifically for full-text retrieval.

To illustrate, I made an example; please take a look and you will see the difference.
http://lx.host.dabai.com/
The FULL result is the correct one.

Take "清华大学" (Tsinghua University) as an example: few people search for the full "清华大学"; most use "清华" as the keyword, so we need to index both "清华大学" and "清华". @sanikolaev @dzcpy
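To make the search-mode idea concrete, here is a minimal Python sketch. It is a toy under stated assumptions, not Jieba's implementation: the dictionary is a handful of hand-picked words, and greedy forward maximum matching stands in for Jieba's precise mode. The names `DICTIONARY`, `precise_cut`, and `search_cut` are hypothetical and exist only for this illustration.

```python
# Toy illustration only: a tiny hand-written dictionary, not Jieba's real one.
DICTIONARY = {"清华大学", "清华", "大学", "我", "来到", "北京"}
MAX_WORD_LEN = max(len(w) for w in DICTIONARY)


def precise_cut(text):
    """Greedy forward maximum matching (a stand-in for a precise mode)."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in DICTIONARY:
                words.append(candidate)
                i += size
                break
    return words


def search_cut(text):
    """Emit dictionary sub-words of each long word, then the word itself."""
    tokens = []
    for word in precise_cut(text):
        for n in (2, 3):
            if len(word) > n:
                for j in range(len(word) - n + 1):
                    gram = word[j:j + n]
                    if gram in DICTIONARY:
                        tokens.append(gram)
        tokens.append(word)
    return tokens


print(precise_cut("我来到北京清华大学"))
# ['我', '来到', '北京', '清华大学']
print(search_cut("我来到北京清华大学"))
# ['我', '来到', '北京', '清华', '大学', '清华大学']
```

With this toy dictionary, the precise cut keeps "清华大学" as a single token, while the search cut additionally emits "清华" and "大学", so a query for "清华" can match a document that contains only "清华大学".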

sanikolaev commented 1 year ago

@oabu Thank you for your feedback. So do you recommend adding integration with https://github.com/yanyiwu/cppjieba ?

malacca commented 1 year ago

@sanikolaev Yes, I hope Manticore Search can integrate with Jieba. Because it does not support Chinese word segmentation, I have temporarily chosen Meilisearch.

If you decide to integrate with Jieba, here is a nice discussion to refer to

sanikolaev commented 1 year ago

@fxtxkktv in https://github.com/manticoresoftware/manticoresearch/issues/1137 expressed his interest in adding Jieba support into Manticore.

axhiao commented 1 year ago

There is another repo related to Chinese word segmentation, and it is written in C++.

https://github.com/fastcws/fastcws

sanikolaev commented 1 year ago

> there is another repo that is related to Chinese word segmentation. And it was written in C++.

Jieba seems to be more popular. What are the advantages of this one? Is there any benchmark comparing it with Jieba and/or ICU?

fxtxkktv commented 1 year ago

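> There is another repo that is related to Chinese word segmentation. And it was written in C++.

> Jieba seems to be more popular. What are the advantages of this one? Is there any benchmark comparing it with Jieba and/or ICU?

Jieba: custom Chinese word segmentation is useful.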

jacentsao commented 11 months ago

@sanikolaev Hi, is there any plan to use Jieba for Chinese text segmentation? The most popular Chinese text segmentation library is https://github.com/fxsjy/jieba, and its C++ version is https://github.com/yanyiwu/cppjieba.

sanikolaev commented 11 months ago

This issue won't make it to the upcoming release. Hopefully we'll address this issue in the next release, i.e. in a few months.

JonGates commented 10 months ago

I think Jieba is currently the best open-source Chinese word segmenter. It supports both Simplified and Traditional Chinese segmentation, as well as custom dictionaries.

Jieba supports three segmentation modes: precise mode, full mode, and search engine mode. It is very well suited to full-text search; I also use Jieba in Elasticsearch. @sanikolaev
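As a rough illustration of full mode and of why a custom dictionary matters, here is a small Python sketch. It is a toy under stated assumptions and does not use Jieba itself: the word sets `BASE_WORDS` and `CUSTOM_WORDS` and the function `full_cut` are hypothetical, and the scan simply lists every dictionary word found at any position in the text.

```python
def full_cut(text, dictionary):
    """List every dictionary word that occurs anywhere in the text."""
    max_len = max(len(word) for word in dictionary)
    found = []
    for start in range(len(text)):
        # Check every substring starting here, up to the longest word length.
        for end in range(start + 1, min(start + max_len, len(text)) + 1):
            piece = text[start:end]
            if piece in dictionary:
                found.append(piece)
    return found


# Hypothetical general-purpose dictionary.
BASE_WORDS = {"计算", "专家"}
# Hypothetical industry-specific addition ("云计算" means cloud computing).
CUSTOM_WORDS = BASE_WORDS | {"云计算"}

print(full_cut("云计算专家", BASE_WORDS))
# ['计算', '专家']
print(full_cut("云计算专家", CUSTOM_WORDS))
# ['云计算', '计算', '专家']
```

Without the custom entry, the industry term "云计算" never becomes a token, so a search for it cannot match as a word; adding it to the dictionary fixes that without changing the code.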

oabu commented 10 months ago

> @oabu Thank you for your feedback. So do you recommend adding integration with https://github.com/yanyiwu/cppjieba ?

https://github.com/fxsjy/jieba https://github.com/yanyiwu/cppjieba

jaric commented 6 months ago

Hi @sanikolaev,

Do you have any plan or timeline regarding the full integration of Jieba?

Thanks.

sanikolaev commented 6 months ago

Hi @jaric

Unfortunately, it's not in our near-term plans, but we are still interested in it. Ideally, we'd like someone to make a pull request or sponsor the development :)

thegenius commented 5 months ago

This is very important for Chinese developers choosing Manticore. For now, small companies may choose PostgreSQL, and big companies stick to Elasticsearch. I think Meilisearch and Manticore will be the next stars. Many friends of mine at startups recommend Meilisearch for its ease of use and Chinese support. I personally prefer Manticore for being SQL-first, but I am disappointed by the absence of Jieba support. This is not so hard, but it is absolutely important!

sanikolaev commented 5 months ago

@thegenius thanks for the comment. I've added this task to the roadmap - https://roadmap.manticoresearch.com/

xzxiaoshan commented 3 months ago

Jieba is very important for Chinese; I hope it becomes available soon.

smellbee commented 1 month ago

Is there any news on this topic? Lots of startup companies are waiting for this feature.

sanikolaev commented 1 month ago

> Is there any news on this topic?

@smellbee, unfortunately, there is no significant progress on this topic yet, except that we now have a better understanding of how this can be integrated internally. Regretfully, none of those startup companies have been willing to sponsor the development. For more information, you can visit: https://manticoresearch.com/services/

smellbee commented 1 month ago

> Regretfully, none of those startup companies have been willing to sponsor the development. For more information, you can visit: https://manticoresearch.com/services/

I think those startups lack financial security, or are simply too weak right now. Adopting a new technical solution is experimental and risky, so persuading them to switch is not easy. Most of them can only follow the older but widely known solutions that the majority uses.

But if there are key features that are important to their business, that might prompt them to give it a try.
Once they see some benefit, such as reduced hardware costs or easier implementation of business features, I think they may become genuinely willing to give back through sponsorship or even investment.

In my opinion, if we want to target a market with a large number of potential customers, this feature could be of some importance. Many larger or giant companies focused on the Chinese market need better database solutions; they are potential big donors or investors. I am 100% sure that if this feature is released, it will draw their attention.

lgl5240 commented 1 month ago

I have been following this project for quite some time, but haven't used it because the Chinese word segmentation support was not very user-friendly. I remember that https://github.com/veelion/manticoresearch-seg provides Chinese word segmentation support. I wonder why the official team hasn't incorporated this project. https://github.com/manticoresoftware/manticoresearch/pull/175

sanikolaev commented 1 month ago

Hopefully we'll have time to integrate with Jieba in a few weeks. There are two major tasks to finish before it: