jaiminpan / pg_jieba

PostgreSQL full-text search extension for Chinese
BSD 3-Clause "New" or "Revised" License

Modify dictionaries settings #59

Closed: lddfg closed this issue 10 months ago

lddfg commented 10 months ago
  1. Token `x` does not need to be passed through the dictionaries.
  2. Token `eng` can use the default dictionary (roughly what these two changes amount to in SQL is sketched below).
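
Assuming the shipped configuration is named `jiebacfg` and currently maps the `x` (blank/other) token type to a dictionary, the proposal corresponds roughly to the following plain SQL. The PR itself changes the extension's defaults, so this is only an approximation:

```sql
-- Stop indexing 'x' tokens: a token type with no mapping is simply discarded.
ALTER TEXT SEARCH CONFIGURATION jiebacfg
    DROP MAPPING IF EXISTS FOR x;

-- Stem English tokens; if 'eng' has no mapping yet, use ADD MAPPING instead.
ALTER TEXT SEARCH CONFIGURATION jiebacfg
    ALTER MAPPING FOR eng WITH english_stem;
```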
lddfg commented 10 months ago

Hello, I don't know this project very well, but when I run mixed Chinese and English text through the segmenter, a lot of ` characters end up in the lexemes (other noise can be removed with stop words, but ` does not seem to be caught that way). My rough take is that token x means blank, so it does not need to be filtered through stop words or any other dictionary at all. If there are any problems with my modifications, please feel free to discuss them with me.

Changing the dictionary used for the follow-up processing of eng tokens is a small, self-contained change. Using english_stem folds singular and plural forms, and the results may be better.
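
A quick way to see which token type each fragment is assigned, and which dictionaries and lexemes it ends up with, is PostgreSQL's `ts_debug`. The configuration name `jiebacfg` and the sample string below are only illustrative:

```sql
-- Shows, per fragment, the parser's token type (alias), the dictionaries
-- consulted, and the resulting lexemes. Per this report, stray ` characters
-- appear as 'x' tokens whose lexemes still survive into the tsvector.
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('jiebacfg', '全文检索 works with `pg_jieba` and English words');
```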

lddfg commented 10 months ago

Before fix: [image]. After fix: [image].

jaiminpan commented 10 months ago

Thanks for your work. I think the "space" token is useful in some use cases, just not in yours.

But since the 'x' type covers more than just spaces, I can't accept the merge, although I appreciate your work. I think the 'right' modification is to change the code in the jieba core to identify the specific type ('blank' vs. 'space') of the word and let the caller decide what to do with it.

In your case, setting up a custom configuration is a good way to meet your requirement.
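
For reference, a minimal sketch of such a custom configuration, assuming the parser is registered as `jieba` and the token-type names (`n`, `v`, `a`, `i`, `e`, `l`, `eng`, `x`) match what `ts_token_type('jieba')` reports; the configuration name `my_jieba` is illustrative:

```sql
CREATE TEXT SEARCH CONFIGURATION my_jieba (PARSER = jieba);

-- Index the common Chinese word classes with the simple dictionary.
ALTER TEXT SEARCH CONFIGURATION my_jieba
    ADD MAPPING FOR n, v, a, i, e, l WITH simple;

-- Stem English tokens so singular/plural forms match.
ALTER TEXT SEARCH CONFIGURATION my_jieba
    ADD MAPPING FOR eng WITH english_stem;

-- 'x' is deliberately left unmapped, so blank/other tokens are discarded
-- instead of being indexed.
```

Queries can then use it explicitly, e.g. `to_tsvector('my_jieba', ...)`, or it can be set as `default_text_search_config`.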

lddfg commented 10 months ago

Okay, I understand. Thank you for your work.