Closed lddfg closed 10 months ago
Hello, I don't know much about this project, but when I segment mixed Chinese and English input, a lot of blank (space) tokens end up in the lexemes. (Other unwanted tokens can be removed with stop words, but the blanks don't seem to be.) Perhaps a bit irresponsibly, I assumed that token type 'x' means a blank, so there would be no need to filter it out through stop words or another dictionary. If there are any problems with my modifications, please feel free to discuss them with me.
Switching to a different dictionary for subsequent processing of the 'eng' tokens is straightforward and can be done as a follow-up change. Applying english_stem to collapse singular/plural variants may give better results.
(Screenshots omitted: segmentation results before the fix and after the fix.)
Thanks for your work. I think the "space" token is useful in some use cases, just not in yours.
Because the 'x' type covers more than just spaces, I can't accept this merge, although I appreciate the work. I think the "right" modification is to change the code in the jieba core to identify the specific type ('blank' | 'space') of the word, and let the caller decide what to do with it.
In your case, setting up a custom configuration is a good way to fit your requirement.
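To illustrate the caller-side approach suggested here, the sketch below filters segmenter output so that only whitespace-only tokens flagged 'x' are dropped, while non-blank 'x' tokens (punctuation, symbols) survive. This is a minimal illustration, not the project's actual API: the `(word, flag)` pairs are hand-written stand-ins for real segmenter output, and `drop_blank_tokens` is a hypothetical helper name.

```python
def drop_blank_tokens(tagged_tokens):
    """Remove tokens that are flagged 'x' AND consist only of whitespace.

    Dropping every 'x' token would also discard punctuation and symbols,
    which is the maintainer's objection; this keeps them.
    """
    return [
        (word, flag)
        for word, flag in tagged_tokens
        if not (flag == "x" and word.strip() == "")
    ]

# Hand-written sample of mixed Chinese/English segmenter output.
sample = [("你好", "l"), (" ", "x"), ("hello", "eng"), (",", "x"), ("世界", "n")]
print(drop_blank_tokens(sample))
```

A filter like this lives entirely in the caller, so the core segmenter's behavior (and other users who rely on space tokens) stays unchanged.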
Okay, I understand. Thank you for your work.