rkcosmos / deepcut

A Thai word tokenization library using Deep Neural Network
MIT License

deepcut with user dictionary support? #22

Closed: igodhand closed this issue 7 years ago

igodhand commented 7 years ago

Dear K.Rakpong krub,

May I make a suggestion about using deepcut with user dictionary support?

According to your sample "โรงเรียน", there are many words that shouldn't be cut, such as "ขี้เกียจ".

[Image: d1]

I also have some specific words that I don't want deepcut to cut, such as "หูอื้อ" in the middle of a sentence (when the input is only "หูอื้อ", it works fine!).

My workaround for this issue is to enclose every such word (for example "ขี้เกียจ" or "หูอื้อ") in "#" and replace the markers in a final post-processing step.

[Image: d2]
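For readers who want something to experiment with, here is a minimal sketch of a related approach. It is not the exact "#" marker workaround described above and it is not part of deepcut: instead of marking words and cleaning up afterwards, it splits the text around the user-dictionary words and only sends the remaining pieces to the tokenizer. The function name `tokenize_with_user_dict` and its parameters are hypothetical, and the tokenizer is passed in as a plain callable so the sketch does not depend on any particular library version.

```python
import re
from typing import Callable, Iterable, List


def tokenize_with_user_dict(
    text: str,
    user_words: Iterable[str],
    tokenize: Callable[[str], List[str]],
) -> List[str]:
    """Tokenize `text`, keeping each entry of `user_words` as one token.

    Hypothetical helper for illustration only; `tokenize` can be any
    function that maps a string to a list of tokens.
    """
    # Try longer entries first so a long word wins over a shorter prefix.
    words = sorted({w for w in user_words if w}, key=len, reverse=True)
    if not words:
        return tokenize(text)

    # The capturing group makes re.split keep the matched words, so the
    # result alternates: [other, match, other, match, ..., other].
    pattern = re.compile("(" + "|".join(re.escape(w) for w in words) + ")")

    tokens: List[str] = []
    for index, piece in enumerate(pattern.split(text)):
        if not piece:
            continue  # empty gap before, after, or between matches
        if index % 2 == 1:
            tokens.append(piece)  # protected word, kept whole
        else:
            tokens.extend(tokenize(piece))  # everything else, cut as usual
    return tokens


# Possible usage, assuming deepcut is installed:
#   import deepcut
#   tokenize_with_user_dict(sentence, ["ขี้เกียจ", "หูอื้อ"], deepcut.tokenize)
```

One trade-off to keep in mind: the tokenizer only sees the text on either side of a protected word, so it loses that word as context, which may change how the neighbouring text is cut.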

I don't know whether this is the best way to achieve this, but it would be great if we could supply a dictionary of words that shouldn't be cut. I think many users will have their own project-specific words that they don't want cut, so it would be nice to have this feature.

Best Regards,

Ping

rkcosmos commented 7 years ago

Hi Ping,

Adding # is a nice workaround. I agree we should implement this feature properly, as many people have asked for it. I will take care of it soon, in the next version.