rkcosmos / deepcut

A Thai word tokenization library using Deep Neural Network
MIT License

Could you explain how to use stopword? #49

Closed MissSoYa closed 5 years ago

MissSoYa commented 5 years ago

Hello Deepcut team, I would like to define custom stop words manually. Please let me know how to do that.

Thank you :)

titipata commented 5 years ago

Hello @MissSoYa, custom stop words are supported in DeepcutTokenizer but not yet in deepcut.tokenize itself. For deepcut.tokenize, you can remove stop words manually by filtering the tokens, for example with a list comprehension:

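import deepcut

stop_words = ['ฉัน', 'อยาก']  # words to drop after tokenization
# Tokenize first, then keep only the tokens that are not stop words
[w for w in deepcut.tokenize('ฉันอยากกินข้าวของฉัน') if w not in stop_words]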

>> ['กิน', 'ข้าว', 'ของ']
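If the filter is needed in more than one place, it could be wrapped in a small helper. This is only a sketch: `remove_stop_words` and `tokenize_without_stop_words` are hypothetical names, not part of deepcut, and the token list below is an assumed tokenizer output, hard-coded so the snippet runs without deepcut installed.

```python
from typing import Callable, Iterable, List


def remove_stop_words(tokens: Iterable[str], stop_words: Iterable[str]) -> List[str]:
    """Return the tokens in their original order, dropping any stop word."""
    stop_set = frozenset(stop_words)  # set membership is faster than scanning a list
    return [token for token in tokens if token not in stop_set]


def tokenize_without_stop_words(
    text: str,
    stop_words: Iterable[str],
    tokenize: Callable[[str], List[str]],
) -> List[str]:
    """Tokenize text with the given tokenizer function, then filter stop words."""
    return remove_stop_words(tokenize(text), stop_words)


# Assumed token list for 'ฉันอยากกินข้าวของฉัน' (hard-coded for illustration).
tokens = ['ฉัน', 'อยาก', 'กิน', 'ข้าว', 'ของ', 'ฉัน']
filtered = remove_stop_words(tokens, ['ฉัน', 'อยาก'])
print(filtered)
```

With deepcut installed, `deepcut.tokenize` can be passed as the `tokenize` argument of `tokenize_without_stop_words`.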

For DeepcutTokenizer, pass the stop words through the stop_words argument:

from deepcut import DeepcutTokenizer

raw_documents = ['ฉันอยากกินข้าวของฉัน',
                 'ฉันอยากกินไก่มาก',
                 'อยากนอนอย่างสงบ']
tokenizer = DeepcutTokenizer(ngram_range=(1, 1), stop_words=['ฉัน', 'อยาก'])
X = tokenizer.fit_tranform(raw_documents) # will not have ฉัน, อยาก in tokenizer.vocabulary_
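To illustrate what "will not have ฉัน, อยาก in the vocabulary" means, here is a rough pure-Python sketch of building a unigram vocabulary while skipping stop words. It is not DeepcutTokenizer's actual implementation: `build_vocabulary` is a hypothetical helper, and the token lists are assumed tokenizer outputs, hard-coded so the snippet runs without deepcut.

```python
from typing import Dict, Iterable, List


def build_vocabulary(
    tokenized_docs: Iterable[List[str]],
    stop_words: Iterable[str],
) -> Dict[str, int]:
    """Map every non-stop-word token to a column index (sorted for stable order)."""
    stop_set = frozenset(stop_words)
    kept = {token for doc in tokenized_docs for token in doc if token not in stop_set}
    return {token: index for index, token in enumerate(sorted(kept))}


# Assumed token lists for the three raw documents (illustration only).
tokenized_docs = [
    ['ฉัน', 'อยาก', 'กิน', 'ข้าว', 'ของ', 'ฉัน'],
    ['ฉัน', 'อยาก', 'กิน', 'ไก่', 'มาก'],
    ['อยาก', 'นอน', 'อย่าง', 'สงบ'],
]
vocabulary = build_vocabulary(tokenized_docs, ['ฉัน', 'อยาก'])

# Neither stop word is present in the resulting vocabulary.
print('ฉัน' in vocabulary, 'อยาก' in vocabulary)
```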

MissSoYa commented 5 years ago

👍 Thank you, but DeepcutTokenizer does not work.

titipata commented 5 years ago

Can you give a few more details and an example showing why it doesn't work?

titipata commented 5 years ago

@MissSoYa, if you reinstall deepcut from the repository with the proper dependencies, DeepcutTokenizer should now work properly. I will close the issue for now.