rkcosmos / deepcut

A Thai word tokenization library using Deep Neural Network
MIT License
420 stars 96 forks

Transform method in DeepcutTokenizer #42

Closed: Zylinks closed 6 years ago

Zylinks commented 6 years ago

CountVectorizer has a transform method, but DeepcutTokenizer doesn't. Will deepcut implement this method in the future? Basically, I have to switch from CountVectorizer to DeepcutTokenizer, and I need to transform the test data before predicting. // I am not good with English grammar. If you don't understand, I can explain in Thai. Thanks.

titipata commented 6 years ago

Hi @Zylinks, let's say ['ฉันบินได้', 'ฉันกินข้าว', 'ฉันอยากบิน'] is your training set and ['ฉันหิว', 'เขาไม่มี'] is your validation set. You can use DeepcutTokenizer as follows:

from deepcut import DeepcutTokenizer
tokenizer = DeepcutTokenizer(ngram_range=(1,1),
                             max_df=1.0, min_df=0.0)
X_train = tokenizer.fit_tranform(['ฉันบินได้', 'ฉันกินข้าว', 'ฉันอยากบิน'])

If you inspect the vocabulary, it will look something like:

tokenizer.vocabulary_
>> {'ได้': 0, 'บิน': 1, 'ฉัน': 2, 'อยาก': 3, 'ข้าว': 4, 'กิน': 5}

Then, when you apply the tokenizer to the validation set, the second row is all zeros, since none of the words in เขาไม่มี appear in the training data. Likewise, หิว does not appear in the training set, so the first row has a single 1, in the 3rd column, which corresponds to ฉัน.

X_test = tokenizer.transform(['ฉันหิว', 'เขาไม่มี'])
X_test.todense()

>> matrix([[0., 0., 1., 0., 0., 0.],
           [0., 0., 0., 0., 0., 0.]])
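
To make the zero rows easier to see, here is a rough pure-Python sketch of how a fixed vocabulary maps new documents to count rows. It is not how deepcut works internally. The token lists are segmentations I assume for illustration (in practice they come from the tokenizer), and `count_rows` is a hypothetical helper, not part of deepcut.

```python
# Vocabulary as printed above (token -> column index).
vocabulary = {'ได้': 0, 'บิน': 1, 'ฉัน': 2, 'อยาก': 3, 'ข้าว': 4, 'กิน': 5}

# Assumed segmentations of the validation sentences, written by hand here;
# a real run would get these from the tokenizer.
validation_tokens = [
    ['ฉัน', 'หิว'],
    ['เขา', 'ไม่', 'มี'],
]


def count_rows(token_lists, vocab):
    """Build one count row per document, skipping tokens not in vocab."""
    rows = []
    for tokens in token_lists:
        row = [0] * len(vocab)
        for token in tokens:
            index = vocab.get(token)
            if index is not None:  # unseen tokens contribute nothing
                row[index] += 1
        rows.append(row)
    return rows


rows = count_rows(validation_tokens, vocabulary)
for row in rows:
    print(row)
```

Only ฉัน is in the vocabulary, so the first row gets a count in column 2 and the second row stays empty, matching the matrix above.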

Currently, DeepcutTokenizer does not implement a fit method. You can just do

_ = tokenizer.fit_tranform(['ฉันบินได้', 'ฉันกินข้าว', 'ฉันอยากบิน'])

instead of .fit(), and simply ignore the output.
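
If you need an object that really has a .fit() method (for example, to pass to code that expects one), a small wrapper might be enough. This is only a sketch: `FitAdapter` is a hypothetical name, not something deepcut provides, and `WhitespaceStub` is a stand-in so the example runs without deepcut installed. It assumes only the `fit_tranform` and `transform` methods shown in this thread.

```python
class FitAdapter:
    """Hypothetical wrapper adding fit() to an object that exposes
    fit_tranform() and transform(), as DeepcutTokenizer does in this thread."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def fit(self, raw_documents, y=None):
        # Learn the vocabulary and discard the document-term matrix.
        self.tokenizer.fit_tranform(raw_documents)
        return self

    def transform(self, raw_documents):
        return self.tokenizer.transform(raw_documents)


class WhitespaceStub:
    """Stand-in for illustration only: splits on spaces, not a Thai tokenizer."""

    def fit_tranform(self, raw_documents):
        tokens = sorted({t for doc in raw_documents for t in doc.split()})
        self.vocabulary_ = {t: i for i, t in enumerate(tokens)}
        return self.transform(raw_documents)

    def transform(self, raw_documents):
        rows = []
        for doc in raw_documents:
            row = [0] * len(self.vocabulary_)
            for t in doc.split():
                if t in self.vocabulary_:  # unseen tokens are ignored
                    row[self.vocabulary_[t]] += 1
            rows.append(row)
        return rows


adapter = FitAdapter(WhitespaceStub()).fit(['a b', 'b c'])
result = adapter.transform(['a d', 'c c'])
print(result)
```

With deepcut installed, you would presumably pass a DeepcutTokenizer instance in place of the stub; I have not verified that against every deepcut version.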

Zylinks commented 6 years ago

Thank you