Hi @Zylinks, let's say ['ฉันบินได้', 'ฉันกินข้าว', 'ฉันอยากบิน'] is your training set and ['ฉันหิว', 'เขาไม่มี'] is your validation set. You can use DeepcutTokenizer like the following:
from deepcut import DeepcutTokenizer

tokenizer = DeepcutTokenizer(ngram_range=(1, 1),
                             max_df=1.0, min_df=0.0)
# learn the vocabulary from the training texts and return the document-term matrix
X_train = tokenizer.fit_tranform(['ฉันบินได้', 'ฉันกินข้าว', 'ฉันอยากบิน'])
If you inspect the vocabulary, it will look something like this:
tokenizer.vocabulary_
>> {'ได้': 0, 'บิน': 1, 'ฉัน': 2, 'อยาก': 3, 'ข้าว': 4, 'กิน': 5}
Then, when you apply it to the validation dataset, you will see that the second row is all zeros, since none of the words in เขาไม่มี exist in the training data. หิว also does not exist in your training set, so the first row has only a 1 in the 3rd column, which corresponds to ฉัน.
X_test = tokenizer.transform(['ฉันหิว', 'เขาไม่มี'])
X_test.todense()
>> matrix([[0., 0., 1., 0., 0., 0.],
           [0., 0., 0., 0., 0., 0.]])
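If it helps to check which token each column corresponds to, here is a small sketch in plain Python using the vocabulary_ attribute shown above (this is not a deepcut API, just inverting that dictionary):

# invert the vocabulary: column index -> token
inv_vocab = {idx: tok for tok, idx in tokenizer.vocabulary_.items()}
for idx in sorted(inv_vocab):
    print(idx, inv_vocab[idx])
# with the vocabulary above, column index 2 (the 3rd column) is ฉัน,
# so the single 1 in the first row of X_test is the count of ฉัน in 'ฉันหิว'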
Currently, we haven't implemented a fit method in DeepcutTokenizer. You can just do

_ = tokenizer.fit_tranform(['ฉันบินได้', 'ฉันกินข้าว', 'ฉันอยากบิน'])

instead of .fit() and simply ignore the output.
Thank you
CountVectorizer has a transform method, but DeepcutTokenizer doesn't have this method. Will deepcut implement this method in the future? Basically, I have to switch from CountVectorizer to DeepcutTokenizer, because I need to transform the test data before predicting. // I am not good with English grammar; if you don't understand, I can explain in Thai. Thanks.
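For reference, a minimal sketch of that workflow could look like the following. The classifier (scikit-learn's LogisticRegression) and the labels y_train are placeholders assumed only for illustration; the deepcut calls are the ones shown above:

from deepcut import DeepcutTokenizer
from sklearn.linear_model import LogisticRegression  # any scikit-learn classifier would do

tokenizer = DeepcutTokenizer(ngram_range=(1, 1), max_df=1.0, min_df=0.0)

# fit the vocabulary on the training texts and get the training document-term matrix
X_train = tokenizer.fit_tranform(['ฉันบินได้', 'ฉันกินข้าว', 'ฉันอยากบิน'])
y_train = [0, 1, 0]  # made-up labels, only for this sketch

clf = LogisticRegression()
clf.fit(X_train, y_train)

# reuse the same fitted tokenizer to transform the test data before predicting
X_test = tokenizer.transform(['ฉันหิว', 'เขาไม่มี'])
predictions = clf.predict(X_test)

The key point is to call fit_tranform only on the training texts and transform (with the same tokenizer instance) on the test texts, so both matrices share the vocabulary learned from the training set.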