Chinese text classification issue

pommedeterresautee / fastrtext

R wrapper for fastText

https://pommedeterresautee.github.io/fastrtext/


Chinese text classification issue #34

sound118 closed this issue 5 years ago

sound118 commented 5 years ago

I ran into an issue with a Chinese text classification prediction model, as follows:

test_sentences$text2[9]
[1] "蛋白粉开封后两个月在次食用味道发苦"
predict(model, test_sentences$text2[9])
[[1]]
 label262
0.5312194

predict(model, "蛋白粉开封后两个月在次食用味道发苦")
[[1]]
 label314
0.9935217

In short, after training a model with fastrtext, predicting on tokenized Chinese text held in an object (e.g. test_sentences$text2[9] in my case) gives a wrong prediction with low probability. If I instead paste the same tokenized Chinese text directly into the predict call, as I did above, it gives the correct label with high probability. I am really confused by this. Can anyone help? Much appreciated!

pommedeterresautee commented 5 years ago

I have no experience with Chinese. Can you tell me what happens if you do:

a <- test_sentences$text2[9]
predict(model, a)

sound118 commented 5 years ago

When I run

a <- test_sentences$text2[9]
predict(model, a)

the output is:

[[1]]
 label262
0.5312194

This is a wrong prediction, with a low probability of 0.53. However, when I paste the value of a, the tokenized Chinese text "蛋白粉开封后两个月在次食用味道发苦", directly into the call as shown below:

predict(model, "蛋白粉开封后两个月在次食用味道发苦")

it gives the correct prediction, with a high probability of 0.9935:

[[1]]
 label314
0.9935217

This is so weird, since the two R objects (a <- test_sentences$text2[9] and c("蛋白粉开封后两个月在次食用味道发苦")) are basically the same. Why are the predictions so different?

pommedeterresautee commented 5 years ago

It may be due to an encoding issue. Is your source code file encoded in UTF-8? Another test: does the following return TRUE?

test_sentences$text2[9] == "蛋白粉开封后两个月在次食用味道发苦"

sound118 commented 5 years ago

Yes, the source code file is encoded in UTF-8, and test_sentences$text2[9] == "蛋白粉开封后两个月在次食用味道发苦" is TRUE. My training dataset is an Excel file encoded in UTF-8 and my test dataset is a .txt file encoded in UTF-8, but the problem still occurs. So puzzled!
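A TRUE result from == does not by itself rule out an encoding difference, because R translates strings with different declared encodings to a common encoding before comparing them. The sketch below illustrates this with a short stand-in string (not the text from this thread) held once as UTF-8 and once as latin1. Whether the same mechanism explains the fastrtext behaviour reported here is an assumption; it would require predict() to see the raw bytes rather than the translated text.

```r
# Stand-in string; the \u escape makes R store it as UTF-8.
x_utf8 <- "caf\u00e9"

# The same text re-encoded as latin1 (iconv marks the result as latin1).
x_latin1 <- iconv(x_utf8, from = "UTF-8", to = "latin1")

# The declared encodings differ.
Encoding(x_utf8)
Encoding(x_latin1)

# == still reports equality, since both sides are translated before comparing.
x_utf8 == x_latin1

# The underlying bytes are different, which is what byte-oriented code sees.
charToRaw(x_utf8)    # 63 61 66 c3 a9
charToRaw(x_latin1)  # 63 61 66 e9
identical(charToRaw(x_utf8), charToRaw(x_latin1))
```

Running Encoding() and charToRaw() on test_sentences$text2[9] and on the pasted literal would show whether the two inputs differ in this way.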

sound118 commented 5 years ago

Thanks for your hint. It really was an encoding issue. Once I used predict(model, enc2native(test_sentences[["text2"]])), all labels were predicted correctly. I just needed to add the enc2native function to convert the encoding.
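A minimal sketch of how the inputs could be inspected before and after the conversion described above. The helper describe_encoding is hypothetical (it is not part of fastrtext), and the sample string is only the first three characters of the sentence from this thread, written with \u escapes so the snippet does not depend on the encoding of the file it is saved in.

```r
# Hypothetical helper: report how R has marked each string and how many
# bytes and characters it contains.
describe_encoding <- function(x) {
  data.frame(
    declared = Encoding(x),
    n_bytes  = nchar(x, type = "bytes"),
    n_chars  = nchar(x, type = "chars"),
    stringsAsFactors = FALSE
  )
}

# "蛋白粉", stored as UTF-8 because of the \u escapes.
text <- "\u86cb\u767d\u7c89"
describe_encoding(text)

# Convert to the session's native encoding, as in the fix reported above.
# In a UTF-8 locale the bytes stay the same; in other locales they may change.
native_text <- enc2native(text)
describe_encoding(native_text)

# The converted vector is what would then be passed to the model, e.g.
# predict(model, native_text)   # `model` is the fitted model from this thread
```

If the byte counts differ between the two calls, the session's native encoding is not UTF-8, which would be consistent with enc2native changing the prediction.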

pommedeterresautee commented 5 years ago

Thanks for your report. I am closing the issue, as it is not related to the package.