pommedeterresautee / fastrtext

R wrapper for fastText
https://pommedeterresautee.github.io/fastrtext/

Why increasing wordNgrams in execute() makes accuracy decrease #29

Closed lantianx2020 closed 5 years ago

lantianx2020 commented 6 years ago

Hi, I noticed that the accuracy of my prediction (using the predict function) decreased from 0.9 to 0.8 when I increased the wordNgrams parameter from 1 to 2 during training. As I kept increasing wordNgrams, the accuracy kept decreasing, and it even hit 0.002 when wordNgrams reached 5. I was really confused, since I thought increasing wordNgrams would improve the performance of training. Can anyone tell me what's going on? Thanks!

pommedeterresautee commented 6 years ago

It is related to the way fastText manages n-grams. Basically, unigrams are stored in the dictionary, but n-grams are handled by the hashing trick. Because hashing implies collisions, the more n-grams you have, the more collisions you get. To fix this, you can increase the size of the hash table (I have used up to 2^30; usually 2^25 is good enough for everything). But if just adding bigrams decreases the quality of your model, chances are that n-grams are not a good choice in your case.
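To make the collision effect concrete, here is a small sketch (in Python rather than R, with FNV-1a standing in for fastText's internal hash function, and a synthetic bigram vocabulary) that counts how many distinct bigrams land in an already-occupied slot for two hash-table sizes:

```python
# Illustrative sketch: the hashing trick maps every n-gram to one of
# `bucket` slots, so distinct n-grams can collide. FNV-1a is used here
# only as a stand-in for fastText's internal hash function.

def fnv1a(s: str) -> int:
    # 32-bit FNV-1a hash of a UTF-8 string.
    h = 0x811C9DC5
    for byte in s.encode("utf-8"):
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h

def count_collisions(ngrams, bucket):
    # Count n-grams that hash into a slot already taken by another n-gram.
    seen, collisions = set(), 0
    for ng in ngrams:
        slot = fnv1a(ng) % bucket
        if slot in seen:
            collisions += 1
        else:
            seen.add(slot)
    return collisions

# Synthetic vocabulary of 500k distinct bigrams.
bigrams = [f"word{i} word{i + 1}" for i in range(500_000)]

small_table = count_collisions(bigrams, 2 ** 21)  # ~2M slots
large_table = count_collisions(bigrams, 2 ** 25)  # ~33M slots, as suggested above
print(small_table, large_table)  # the larger table collides far less
```

In fastText the table size is controlled by the -bucket command-line option, which you can pass through execute() like any other fastText argument.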

lantianx2020 commented 6 years ago

@pommedeterresautee Thanks for your reply! Do you have any idea in what cases n-grams do not improve the performance of training? Thank you!

pommedeterresautee commented 6 years ago

When you can do the task without bigrams :-) Some topical classification works well with unigrams alone, for instance when your text is long. N-grams are a way to capture local word order, and this is very useful for sentiment classification, for instance.
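A minimal illustration of that last point (Python used purely for demonstration): with bag-of-unigrams features, two sentences with opposite sentiment can look identical, while bigrams keep the word order that distinguishes them.

```python
# Two sentences with opposite sentiment but the same set of words.
a = "not bad quite good".split()
b = "not good quite bad".split()

unigrams_a, unigrams_b = set(a), set(b)
print(unigrams_a == unigrams_b)  # True: unigram features cannot tell them apart

bigrams_a = set(zip(a, a[1:]))
bigrams_b = set(zip(b, b[1:]))
print(bigrams_a == bigrams_b)  # False: bigrams preserve the local word order
```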