Preprocessing Text for multiclass text classification

86mm86 commented 5 years ago

I am trying to perform supervised multiclass classification of email text with fastrtext, loading the Wikipedia pretrained vectors. Overall accuracy is low (< 0.2), even though:

1) I have reduced the number of classes from 150 to 32 (by setting a threshold and sampling the overrepresented classes); 2) I have tuned hyperparameters such as the loss function, learning rate, minimum and maximum character n-gram lengths, number of epochs, etc.

I have also done some text preprocessing, even though I have not been able to get the text as clean as I would like. Do you think the low accuracy is related to text preprocessing (low-quality text), or am I missing something obvious?

If this is the case, could you point me to some R libraries that could help me achieve my goal?

Any help appreciated!

pommedeterresautee commented 5 years ago

You can't load the Wikipedia pretrained vectors and then change the vector size afterwards; I think you have an error there. For classification, you don't need pretrained vectors unless your dataset is very small, so you may want to remove the loading of pretrained vectors.

86mm86 commented 5 years ago

Actually, I was not changing the vector dimension (it was always 300). The problem turned out to be exactly what you pointed out above: using my own word embeddings in the text classifier instead, accuracy increased to 0.6.
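
On the preprocessing half of the question, which the thread leaves open, here is a minimal sketch of the kind of cleanup that often precedes fastText-style supervised training. It is not taken from the thread. The helper names `clean_text` and `to_fasttext_line`, the `_url_` and `_email_` placeholder tokens, and the regular expressions are all illustrative assumptions. The only part that reflects fastText itself is the default `__label__` prefix on each training line. It is written in Python for brevity; the same steps can be done in base R with `tolower` and `gsub`.

```python
import re

# Illustrative patterns only; they are assumptions, not tuned for any dataset.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
NON_WORD_RE = re.compile(r"[^a-z0-9_\s]")
SPACE_RE = re.compile(r"\s+")


def clean_text(text):
    """Lowercase, mask URLs and email addresses, drop punctuation, collapse whitespace."""
    text = text.lower()
    text = URL_RE.sub(" _url_ ", text)      # hypothetical placeholder token
    text = EMAIL_RE.sub(" _email_ ", text)  # hypothetical placeholder token
    text = NON_WORD_RE.sub(" ", text)       # keep letters, digits, underscores
    return SPACE_RE.sub(" ", text).strip()


def to_fasttext_line(label, text):
    """Build one training line: '__label__<label> <cleaned text>'."""
    # Labels must not contain whitespace, so replace it with underscores.
    safe_label = SPACE_RE.sub("_", label.strip())
    return f"__label__{safe_label} {clean_text(text)}"


if __name__ == "__main__":
    print(to_fasttext_line("Billing Issue", "Invoice #123 is WRONG!!"))
```

Whether masking URLs and addresses helps depends on the data. For email classification the sender domain can itself be a useful signal, so it may be worth comparing accuracy with and without that step.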
