Comparing different tokenization functions using a LIL matrix, the best configuration is:
lower=True punct=True url=False mention=True
8312 unique terms in vocabulary
acc= 0.814
However, when I change the tokenizer method while keeping this configuration, the accuracy decreases, so the tokenizer as it is currently used is the best.
PS: I am not sure about the method I used; I will show you in the meeting today.
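For reference, a minimal sketch of how a sweep over the four flags could be set up. The flag semantics are assumptions, since the notes do not define them: `lower` lowercases the text, `punct` keeps punctuation (split on whitespace only), and `url` / `mention` collapse URLs and @-mentions into placeholder tokens. The names `tokenize`, `vocabulary_size`, and `all_configurations`, the placeholder strings, and the two sample documents are all hypothetical. The scoring step that produced the accuracy is left out because the classifier is not described here.

```python
import re
from itertools import product

# Assumed patterns for the two "collapse" flags (hypothetical, not from the notes).
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def tokenize(text, lower=True, punct=True, url=False, mention=True):
    """Tokenize one document under the assumed flag semantics."""
    if lower:
        text = text.lower()
    if url:
        text = URL_RE.sub("THIS_IS_A_URL", text)
    if mention:
        text = MENTION_RE.sub("THIS_IS_A_MENTION", text)
    if punct:
        # Keep punctuation: split on whitespace only.
        return text.split()
    # Drop punctuation: keep runs of word characters.
    return re.findall(r"\w+", text)


def vocabulary_size(docs, **flags):
    """Number of unique terms produced by one flag configuration."""
    return len({tok for doc in docs for tok in tokenize(doc, **flags)})


def all_configurations():
    """All 16 True/False combinations of the four flags."""
    names = ("lower", "punct", "url", "mention")
    return [dict(zip(names, values)) for values in product([True, False], repeat=4)]


# Toy documents for illustration only.
docs = ["Hello @bob, see http://example.com!", "hello @alice see HTTP"]
for flags in all_configurations():
    print(flags, vocabulary_size(docs, **flags))
```

Each configuration would then be scored with the same classifier and the same folds, so that only the tokenization differs between runs.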
There is some discrepancy between the accuracy reported by the grid search and the cross-validation accuracy obtained with the best parameters. I may look at this more closely at a later date. (Perhaps restarting the notebook will help.)
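One possible cause, not confirmed here, is that the grid search and the follow-up cross-validation used different fold splits. The sketch below illustrates the idea in plain Python on synthetic 1-D toy data with a small k-NN classifier; none of it is the actual pipeline, and `make_folds`, `knn_predict`, `cv_accuracy`, and `grid_search` are hypothetical helpers. When the same folds are reused, the re-check reproduces the search score exactly; with a different shuffle it may not.

```python
import random
from statistics import mean


def make_folds(n, k, seed):
    """Shuffle indices 0..n-1 with a fixed seed and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def knn_predict(train, x, k):
    """Majority label among the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def cv_accuracy(data, folds, k):
    """Mean per-fold accuracy of k-NN on the given folds."""
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [p for i, p in enumerate(data) if i not in held_out]
        correct = sum(knn_predict(train, data[i][0], k) == data[i][1] for i in fold)
        scores.append(correct / len(fold))
    return mean(scores)


def grid_search(data, folds, ks):
    """Return (best_k, best_score) over the candidate values of k."""
    scored = [(cv_accuracy(data, folds, k), k) for k in ks]
    best_score, best_k = max(scored, key=lambda s: s[0])
    return best_k, best_score


# Synthetic toy data: label is 1 when a noisy copy of x exceeds 0.5.
rng = random.Random(0)
data = []
for _ in range(60):
    x = rng.random()
    data.append((x, int(x + rng.gauss(0, 0.2) > 0.5)))

folds = make_folds(len(data), 5, seed=42)
best_k, search_score = grid_search(data, folds, ks=[1, 3, 5, 7])

same_folds_score = cv_accuracy(data, folds, best_k)
other_folds_score = cv_accuracy(data, make_folds(len(data), 5, seed=7), best_k)

print("best k:", best_k)
print("grid search score:     ", search_score)
print("re-check, same folds:  ", same_folds_score)
print("re-check, other folds: ", other_folds_score)
```

In scikit-learn, the analogous check would be to pass one splitter with a fixed `random_state` as the `cv` argument to both the grid search and the cross-validation call, then compare the two numbers again.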