tapilab / protest

Analyze Brazilian protests on Twitter

What is the best tokenization function? #5

Closed aronwc closed 9 years ago

ElaineResende commented 9 years ago

After comparing different tokenization functions using a LIL matrix, the best configuration is: lower=True, punct=True, url=False, mention=True. This gives 8312 unique terms in the vocabulary and acc=0.814.
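For readers following along, here is a minimal sketch of what a tokenizer with these four flags might look like, together with a loop over all flag combinations. The flag semantics are assumptions, since the thread does not define them: `punct=True` is taken to keep punctuation as separate tokens, while `url=False` and `mention=False` are taken to collapse URLs and @mentions into placeholder tokens. The names `tokenize`, `vocabulary_size`, the placeholders, and the toy `DOCS` list are hypothetical and are not taken from the repository.

```python
import re
from itertools import product

# Assumed patterns; the project's actual regexes are not shown in the thread.
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)
MENTION_RE = re.compile(r"@\w+")
# Order matters: URLs and mentions are matched before plain words and punctuation.
TOKEN_RE = re.compile(r"https?://\S+|@\w+|\w+|[^\w\s]", re.IGNORECASE)
PUNCT_RE = re.compile(r"[^\w\s]")


def tokenize(text, lower=True, punct=True, url=False, mention=True):
    """Split text into tokens according to four boolean flags (assumed semantics)."""
    if lower:
        text = text.lower()
    if not url:
        # Collapse every URL into one shared placeholder token.
        text = URL_RE.sub("URL_PLACEHOLDER", text)
    if not mention:
        # Collapse every @mention into one shared placeholder token.
        text = MENTION_RE.sub("MENTION_PLACEHOLDER", text)
    tokens = TOKEN_RE.findall(text)
    if not punct:
        # Drop tokens that are a single punctuation character.
        tokens = [t for t in tokens if not PUNCT_RE.fullmatch(t)]
    return tokens


def vocabulary_size(docs, lower, punct, url, mention):
    """Count unique tokens over a collection of documents."""
    vocab = set()
    for doc in docs:
        vocab.update(tokenize(doc, lower=lower, punct=punct, url=url, mention=mention))
    return len(vocab)


# Toy documents, invented for illustration only.
DOCS = [
    "Vem pra rua! @Fulano http://t.co/abc",
    "vem pra RUA, @Beltrano http://t.co/xyz",
]

# Evaluate all 16 combinations of (lower, punct, url, mention).
sizes = {
    flags: vocabulary_size(DOCS, *flags)
    for flags in product([True, False], repeat=4)
}
```

In a real comparison, each configuration would be scored by classifier accuracy rather than by vocabulary size alone; the vocabulary count is used here only because it needs no labeled data.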

However, when I change the tokenizer method to use this configuration, the accuracy decreases, so the tokenizer as it is currently used is the best.

P.S.: I am not sure about the method I used; I will show you in today's meeting.

aronwc commented 9 years ago

There is some discrepancy between the accuracy reported by the grid search and the accuracy from cross-validation with the best parameters. I may look at this more closely at a later date. (Perhaps restarting the notebook will help.)
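One possible source of such a gap (an assumption; the thread does not confirm the cause) is that the grid search and the follow-up cross-validation shuffled the data into different folds. A minimal sketch of pinning the split with a seed, so that both evaluations score exactly the same folds, could look like this. `kfold_indices` is a hypothetical helper, not code from the repository.

```python
import random


def kfold_indices(n_samples, n_folds, seed):
    """Return n_folds disjoint lists of test indices from a seeded shuffle."""
    indices = list(range(n_samples))
    # A dedicated Random instance keeps the split independent of global state.
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into folds round-robin.
    return [sorted(indices[i::n_folds]) for i in range(n_folds)]


# The same seed always yields the same folds, so two separate evaluations
# that both call kfold_indices(..., seed=0) are scored on identical splits.
folds_for_search = kfold_indices(n_samples=10, n_folds=5, seed=0)
folds_for_check = kfold_indices(n_samples=10, n_folds=5, seed=0)
```

If the two accuracies still differ when the folds are identical, the split is ruled out and the cause lies elsewhere, for example in stale notebook state.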