marco-c opened this issue 5 years ago
Regarding fastText, see https://github.com/RaRe-Technologies/gensim-data/issues/26#issuecomment-408814033.
@marco-c I just tried using 2-gram features for the `TfidfVectorizer`, here is the gist of the results:
Before:

```
X: (11007, 21990), y: (11007,)
Cross Validation scores:
Accuracy: 0.951746383784194 (+/- 0.0027559934918468167)
Precision: 0.9821515475460835 (+/- 0.0025632765014633823)
Recall: 0.9621308627678055 (+/- 0.002984157203642199)
```
After:

```
X: (11007, 95124), y: (11007,)
Cross Validation scores:
Accuracy: 0.9525540057386749 (+/- 0.004604483751288196)
Precision: 0.9830817914064187 (+/- 0.0035752238919424186)
Recall: 0.9621308627678055 (+/- 0.0035581451061348556)
```
Unfortunately, in the second case the process crashes with `terminate called after throwing an instance of 'std::bad_alloc'`.
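For reference, switching to 2-gram features in scikit-learn only requires passing `ngram_range` to the vectorizer. A minimal sketch (the tiny corpus below is made up, just to show how the feature count grows):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the real bug-report corpus.
corpus = [
    "crash when opening a new tab",
    "new tab crashes the browser",
    "feature request: dark mode for settings",
]

# Default options: word 1-grams only.
unigram = TfidfVectorizer()
X1 = unigram.fit_transform(corpus)

# Word 1-grams and 2-grams together; the vocabulary (and memory use) grows a lot,
# which is what leads to the bad_alloc on the full dataset.
bigram = TfidfVectorizer(ngram_range=(1, 2))
X2 = bigram.fit_transform(corpus)

print(X1.shape, X2.shape)
```

On the real corpus this is exactly the jump visible above: the feature dimension went from 21990 to 95124.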
Results for which model?
To avoid OOMs, you can try to reduce the number of ngrams considered.
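One way to do that, for instance, is with `TfidfVectorizer`'s `max_features` and `min_df` options, which cap or prune the n-gram vocabulary before the matrix is built. A sketch with toy data (the corpus is made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the real training texts.
corpus = [
    "browser crashes on startup",
    "browser hangs on startup",
    "crash report submitted on startup",
]

# max_features keeps only the most frequent n-grams;
# min_df drops n-grams that appear in fewer than 2 documents.
vec = TfidfVectorizer(ngram_range=(1, 2), max_features=10, min_df=2)
X = vec.fit_transform(corpus)

print(X.shape)  # at most 10 columns regardless of corpus size
```

On the full dataset, a `max_features` cap bounds memory directly, while `min_df` removes the long tail of 2-grams that occur in only a handful of bugs.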
These are the results for the bug model. Should I try it out for other models too? To generalize, wouldn't it be better to have a feature for passing the value of n from the terminal? (something like `python run.py --train --model=bug --ngrams=2`)
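Such a flag could be wired up with `argparse`; the sketch below is hypothetical (neither `run.py` nor these exact flag names are the project's actual CLI):

```python
import argparse

def parse_args(argv=None):
    # Hypothetical CLI mirroring: python run.py --train --model=bug --ngrams=2
    parser = argparse.ArgumentParser(description="Train a model")
    parser.add_argument("--train", action="store_true", help="train the model")
    parser.add_argument("--model", default="bug", help="which model to train")
    parser.add_argument(
        "--ngrams", type=int, default=1,
        help="max n for word n-gram features, i.e. ngram_range=(1, n)",
    )
    return parser.parse_args(argv)

args = parse_args(["--train", "--model=bug", "--ngrams=2"])
print(args.model, args.ngrams)  # bug 2
```

The parsed value would then be forwarded as `TfidfVectorizer(ngram_range=(1, args.ngrams))` wherever the vectorizer is constructed.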
> These are the results for the bug model. Should I try it out for other models too?
Yes, let's try the tracking, regression and component models too.
> To generalize, wouldn't it be better to have a feature for passing the value of n from the terminal? (something like `python run.py --train --model=bug --ngrams=2`)
Yes, it would be nice, but not so easy.
Right now we are using `TfidfVectorizer` with its default options (basically word 1-grams). We should try a few different options and see how the accuracy changes:
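One simple way to compare options is a loop over `ngram_range` settings with cross-validated scores, like the sketch below (the toy texts and labels are made up stand-ins for the real bug corpus):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in for the real (text, label) training data.
texts = [
    "crash when opening a tab", "browser crash on startup",
    "add dark mode to settings", "request new toolbar icon",
] * 10
labels = [1, 1, 0, 0] * 10

# Sweep a few vectorizer settings and compare mean cross-validated accuracy.
for ngram_range in [(1, 1), (1, 2), (1, 3)]:
    model = make_pipeline(
        TfidfVectorizer(ngram_range=ngram_range),
        LogisticRegression(),
    )
    scores = cross_val_score(model, texts, labels, cv=5, scoring="accuracy")
    print(ngram_range, scores.mean())
```

The same loop could also vary `max_features`, `min_df`, or `sublinear_tf` to see which combination trades memory against accuracy best.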