marco-c opened this issue 5 years ago
Regarding fastText, see https://github.com/RaRe-Technologies/gensim-data/issues/26#issuecomment-408814033.
@marco-c I just tried using 2-gram features for the `TfidfVectorizer`, here is the gist of the results:
Before:

```
X: (11007, 21990), y: (11007,)
Cross Validation scores:
Accuracy: 0.951746383784194 (+/- 0.0027559934918468167)
Precision: 0.9821515475460835 (+/- 0.0025632765014633823)
Recall: 0.9621308627678055 (+/- 0.002984157203642199)
```
After:

```
X: (11007, 95124), y: (11007,)
Cross Validation scores:
Accuracy: 0.9525540057386749 (+/- 0.004604483751288196)
Precision: 0.9830817914064187 (+/- 0.0035752238919424186)
Recall: 0.9621308627678055 (+/- 0.0035581451061348556)
```
Unfortunately, in the second case the process crashes with `terminate called after throwing an instance of 'std::bad_alloc'`.
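For reference, switching to 2-gram features in scikit-learn only requires passing `ngram_range` to the vectorizer. A minimal sketch (the tiny corpus below is made up, just to show how the feature count grows):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the real bug-report corpus.
corpus = [
    "crash when opening a new tab",
    "new tab crashes the browser",
    "feature request: dark mode for settings",
]

# Default options: word 1-grams only.
unigram = TfidfVectorizer()
X1 = unigram.fit_transform(corpus)

# Word 1-grams and 2-grams together; the vocabulary (and memory use) grows a lot,
# which is what leads to the bad_alloc on the full dataset.
bigram = TfidfVectorizer(ngram_range=(1, 2))
X2 = bigram.fit_transform(corpus)

print(X1.shape, X2.shape)
```

On the real corpus this is exactly the jump visible above: the feature dimension went from 21990 to 95124.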
Results for which model?
To avoid OOMs, you can try to reduce the number of ngrams considered.
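One way to do that, for instance, is with `TfidfVectorizer`'s `max_features` and `min_df` options, which cap or prune the n-gram vocabulary before the matrix is built. A sketch with toy data (the corpus is made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the real training texts.
corpus = [
    "browser crashes on startup",
    "browser hangs on startup",
    "crash report submitted on startup",
]

# max_features keeps only the most frequent n-grams;
# min_df drops n-grams that appear in fewer than 2 documents.
vec = TfidfVectorizer(ngram_range=(1, 2), max_features=10, min_df=2)
X = vec.fit_transform(corpus)

print(X.shape)  # at most 10 columns regardless of corpus size
```

On the full dataset, a `max_features` cap bounds memory directly, while `min_df` removes the long tail of 2-grams that occur in only a handful of bugs.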
These are the results for the bug model. Should I try it out for other models too? To generalize, wouldn't it be better to have a feature for passing the value of n from the terminal? (something like `python run.py --train --model=bug --ngrams=2`)
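Such a flag could be wired up with `argparse`; the sketch below is hypothetical (neither `run.py` nor these exact flag names are the project's actual CLI):

```python
import argparse

def parse_args(argv=None):
    # Hypothetical CLI mirroring: python run.py --train --model=bug --ngrams=2
    parser = argparse.ArgumentParser(description="Train a model")
    parser.add_argument("--train", action="store_true", help="train the model")
    parser.add_argument("--model", default="bug", help="which model to train")
    parser.add_argument(
        "--ngrams", type=int, default=1,
        help="max n for word n-gram features, i.e. ngram_range=(1, n)",
    )
    return parser.parse_args(argv)

args = parse_args(["--train", "--model=bug", "--ngrams=2"])
print(args.model, args.ngrams)  # bug 2
```

The parsed value would then be forwarded as `TfidfVectorizer(ngram_range=(1, args.ngrams))` wherever the vectorizer is constructed.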
> These are the results for the bug model. Should I try it out for other models too?
Yes, let's try the tracking, regression and component models too.
> To generalize, wouldn't it be better to have a feature for passing the value of n from the terminal? (something like `python run.py --train --model=bug --ngrams=2`)
Yes, it would be nice, but not so easy.
Right now we are using `TfidfVectorizer` with its default options (basically word 1-grams). We should try a few different options and see how the accuracy changes:
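One simple way to compare options is a loop over `ngram_range` settings with cross-validated scores, like the sketch below (the toy texts and labels are made up stand-ins for the real bug corpus):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in for the real (text, label) training data.
texts = [
    "crash when opening a tab", "browser crash on startup",
    "add dark mode to settings", "request new toolbar icon",
] * 10
labels = [1, 1, 0, 0] * 10

# Sweep a few vectorizer settings and compare mean cross-validated accuracy.
for ngram_range in [(1, 1), (1, 2), (1, 3)]:
    model = make_pipeline(
        TfidfVectorizer(ngram_range=ngram_range),
        LogisticRegression(),
    )
    scores = cross_val_score(model, texts, labels, cv=5, scoring="accuracy")
    print(ngram_range, scores.mean())
```

The same loop could also vary `max_features`, `min_df`, or `sublinear_tf` to see which combination trades memory against accuracy best.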