piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Add mSDA model #294

Open phdowling opened 9 years ago

phdowling commented 9 years ago

Hey!

I suggested this a while back, and I've made some progress since, so I figured I would bring it up again. I implemented the marginalized stacked denoising autoencoder (mSDA) algorithm described in http://www.cse.wustl.edu/~mchen/papers/msdadomain.pdf. Is there any interest in adding this to Gensim?

My implementation is memory-independent and trains a 1000-dimensional model on Wikipedia in around 12 hours on my machine (8 cores @ 3 GHz). I haven't tested it thoroughly yet, but some initial tests confirm that the representations it generates capture topical similarity. It would be great if you could give me some ideas for a benchmark I could use to properly validate the model!

Repository is at https://github.com/phdowling/mSDA.
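For anyone unfamiliar with the method: each mSDA layer has a closed-form solution, because the expected reconstruction under feature dropout can be computed analytically instead of by sampling corrupted copies. Here is a minimal NumPy sketch of one layer following the linked paper; this is my own illustration, not the code in the repository, and `mda_layer` and its arguments are hypothetical names:

```python
import numpy as np

def mda_layer(X, p=0.5, reg=1e-5):
    """One marginalized denoising autoencoder (mDA) layer in closed form.

    X: (d, n) matrix of n examples; p: feature corruption probability.
    Returns the (d, d+1) mapping W and the hidden representation tanh(W Xb).
    """
    d, n = X.shape
    Xb = np.vstack([X, np.ones((1, n))])          # append a constant bias feature
    S = Xb @ Xb.T                                 # scatter matrix
    q = np.full(d + 1, 1.0 - p)                   # per-feature survival probability
    q[-1] = 1.0                                   # the bias is never corrupted
    P = S[:d, :] * q                              # E[clean x corrupted^T]
    Q = S * np.outer(q, q)                        # E[corrupted x corrupted^T] ...
    np.fill_diagonal(Q, q * np.diag(S))           # ... with the exact diagonal
    # W = P Q^{-1}; Q is symmetric, so solve Q W^T = P^T instead of inverting
    W = np.linalg.solve(Q + reg * np.eye(d + 1), P.T).T
    return W, np.tanh(W @ Xb)                     # nonlinearity between layers
```

With p = 0 the mapping reduces to plain reconstruction (W approximately recovers X from Xb), which is a handy sanity check.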

piskvorky commented 9 years ago

Looks good, thanks a lot Philipp!

CC @cscorley @gojomo @temerick @maciejkula for help with benchmarks / code review :)

phdowling commented 9 years ago

Hey again! How should I proceed with this - are there maybe some common classification tasks that I could look at for benchmarks? At what point should I create a PR?

phdowling commented 9 years ago

Small update: I ran a basic text-classification benchmark on Reuters-21578, comparing mSDA to simple bag-of-words, LSI, and random noise features. The good news is that mSDA is significantly better than random noise; the bad news is that it is outperformed by LSI, which in turn is outperformed by bag-of-words features.

More detailed results are below. I'm training only on the Reuters documents, which might explain why neither LSI nor mSDA learns features that outperform bag of words. Note that this is mSDA at 200 dimensions, which is not typical: usually around 1000 dimensions or so would be used.
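For reading the results: the reported figures follow from the standard confusion-matrix definitions. A tiny helper (hypothetical, not part of my benchmark script) that reproduces the noise baseline's numbers:

```python
def confusion_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy from raw confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# The "noise" baseline counts reported below:
p, r, f1, acc = confusion_metrics(tp=1, fp=0, fn=214, tn=1785)
# p = 1.0, r ≈ 0.00465, f1 ≈ 0.00926, acc = 0.893
```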

Evaluation results (metrics rounded to three decimals):

| model | TP | FP | FN | TN | samples | accuracy | precision | recall | F1 |
|-------|----|----|----|----|---------|----------|-----------|--------|-----|
| noise | 1 | 0 | 214 | 1785 | 2000 | 0.893 | 1.000 | 0.005 | 0.009 |
| msda | 17 | 10 | 197 | 1775 | 1999 | 0.896 | 0.630 | 0.079 | 0.141 |
| bow | 119 | 15 | 95 | 1770 | 1999 | 0.945 | 0.888 | 0.556 | 0.684 |
| lsi | 83 | 26 | 131 | 1759 | 1999 | 0.921 | 0.761 | 0.388 | 0.514 |

Here's mSDA at 1000 dimensions, otherwise the same task:

| model | TP | FP | FN | TN | samples | accuracy | precision | recall | F1 |
|-------|----|----|----|----|---------|----------|-----------|--------|-----|
| msda (1000 dims) | 33 | 47 | 181 | 1738 | 1999 | 0.886 | 0.413 | 0.154 | 0.224 |

mSDA is also quite a bit slower at evaluation time, since it needs around (size_of_dictionary / output_dimensionality) + num_layers dot products to generate each representation (or chunk thereof).
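To make that cost concrete, here is a minimal sketch of applying a trained stack to a single dense vector; the names are hypothetical and this omits the chunking of the dictionary-wide first mapping, which is where the size_of_dictionary / output_dimensionality term above comes from:

```python
import numpy as np

def stacked_transform(layers, x):
    """Apply a stack of trained mDA mappings to one dense input vector.

    layers: list of per-layer weight matrices (bias column included).
    Every layer costs one matrix product plus an elementwise tanh,
    which is the num_layers part of the evaluation cost.
    """
    h = np.asarray(x, dtype=float)
    for W in layers:
        h = np.tanh(W @ np.append(h, 1.0))  # append the bias feature each layer
    return h
```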

I can't say for sure that there are no errors in my implementation, but since some useful patterns clearly are being learned, I'm so far inclined to believe that the dimensionality reduction mSDA performs is simply not a very good model.

My question now is: if that is the case, would you want mSDA in Gensim anyway? Even if it's less useful than other models, I could still see someone wanting to use it for comparative purposes at some point.

tmylk commented 8 years ago

@piskvorky Is mSDA still on the wishlist?