glouppe / kaggle-marinexplore

Code for the Kaggle Marinexplore challenge

Deep Belief Networks #7

Open glouppe opened 11 years ago

glouppe commented 11 years ago

Using the raw spectrograms as input features, the best AUC I got is 0.9307. The configuration was the following:

DBN(dropouts=0, epochs=30, epochs_pretrain=0, fan_outs=None, l2_costs=0.0001,
  l2_costs_pretrain=None, layer_sizes=[4355, 300, 300, 300, 2],
  learn_rates=0.012, learn_rates_pretrain=None, loss_funct=None,
  minibatch_size=64, minibatches_per_epoch=None, momentum=0.16,
  momentum_pretrain=None, nest_compare=True, nest_compare_pretrain=None,
  nesterov=False,
  output_act_funct=<gdbn.activationFunctions.Softmax object at 0xf2d4210>,
  real_valued_vis=True, rms_lims=None, scales=0.05, uniforms=False,
  use_re_lu=False, verbose=0)
Training...
AUC = 0.93073505487

Anyway, I think we should re-evaluate this using properly whitened input features.

glouppe commented 11 years ago

I added PCA whitening to grid.py, as we discussed (this is already becoming a mess - sorry for that). I have launched a grid-search; we will see tomorrow!
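A minimal sketch of the PCA whitening step described above, assuming scikit-learn's `PCA` with `whiten=True`. The random matrix is a synthetic stand-in for the flattened spectrograms, and none of this is taken from grid.py.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Synthetic stand-in for flattened spectrograms: 200 clips x 50 correlated features
X = rng.randn(200, 50) @ rng.randn(50, 50)

# Keep 10 principal components and rescale each one to unit variance
pca = PCA(n_components=10, whiten=True, random_state=0)
X_white = pca.fit_transform(X)

print(X_white.shape)
# The sample covariance of the whitened training data is close to the identity
print(np.round(np.cov(X_white, rowvar=False).diagonal(), 3))
```

The same fitted `pca` object would then be applied to held-out data with `pca.transform`, so that the projection and scaling are learned from the training split only.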

glouppe commented 11 years ago

Current best result, as of this morning, is 0.935 with PCA-whitened input features, using only 10 principal components! Another grid-search is ongoing.

glouppe commented 11 years ago

Current best result, using 10 principal components:

['500', '10', 'dbn', '400-400-400', '50', '0.008', '0.06']
Loading data...
Estimator setup...
gnumpy: failed to import cudamat. Using npmat instead. No GPU will be used.
DBN(dropouts=0, epochs=50, epochs_pretrain=0, fan_outs=None, l2_costs=0.0001,
  l2_costs_pretrain=None, layer_sizes=[670, 400, 400, 400, 2],
  learn_rates=0.008, learn_rates_pretrain=None, loss_funct=None,
  minibatch_size=64, minibatches_per_epoch=None, momentum=0.06,
  momentum_pretrain=None, nest_compare=True, nest_compare_pretrain=None,
  nesterov=False,
  output_act_funct=<gdbn.activationFunctions.Softmax object at 0x7bb1690>,
  real_valued_vis=True, rms_lims=None, scales=0.05, uniforms=False,
  use_re_lu=False, verbose=0)
Training...
AUC = 0.936329150275

Not that much better, but still.

This is really a pain in the ass to tune. I am not even pre-training the deep network... Shall I?

What do you think of all this? Shall we try it on the LB? Feed the hidden features into a forest? Or use them in combination with our previous features?

glouppe commented 11 years ago

I am not even pre-training the deep network... Shall I?

They seem to do that in Lee09. They learn the features on unlabeled data.

glouppe commented 11 years ago

I just tried my luck on the leaderboard. The DBN trained on all data with the parameters above yielded 0.93789. Without much surprise, this closely matches the validation score.

glouppe commented 11 years ago

Also, I couldn't get any good results with pre-training. AUC drops to 0.5... Maybe I should try to find a good set of pre-training parameters on the subsample first. The number of hyper-parameters of DBNs is really problematic...

glouppe commented 11 years ago

Seems like use_re_lu (rectified linear units) actually makes the learning converge (both err and loss in the verbose mode of DBN were oscillating otherwise, which was odd). Also, on the training subsample, using clip=1000 seems to also improve the end accuracy.

As a result, I am running a new grid-search with both use_re_lu=True and clip=1000. We'll see.

(There is also a bunch of other hyper-parameters for which I have no idea what they are for...)

glouppe commented 11 years ago

Just an update: I couldn't get any better than our previous best results (see above), even with clip=1000 and rectified linear units.

I'll stop investigating DBNs for now. I'd rather focus on feature engineering.

glouppe commented 11 years ago

(Also: the AUC drop to 0.5 seems to be due to numerical instabilities. I got some NaNs popping up... :/ )
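One way to see why a diverged network lands on exactly 0.5: once the outputs collapse to a single value, they carry no ranking information, and the AUC of a constant score is 0.5. The sketch below uses scikit-learn's `roc_auc_score`; `checked_auc` is a hypothetical helper, not something from this repository.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def checked_auc(y_true, scores):
    """Return the AUC, refusing to score predictions that contain NaN or inf."""
    scores = np.asarray(scores, dtype=float)
    if not np.all(np.isfinite(scores)):
        raise ValueError("non-finite scores: training probably diverged")
    return roc_auc_score(y_true, scores)

y_true = np.array([0, 0, 0, 1, 1])

# A collapsed model that outputs the same score for every clip
constant_auc = checked_auc(y_true, np.full(5, 0.5))
print(constant_auc)  # 0.5
```

Failing loudly on NaN makes it easier to tell a diverged run apart from a model that is merely uninformative.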

glouppe commented 11 years ago

I'll stop investigating DBNs for now. I'd rather focus on feature engineering.

What would be interesting, though, is to evaluate a convolutional neural network.

In particular: https://code.google.com/p/cuda-convnet/ (but it looks like a nightmare in terms of hyperparameters)

pprett commented 11 years ago

I experimented with DBNs as feature representations: I used the activations of the hidden units in the i-th layer as features for a linear SVM. The results are comparable to (slightly better than) using the DBN directly. What's interesting, however, is the following:

no matter what layer I use - the results are more or less the same - even when using the input layer (i.e. the raw input features). I think this might indicate that the DBN does not learn at all... what do you think?

Since I just got my CUDA SDK up and running (roughly 1/3 faster than using CPU) I'll spend some more time on this - maybe I'm lucky - I also came across this practical guide to tuning RBMs by Hinton himself... utterly esoteric [1]

[1] http://www.cs.toronto.edu/~hinton/absps/guideTR.pdf
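The experiment described above (activations of the i-th layer as inputs to a linear SVM) can be sketched as follows. This is an illustration under assumptions: the weights are random placeholders rather than those of a fitted DBN, the data is synthetic, and `layer_activations` is a hypothetical helper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

def layer_activations(X, weights, biases, layer):
    """Propagate X through the first `layer` sigmoid layers (layer=0 returns X)."""
    H = X
    for W, b in zip(weights[:layer], biases[:layer]):
        H = 1.0 / (1.0 + np.exp(-(H @ W + b)))
    return H

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

sizes = [20, 30, 30]
# Random weights as a placeholder for the weights of a trained network
weights = [rng.randn(a, b) * 0.1 for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

# Compare a linear SVM trained on the raw inputs and on each hidden layer
for layer in range(3):
    feats = layer_activations(X, weights, biases, layer)
    clf = LinearSVC(dual=False).fit(feats[:200], y[:200])
    print(layer, feats.shape[1], clf.score(feats[200:], y[200:]))
```

If every layer gives about the same score as `layer=0`, the hidden layers are not adding information beyond the raw inputs, which is the symptom reported here.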

glouppe commented 11 years ago

no matter what layer I use - the results are more or less the same - even when using the input layer (i.e. the raw input features). I think this might indicate that the DBN does not learn at all... what do you think?

My intuition is that it basically generates random non-linear features for which we merely fine-tune the final weights upon fitting. That part of the algorithm is indeed basically just backpropagation, so only the weights connecting the final layer and the outputs are well learned (the gradients fade away on the previous layers).

When they advertise Deep Belief Networks, they always mention the importance of unsupervised "pre-training", which puts the network in a state where the weights are meaningful (the idea is to pre-train each layer so that it can reproduce the original inputs, like an auto-encoder; each layer can then be viewed as a more and more abstract representation of the original features). Then, starting from that state, "fine-tuning" (training with backpropagation) is done to learn to predict the final outputs.

So basically, I would rather investigate pre-training, possibly with no or slight fine-tuning if our goal is to generate good features. I had a try a few days ago, but couldn't get anything meaningful out of it... It is really a pain to calibrate the learning regime.

pprett commented 11 years ago

2013/3/5 Gilles Louppe notifications@github.com

no matter what layer I use - the results are more or less the same - even when using the input layer (i.e. the raw input features). I think this might indicate that the DBN does not learn at all... what do you think?

My intuition is that it basically generates random non-linear features for which we fine-tune the final weights upon fitting. That part of the algorithm is indeed basically just backpropagation, so only the weights connecting the final layer and the outputs are well learned (the gradients fade away on the previous layers).

When they advertise Deep Belief Networks, they always mention the importance of unsupervised "pre-training", which puts the network in a state where the weights are meaningful (the idea is to pre-train each layer so that it can reproduce the original inputs, like an auto-encoder; each layer can then be viewed as a more and more abstract representation of the original features). Then, starting from that state, "fine-tuning" (training with backpropagation) is done to learn to predict the final outputs.

So basically, I would rather investigate pre-training, possibly with no or slight fine-tuning if our goal is to generate good features. I had a try a few days ago, but couldn't get anything meaningful out of it... It is really a pain to calibrate the learning regime.

I'm currently using pre-training w/o fine-tuning, with a linear SVC on top of that. My current best performance is 0.92 using mfcc/specs/ceps, so not really competitive. Learning is indeed very tricky, and whitening seems crucial for pre-training... I'll keep you posted.
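For readers who want to try "pre-training without fine-tuning, linear classifier on top" without gdbn, here is a rough sketch that swaps in scikit-learn's `BernoulliRBM`. It is not the setup used in this thread: `BernoulliRBM` expects inputs in [0, 1] rather than real-valued visible units, the data is synthetic, and all hyper-parameters are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=400, n_features=30, random_state=0)

model = Pipeline([
    ("scale", MinMaxScaler()),  # BernoulliRBM expects inputs in [0, 1]
    # Two RBMs fitted greedily, one after the other, without using the labels
    ("rbm1", BernoulliRBM(n_components=50, learning_rate=0.01, n_iter=10, random_state=0)),
    ("rbm2", BernoulliRBM(n_components=50, learning_rate=0.01, n_iter=10, random_state=0)),
    # Only this last step sees the labels
    ("svc", LinearSVC(dual=False)),
])
model.fit(X[:300], y[:300])
scores = model.decision_function(X[300:])
print(scores.shape)
```

Because the RBM steps ignore `y`, the hidden features never peek at the labels, which matters when they are later reused for stacking or internal validation.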

Peter Prettenhofer

pprett commented 11 years ago

my current best DBN run basically mimics the setup reported here: http://arxiv.org/pdf/1207.0580v1.pdf

However, instead of 100 pretrain epochs I used just 10, and I used 500 hidden units instead of 1000.

Here are the parameters::

gdahl_parameters = {
    "epochs": 200,
    "epochs_pretrain": 100,
    "learn_rates_pretrain": [0.001, 0.01, 0.01],
    "learn_rates": 1.0,
    "l2_costs_pretrain": 0.0001,
    "l2_costs": 0.0,
    "momentum": 0.5,
    "verbose": 2,
    "real_valued_vis": True,
    "use_re_lu": False,
    "scales": 0.01,
    "minibatch_size": 200,
    "dropouts": [0.2, 0.5, 0.5],
    #"output_act_funct": "Sigmoid",
}

gdahl_units = [X_train.shape[1]] + [500, 500] + [2]
dbn = DBN(gdahl_units, **gdahl_parameters)

It's currently running - it will take a few hours.

pprett commented 11 years ago

My current best DBN uses "epochs_pretrain": 10 instead of 100. It scores AUC: 0.961048 on our internal held-out set. It seems to be pretty sensitive to weight initialization - another run w/ the same params scored 0.959.

btw: here are the scores of the DBN - they seem to be poorly calibrated (clumped at 0 and 1, resp.)

[image: dbn_calibration]

I've uploaded a new error report for the DBN - very promising indeed - even though the ROC looks similar (not good), the FPs are very different (very good)
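A small illustration of why scores clumped at 0 and 1 need not hurt the AUC: the AUC depends only on how the scores rank the examples, so an increasing transformation leaves it unchanged. The data and the steep sigmoid below are made up for the example.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
y = rng.randint(0, 2, size=1000)
# Noisy scores that are informative about the label
scores = y + rng.randn(1000)

# Push the scores toward 0 and 1 with a steep sigmoid (an increasing map)
clumped = 1.0 / (1.0 + np.exp(-5.0 * (scores - 0.5)))

auc_raw = roc_auc_score(y, scores)
auc_clumped = roc_auc_score(y, clumped)
print(auc_raw, auc_clumped)
```

Calibration does matter once scores from several models are averaged, since a model whose outputs sit near 0 and 1 will tend to dominate the blend.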

pprett commented 11 years ago

Computing the hidden layer activation features using X-fold CV was not a good idea, since the hidden units are completely unrelated between the different folds (due to random initialization).

I'll proceed with a net w/o fine-tuning and use those activations as features; however, I don't think that this will be an improvement, since the current net hardly uses pre-training. Furthermore, I'm currently blending the DBN predictions w/ gbrt5-20x.txt .

pprett commented 11 years ago

blending DBN 200-10 w/ gbrt5x-100 did not improve upon gbrt5x-100.

The DBN itself scored AUC 0.962 on our internal held-out set and 0.9645 on the LB which is still a Top-10 entry!

glouppe commented 11 years ago

If averaging them did not improve, then we should consider stacking instead. This helped before, and I am sure it can still help.
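The difference between averaging and stacking can be sketched as below. This assumes scikit-learn throughout: `MLPClassifier` stands in for the DBN, the data is synthetic, and the logistic-regression meta-model is one common choice rather than what was used in this project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, y_tr, X_te, y_te = X[:400], y[:400], X[400:], y[400:]

base = [
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=0),
]

# Out-of-fold predictions: each training row is scored by a model that never saw it
oof = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for m in base
])
# Test-set predictions from models refitted on the full training split
test = np.column_stack([m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in base])

averaged = test.mean(axis=1)                   # blending: fixed equal weights
stacker = LogisticRegression().fit(oof, y_tr)  # stacking: weights learned from data
stacked = stacker.predict_proba(test)[:, 1]
print(stacker.coef_)
```

Stacking can down-weight a weaker or badly scaled model instead of giving it half the vote, which is one reason it may help where a plain average does not.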

glouppe commented 11 years ago

Peter, are you aware of Pylearn2? http://deeplearning.net/software/pylearn2/features.html#features

They have a lot of stuff for DBNs. It might be interesting to look at the preprocessing modules they have.

glouppe commented 11 years ago

Peter, what are your current best parameters? The ones above with epochs_pretrain=10? Do you want me to grid-search the parameters? With or without all mfcc?

pprett commented 11 years ago

My current best DBN settings are::

DBN(dropouts=[0.2, 0.5, 0.5, 0.5], epochs=200,
  epochs_pretrain=[20, 20, 20, 20], fan_outs=None, fine_tune_callback=None,
  l2_costs=0.0, l2_costs_pretrain=0.0001,
  layer_sizes=[3725, 500, 500, 250, 2], learn_rate_decays=1.0,
  learn_rate_minimums=0.0, learn_rates=1.0,
  learn_rates_pretrain=[0.0001, 0.01, 0.01, 0.01], loss_funct=None,
  minibatch_size=200, minibatches_per_epoch=None, momentum=0.9,
  momentum_pretrain=0.9, nest_compare=True, nest_compare_pretrain=None,
  nesterov=False,
  output_act_funct=<gdbn.activationFunctions.Softmax object at 0x3ef4ad0>,
  pretrain_callback=None, real_valued_vis=True, rms_lims=None, scales=0.01,
  uniforms=False, use_re_lu=False, verbose=2)
  AUC: 0.964232

It used your first RANLP features (specs 4000hz, ceps 4000hz, MFCC 16000hz - is this correct?). I ran the same configuration on all MFCC features (all datasets in grid.py); as well as on the baseline + all higher MFCCs - performance was worse - maybe my parameters are suboptimal for "more" input features. I've updated grid.py - you can reproduce the run using the following command::

python grid.py dbn 500-500-250 200 20 1.0 [0.0001,0.01,0.01,0.01]

PS: DBN does not accept a random seed, so you might get different results (I'm currently performing a second run to quantify the effect of random initialization)

pprett commented 11 years ago

Probably the most important hyper-parameter of DBN in our case is minibatch_size. The suggested size is 10, but due to the class imbalance, minibatches that small are likely to contain only negative instances. It seems that this has a significant influence on performance: e.g. if I set the minibatch_size of our current best run to 10 (from 200) performance drops to AUC: 0.83 - that's a decrease of 0.13!
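The minibatch argument can be made concrete with a little arithmetic. If examples are drawn independently and a fraction p of them is positive, a minibatch of size m contains no positive with probability (1 - p)^m. The positive rate below is a made-up value for illustration, not a figure from this thread.

```python
def prob_no_positive(pos_rate, minibatch_size):
    """Probability that an i.i.d. minibatch contains no positive example."""
    return (1.0 - pos_rate) ** minibatch_size

# Hypothetical positive rate, chosen only for illustration
pos_rate = 0.1
for m in (10, 64, 200):
    print(m, prob_no_positive(pos_rate, m))
```

With this assumed rate, roughly a third of the size-10 minibatches would contain negatives only, while for size 200 the probability is negligible.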

pprett commented 11 years ago

I've uploaded two files containing activation features. Both files were generated using a 500-500-250 DBN w/ fine-tuning - thus, these features peeked at the training labels.

dbn_act.npz: contains two arrays X_train and X_test - activations have been trained on all training data. This is used to generate submissions.

dbn_act_internal.npz: contains one array X_train, same order as training data - activations have been trained on 50% of the training data (according to our internal validation set - random_state=42); activations for the other 50% are forward propagations. This is intended to be used for our internal experiments.

When I use dbn_act_internal as features for a linear SVC I get an AUC of 0.963 (w/o regularization, i.e. hard margin), basically the same as using the DBN directly. GBRT on this data is worse than the linear SVC - only AUC 0.924 using GBRT(250,depth=7,lr=0.1,max_fx=0.3,

pprett commented 11 years ago

I used the following parameter settings for the dbn::

python grid.py dbn 500-500-250 200 20 1.0 [0.0001,0.01,0.01,0.01]

glouppe commented 11 years ago

Adding activations leads to (significantly) poorer results...

['2200', 'gbrt', '2500', '8', '0.125', '0.01', '26', '1.0']
Loading data...
importances.shape = (14270,)
X.shape = (30000, 14270)
X_train.shape = (15000, 2450)
y_train.shape = (15000,)
Estimator setup...
GradientBoostingClassifier(init=None, learning_rate=0.125, loss=deviance,
              max_depth=8, max_features=0.01, min_samples_leaf=1,
              min_samples_split=26, n_estimators=2500, random_state=None,
              subsample=1.0, verbose=0)
Training...
AUC = 0.961201427712

['2000', 'gbrt', '500', '8', '0.13', '0.01', '28', '1.0']
Loading data...
importances.shape = (14270,)
X.shape = (30000, 14270)
X_train.shape = (15000, 2250)
y_train.shape = (15000,)
Estimator setup...
GradientBoostingClassifier(init=None, learning_rate=0.13, loss=deviance,
              max_depth=8, max_features=0.01, min_samples_leaf=1,
              min_samples_split=28, n_estimators=500, random_state=None,
              subsample=1.0, verbose=0)
Training...
AUC = 0.960059874188

Both models, without activations, have an AUC > 0.97.
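The shapes in the two logs are consistent with keeping the top-k columns by importance (k = 2200 and k = 2000) and appending 250 activation features, since 2200 + 250 = 2450 and 2000 + 250 = 2250. That reading is an inference from the logs, not something stated in the thread. Under that assumption, the feature matrix could be assembled as in this sketch, where `top_k_plus_activations` is a hypothetical helper and all arrays are synthetic.

```python
import numpy as np

def top_k_plus_activations(X, importances, k, activations):
    """Keep the k columns of X with the largest importances, then append activations."""
    top = np.argsort(importances)[::-1][:k]
    return np.hstack([X[:, top], activations])

rng = np.random.RandomState(0)
X = rng.randn(100, 40)            # stand-in for the full feature matrix
importances = rng.rand(40)        # stand-in for previously computed importances
activations = rng.rand(100, 5)    # stand-in for DBN hidden-layer activations

X_sel = top_k_plus_activations(X, importances, k=10, activations=activations)
print(X_sel.shape)  # (100, 15)
```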

pprett commented 11 years ago

Gilles,

I'm currently running the DBN stacking script - I'll keep you posted once that's complete.

pprett commented 11 years ago

I've uploaded both stacking files to our Dropbox stack folder - they are called dbn-500-500-250-test.txt and dbn-500-500-250-train.txt .

pprett commented 11 years ago

will try some more hidden layers and units

pprett commented 11 years ago

did two more runs with 500, 500, 500, 250 and 500, 500, 250, 250 - both scored ~0.963 which is roughly the same as our current best results.

pprett commented 11 years ago

in order to make the ensemble more diverse I'm trying to get competitive performance using spectrograms only. My current best model scores AUC = 0.951 - it uses solely our Spectrogram Transformer - no stats features.

I'm currently fine-tuning the model - I will send you more details (ROC curve, error report) later.

pprett commented 11 years ago

I've added stacking predictions for a DBN trained on spectrograms (+ stats); it scores AUC = 0.956 on our internal test set, so it's quite inferior to our MFCC models - nevertheless, I hope it can add something to the mix - the ROC curve does look similar but the FNs and FPs look quite different compared to our GBRT predictions.

the files are named dbn-spec-100- - I used the following configuration:

python stacking.py dbn 100-100 200 100 1.0 [0.01,0.01,0.01] dbn-spec-100

glouppe commented 11 years ago

Best DBN so far ['1200', 'dbn', '500-500-500-250', '200', '20', '0.6', '[0.0001,0.01,0.01,0.1,0.1]']. It scores AUC = 0.970541061516.