glouppe / kaggle-marinexplore

Code for the Kaggle Marinexplore challenge

Cleanup input signal #1

Open · glouppe opened 11 years ago

glouppe commented 11 years ago

For the moment, I have only been experimenting with the raw signal data and its FFT transform. However, I am sure much can be gained by cleaning up the input signal.

Off the top of my head, things worth investigating include:

pprett commented 11 years ago

This NIPS09 paper might give some guidance on input features (and preprocessing):

Unsupervised feature learning for audio classification using convolutional deep belief networks, Honglak Lee, Yan Largman, Peter Pham and Andrew Y. Ng. In NIPS*2009.
http://ai.stanford.edu/~ang/papers/nips09-AudioConvolutionalDBN.pdf

They introduce convolutional deep belief networks, but in the course of the paper they also describe the current state of the art in speech recognition (and audio classification), such as MFCC and other spectrogram statistics. They get their best results by combining MFCC w/ features learned from a DBN. The DBN uses the spectrogram as input (not totally sure how this works though) - they say that whitening is a crucial preprocessing step.

As a follow up: this MO thread might also be of interest (discusses the paper): http://metaoptimize.com/qa/questions/8561/deep-belief-network-for-audio-feature-extraction

(especially the response by gdahl)

glouppe commented 11 years ago

Thanks! I'll read all this. I think we should set up some kind of protocol to assess the quality of the input features before feeding them to heavy machine learning algorithms. On the MO thread, gdahl recommends testing the features using a simple model first (e.g., an SVM).
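
Something like the following could serve as a minimal version of that protocol - a sketch, assuming X_feat is a candidate feature matrix and y the labels:

from sklearn.svm import LinearSVC
from sklearn.cross_validation import cross_val_score

def score_features(X_feat, y):
    # Score a candidate feature matrix with a simple linear model
    # before feeding it to anything heavier.
    scores = cross_val_score(LinearSVC(), X_feat, y, scoring="roc_auc", cv=5)
    return scores.mean(), scores.std()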

For the MFCCs, the Yaafe package includes an implementation, but I got strange results when playing with it (results differed from one execution to another?).

glouppe commented 11 years ago

Just read the paper by Lee. Very clear and interesting!

Things worth noting, besides the use of CDBNs:

glouppe commented 11 years ago

Here are the results of features.py, using 100 extra-trees, cv=5 on a subsample of 1000 training points.

raw = 0.7843443977
normalize(raw) = 0.662009141614
fft(raw) = 0.926136785105
fft(normalize(raw)) = 0.887522474311
fft(raw).real + fft(raw).imag = 0.872252246842
spectrogram = 0.92233896501
GaussianRandomProjection(raw) = 0.829513318568
SparseRandomProjection(raw) = 0.852775128547
GaussianRandomProjection(fft(raw)) = 0.889894183819
SparseRandomProjection(fft(raw)) = 0.896566920169

And the same using an SGDClassifier instead:

raw = 0.513781889385
normalize(raw) = 0.52960733684
fft(raw) = 0.796844059187
fft(normalize(raw)) = 0.791280968041
fft(raw).real + fft(raw).imag = 0.513781889385
spectrogram = 0.888366754857
GaussianRandomProjection(raw) = 0.505434691741
SparseRandomProjection(raw) = 0.524117423346
GaussianRandomProjection(fft(raw)) = 0.782104031188
SparseRandomProjection(fft(raw)) = 0.834647587492

In both cases, fft(raw) and spectrogram appear to be strong features.

I still have to investigate MFCC, signal denoising, PCA whitening of the spectrogram, and the like.
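
For the whitening part, something as simple as this could be a starting point (a sketch; X_spec and n_components=64 are illustrative):

from sklearn.decomposition import PCA

# PCA whitening of the (flattened) spectrogram features, as suggested by Lee et al.
pca = PCA(n_components=64, whiten=True)
X_white = pca.fit_transform(X_spec)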

glouppe commented 11 years ago

Picture http://ow.ly/i/1uHDS suggests that calls are bounded in frequency. As a result, I tried truncating the spectrogram features at some upper frequency bound. The results are quite interesting:

Extra-trees:

spectrogram = 0.936242248055
spectrogram[< 50 Hz] = 0.901118534846
spectrogram[< 100 Hz] = 0.927637427396
spectrogram[< 200 Hz] = 0.941652390254
spectrogram[< 300 Hz] = 0.946849298778
spectrogram[< 400 Hz] = 0.941359620771
spectrogram[< 500 Hz] = 0.93191926203
spectrogram[< 1000 Hz] = 0.914727076202

SGDClassifier:

spectrogram = 0.888366754857
spectrogram[< 50 Hz] = 0.888221106013
spectrogram[< 100 Hz] = 0.899041802513
spectrogram[< 200 Hz] = 0.864002717522
spectrogram[< 300 Hz] = 0.865245736798
spectrogram[< 400 Hz] = 0.873808993963
spectrogram[< 500 Hz] = 0.866218416749
spectrogram[< 1000 Hz] = 0.888366754857

All in all, this indicates that we can reduce the number of input features by at least a factor of 2 without any degradation (performance even improves).
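
For reference, the truncation is just a row filter on the spectrogram. A sketch for one clip x (Fs=2000 assumes the 2 kHz sampling rate of the recordings; NFFT/noverlap are plausible values):

from matplotlib import mlab

# Spectrogram of one clip; keep only the rows below the frequency cutoff.
Pxx, freqs, t = mlab.specgram(x, NFFT=256, noverlap=128, Fs=2000)
cutoff = 500.0  # Hz
Pxx_low = Pxx[freqs < cutoff, :]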

pprett commented 11 years ago

Very interesting - I've heard about this in speech recognition: humans have a higher resolution in the low-frequency spectrum and have trouble distinguishing sounds in the higher spectrum. We could investigate logarithmic frequency bins in the spectrogram.
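
A possible sketch of the log-binning, starting from the (Pxx, freqs) pair of a spectrogram (the number of bands and the sum aggregation are illustrative choices):

import numpy as np

# Aggregate the linear FFT bins of Pxx into 32 log-spaced frequency bands.
edges = np.logspace(np.log10(freqs[1]), np.log10(freqs[-1]), num=33)
bands = [Pxx[(freqs >= lo) & (freqs < hi), :].sum(axis=0)
         for lo, hi in zip(edges[:-1], edges[1:])]
X_log = np.vstack(bands)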


glouppe commented 11 years ago

Hmm nice - using spectrogram[< 500 Hz] as input features to an AdaBoost classifier, I got 0.91131 on the leaderboard. Still not at the very top, but I am happy to see that the gap closes without resorting to neural networks.

pprett commented 11 years ago

very nice - I'm currently playing with spectrograms - especially the NFFT and overlap parameters.


pprett commented 11 years ago

I did a quick grid search over various spectrogram parameters (NFFT, noverlap, clip) - here are the results (using SGDClassifier):

{'spectrogram__NFFT': [256, 512],
 'spectrogram__noverlap': [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
 'spectrogram__clip': [1.0, 0.75, 0.5, 0.25, 0.125]}

Best score: 0.893
Best parameters set:
spectrogram__NFFT: 256
spectrogram__clip: 0.25
spectrogram__noverlap: 0.6

The default overlap of mlab.specgram is 0.5 - so it's basically our current parameter setting :-/ BTW: matplotlib's noverlap is int(noverlap * NFFT).
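
The search itself can be set up along these lines (a sketch, using our SpectrogramTransformer from the transform module and a default SGDClassifier):

from sklearn.grid_search import GridSearchCV
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from transform import SpectrogramTransformer

pipeline = Pipeline([("spectrogram", SpectrogramTransformer()),
                     ("clf", SGDClassifier())])

param_grid = {"spectrogram__NFFT": [256, 512],
              "spectrogram__noverlap": [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
              "spectrogram__clip": [1.0, 0.75, 0.5, 0.25, 0.125]}

grid = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=5)
grid.fit(X, y)
print grid.best_score_, grid.best_params_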

glouppe commented 11 years ago

Could you put your script files in the repo, even if it is all messy? (In a new directory; we will reorganize later.)

On my side, I got the results for the summary statistics features. They seem quite strong as well, which is good news.

Trees:

spectrogram_stats(X) = 0.937920594888
spectrogram_stats(X, upper=50) = 0.913728522094
spectrogram_stats(X, upper=100) = 0.934755734997
spectrogram_stats(X, upper=200) = 0.950075679704
spectrogram_stats(X, upper=300) = 0.944184054454
spectrogram_stats(X, upper=400) = 0.940859151563
spectrogram_stats(X, upper=500) = 0.946754426865
spectrogram_stats(X, upper=1000) = 0.929121526711

SGD:

spectrogram_stats(X) = 0.853735474854
spectrogram_stats(X, upper=50) = 0.894673926738
spectrogram_stats(X, upper=100) = 0.895151200455
spectrogram_stats(X, upper=200) = 0.849708525697
spectrogram_stats(X, upper=300) = 0.796667443774
spectrogram_stats(X, upper=400) = 0.84368664658
spectrogram_stats(X, upper=500) = 0.827221615843
spectrogram_stats(X, upper=1000) = 0.853735474854

glouppe commented 11 years ago

From a scientific point of view, I'd be curious to study the variable importances of those statistical features. I am sure we could isolate some meaningful characteristics of the whale calls (e.g., basic and understandable decision rules for humans).

glouppe commented 11 years ago

Combining spectrogram(upper=500) with spectrogram_stats(upper=500) gives a score of 0.92532 on the leaderboard, still using AdaBoost. No parameter tuning.

glouppe commented 11 years ago

This paper on unsupervised feature learning looks interesting (I have only skimmed through it).

glouppe commented 11 years ago

Combining spectrogram(upper=500) with spectrogram_stats(upper=500), but using a forest of extra-trees instead gives 0.94256!

Looks like we may be able to close the gap using trees after all... :)

pprett commented 11 years ago

Gilles, sorry for the late reply - my scripts are in the "transformer" branch - I'll merge the SpectrogramTransformer in a minute.

0.94256 is quite impressive - great job!

pprett commented 11 years ago

This might be interesting:

when you post-process the spectrogram with 10. * np.log10(Pxx), you get much better results with linear classifiers - that's the same transformation that matplotlib applies before plotting Pxx:

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cross_validation import cross_val_score

from transform import SpectrogramTransformer  # our transform module

clf = Pipeline(steps=[('scale', StandardScaler()),
                      ('svm', LinearSVC(C=0.001, loss='l1', dual=True))])
st = SpectrogramTransformer(NFFT=256, clip=500, noverlap=0.6,
                            dtype=np.float64, whiten=None)
X = st.fit_transform(X)

print X.shape
scores = cross_val_score(clf, X, y, scoring="roc_auc", cv=5)
print "spectrogram =", scores.mean(), scores.std()

gives me:

spectrogram: 0.93816 (0.02066)

without log transformation:

spectrogram: 0.86609 (0.02478)

I played a bit with whitening (as proposed by Lee09) - performance gets much worse - maybe I'm holding it wrong... I tried whitening the spectrograms individually and X as a whole. My best result is with individual whitening and a single component:

X.shape (1000, 65)
spectrogram: 0.88599 (0.03992)

pprett commented 11 years ago

BTW: The log transform won't help the trees though ;-( (trees are invariant to monotone transformations of individual features)

pprett commented 11 years ago

I quickly computed the mean spectrograms of right whale calls and non-right whale calls as well as the difference between the two mean spectrograms.

What surprises me is the relatively high amplitude of the frequency band at the bottom of the pos mean spectrogram. This differs from the spectrogram reported on the forum - and it's quite distinctive, so our/my classifiers might pick it up - it might indicate a problem with the way we compute our spectrograms.

mean_pos_specg

mean_neg_specg

mean_diff_specg
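
A sketch of the computation, with X the raw signals and y the labels (Fs, NFFT and noverlap are plausible values, not necessarily the exact ones used here):

import numpy as np
from matplotlib import mlab

def mean_spectrogram(signals, NFFT=256, noverlap=128, Fs=2000):
    # Average the spectrograms of a set of clips.
    specs = [mlab.specgram(s, NFFT=NFFT, noverlap=noverlap, Fs=Fs)[0]
             for s in signals]
    return np.mean(specs, axis=0)

mean_pos = mean_spectrogram(X[y == 1])
mean_neg = mean_spectrogram(X[y == 0])
mean_diff = mean_pos - mean_neg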

glouppe commented 11 years ago

I have recomputed the mean spectrograms locally and I observe the same artefact. This is quite odd, since we compute them in exactly the same way.

I'll check in Matlab right now to see whether it appears there as well.

pprett commented 11 years ago

You can dampen the effect by fiddling with the pad_to and noverlap parameters of the spectrogram. Artefacts at the edges are pretty common for rolling computations - it might also be an effect of the window that the spectrogram uses - matplotlib uses a Hamming window, which is also the one described in the NLP & SR book.

I also tried to clip the lower 50 Hz but the performance got worse.

pprett commented 11 years ago

I did some data exploration - here are some nice examples of whale sounds that you should listen to. Both examples contain the sample index in the title - you need to add one to get the file name.

This one is interesting because it does not have the artefact at the bottom - it's actually a very good recording of an upcall - the best I've heard so far :whale2:

specg_23087

The next one is a poor recording - interesting for two reasons: a) the artefact at the bottom and b) a characteristic click sound at the end that you can see very often in the dataset (also in non-whale recordings)

specg_19130

pprett commented 11 years ago

I made some visualizations of our MFCC features - preliminary experiments using a linear SVM showed that most of the information is contained in mfcc; ceps hardly contribute at all.

Below you can see the mean MFCC profiles of the positive and negative recordings - the profiles look very similar except for a small region (23 features, X_mfcc[:, 368:391]) where there is a significant difference between positive and negative records.

mean_mfcc_profiles

pprett commented 11 years ago

The cool thing is: if I train a GBRT solely on the mfcc features 368:391, I get an AUC of 0.874751 - which is the same as the < 50 Hz linear SVM (the artefact) :microphone: vs. :whale2:
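
A sketch of that experiment (the GBRT settings are assumptions, borrowed from the quick runs elsewhere in this thread):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.cross_validation import cross_val_score

# Train a GBRT on the small discriminative MFCC region only.
clf = GradientBoostingClassifier(n_estimators=200, max_depth=6)
scores = cross_val_score(clf, X_mfcc[:, 368:391], y, scoring="roc_auc", cv=5)
print "mfcc[368:391] AUC =", scores.mean()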

glouppe commented 11 years ago

That looks quite interesting! I have been looking at it a bit further. Features from 299 onwards correspond to the stat features. The first 115 are for axis=1 and the 65 after that for axis=0.

On axis=1, the large gap that we see corresponds to the variance statistic. If you zoom in, there is also a clear difference for the max statistic (the bump at features 23:46). The 3 remaining stats (min, mean, median) closely match each other though.

axis1

On axis=0 though, all stat metrics seem equally interesting. There always seems to be a possible cut between the negative and the positive examples.

axis0

glouppe commented 11 years ago

Some stats extracted from et-importances.txt

importances

feature set             cumulative imp      mean imp of a feature
------------------------------------------------------------------
specs                   0.162032208626      0.000127183837226
specs_stats_axis1       0.0440801794733     8.99595499455e-05
specs_stats_axis0       0.102573915657      0.00157806024087
ceps                    0.1591890038        0.000180486398866
ceps_stats_axis1        0.114974394981      0.00023464162241
ceps_stats_axis0        0.0346732776882     0.00077051728196
mfcc                    0.267061025903      0.000893180688639
mfcc_stats_axis1        0.0701973093015     0.00061041138523
mfcc_stats_axis0        0.0452186845697     0.000695672070303
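
Numbers like these can be obtained with a small aggregation over the fitted forest's importances - a sketch (the slices in groups are illustrative, not the real layout):

import numpy as np

# importances = clf.feature_importances_ of the fitted ExtraTreesClassifier;
# groups maps each feature set to its slice of the feature matrix.
groups = {"specs": slice(0, 1274),    # illustrative boundaries
          "mfcc": slice(1274, 1573)}

for name, sl in groups.items():
    imp = importances[sl]
    print name, imp.sum(), imp.mean()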

glouppe commented 11 years ago

We should have a closer look at the peaks. Most of them seem to correspond to statistics, if I am not mistaken. We should figure out which ones and why. There is also a peak in the spectrogram - does it correspond to middle frequencies or to windows in the middle in time?

glouppe commented 11 years ago

Shall we try to prune the unimportant features? This may help generalization.

glouppe commented 11 years ago

After many tries with different models (DBNs, multiframes, etc), without great success (nothing better than what we already had), I have come to think that feature extraction is really the only direction worth investigating. In my opinion, it is only with good features that we will be able to build (very) good models.

What is your opinion, Peter?

So in the short term, I think we should try to tune the features we have, improve those that work best and prune those that have slight or no effect.

glouppe commented 11 years ago

So, I have been looking at feature importances again. If you extract them and reshape them to fit the dimensions of the specs, ceps and mfcc arrays, you get the following figures (a plotting sketch follows the comments below). In all three figures, the vertical axis corresponds to time bins, and the horizontal axis to coefficients/frequency bands/whatever.

imp-specs imp-ceps imp-mfcc

A few comments:

  • On all of them, the most important features are at mid-time. This suggests we could simply strip away the coefficients computed at the beginning and the end of the audio sample.

  • On MFCC, only the first 6 coefficients seem to be useful. It would be interesting to try to compute more coefficients in that "region" (dunno yet if that makes sense?).
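
For reference, here is a sketch of how such maps can be produced (specs_slice and specs_shape are illustrative names for one block's columns and its (time bins, coefficients) shape):

import matplotlib.pyplot as plt

# Reshape the importances of one block back to its 2d layout and plot it.
imp_specs = importances[specs_slice].reshape(specs_shape)
plt.imshow(imp_specs, origin="lower", aspect="auto")
plt.ylabel("time bin")
plt.xlabel("coefficient / frequency band")
plt.show()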

pprett commented 11 years ago

A few comments:

  • On all of them, the most important features are at mid-time. This suggests we could simply strip away the coefficients computed at the beginning and the end of the audio sample.

True - but let's first check that those are not the kind of recordings that we get wrong (upcall at beginning/end vs. mid). Furthermore, the clear shape of the mean positive spectrogram makes me believe that the (positive) recordings might be "centered" - i.e. they might have been picked by hand.

  • On MFCC, only the first 6 coefficients seem to be useful. It would be interesting to try to compute more coefficients in that "region" (dunno yet if that makes sense?).

That sounds reasonable - I have to re-read the MFCC section in the NLP book - I don't recall what the first coefficients stand for...

glouppe commented 11 years ago

Also, I analyzed which features were the most important among those generated by StatsTransformer:

  • specs, axis=1: min
  • specs, axis=0: min, max, var
  • ceps, axis=1: max, median
  • ceps, axis=0: mean
  • mfcc, axis=1: min, max, var
  • mfcc, axis=0: max

Clearly, both min and max appear to be valuable operators. This makes me think that we could try to add more operators of that type:

  • retrieve the top k peaks (and vice-versa for min)
  • compute min/max over "blocks", like max-pooling in a neural network (see the sketch below)
  • compute overall stats for the flattened data
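
A sketch of the block min/max idea (block and step are hypothetical parameters):

import numpy as np

def block_pool(X, block=10, step=5, op=np.max):
    # Slide a window of `block` features with stride `step` over each row
    # and keep one statistic per window, like max-pooling in a neural net.
    return np.array([[op(row[i:i + block])
                      for i in range(0, len(row) - block + 1, step)]
                     for row in X])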

pprett commented 11 years ago

I tried percentiles - didn't help

min/max over blocks sounds interesting - haven't tried that yet

overall stats similar to a "color histogram" in image processing?


pprett commented 11 years ago

Here is the blog post by Marinexplore advertising the competition - look at the cleaned spectrogram, it looks neat:

http://marinexplore.com/blog/whale-detection-challenge-and-the-benefit-of-ocean-analytics/

I've looked up some work on spectrogram cleaning - most approaches basically apply image processing techniques (e.g. Hough transform, Wiener filter):

  1. Bird Song Recognition through Spectrogram Processing and Labeling, Katie Wolf http://www.cra.org/Activities/craw_archive/dmp/awards/2009/Wolf/DREU/project/final_report/final_report.pdf

    The author studies different spectrogram cleaning techniques; even though those techniques are IMHO not properly evaluated, some parts are quite interesting: they want to remove low-frequency background noise (running water from a nearby stream) from the spectrogram. We instead would like to remove high-frequency background noise (ship propeller).

  2. Spectrogram Enhancement By Edge Detection Approach Applied To Bioacoustics Calls Classification http://airccse.org/journal/sipij/papers/3212sipij01.pdf

    The authors use spectrogram information to classify the sounds of different bat species. They study two "spectrogram enhancements": dynamic range (something similar is also available in Praat) and an edge detector.

pprett commented 11 years ago

Here are some results of applying different filters to the spectrogram below (it's X[5]):

spec_5

Here is the median filter (scipy.signal.medfilt2d):

spec_5_med_3

Here is the Wiener filter (scipy.signal.wiener - default params):

spec_5_wiener
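
Both filters are straight applications of the scipy functions to the spectrogram array - a sketch (spectrogram parameters are illustrative):

from matplotlib import mlab
from scipy.signal import medfilt2d, wiener

# Spectrogram of clip X[5], then the two denoising filters.
Pxx, freqs, t = mlab.specgram(X[5], NFFT=256, noverlap=128, Fs=2000)
Pxx_med = medfilt2d(Pxx, kernel_size=3)  # median filter
Pxx_wien = wiener(Pxx)                   # Wiener filter, default params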

pprett commented 11 years ago

That's the result of applying a Wiener filter to the raw signal and then Wiener-filtering the spectrogram:

spec_5_wiener_wiener

pprett commented 11 years ago

Here is a random negative example (X[1]):

That's the raw spectrogram

spec_2

Here is the wiener^2 version

spec_2_wiener_wiener

glouppe commented 11 years ago

Can't wait to see whether or not it improves the baseline.

pprett commented 11 years ago

I did a quick run on train_small using a GBRT w/ 200 trees of depth 6:

baseline: AUC: 0.915751
wiener: AUC: 0.933529
wiener^2: AUC: 0.937875

I'll now run it on our internal test set - I'll keep you posted.

glouppe commented 11 years ago

Quite nice! Could you paste or commit the code to generate wiener^2? Whatever the results on our internal test set, I'd like to include it in our full set of features to recompute the overall feature importances.

pprett commented 11 years ago

I've pushed the transformer to master; it's in the transform module.

You can use it in a Pipeline as follows:

from scipy import signal
from sklearn.pipeline import Pipeline
from transform import (FilterTransformer, SpectrogramTransformer,
                       FlattenTransformer)

pipeline = Pipeline([
    ("wiener1", FilterTransformer(signal.wiener, noise=True)),
    ("spec", SpectrogramTransformer(flatten=False, clip=500.0, noverlap=0.5)),
    ("wiener2", FilterTransformer(signal.wiener)),
    ("flatten", FlattenTransformer()),
])

pprett commented 11 years ago

Sometimes the first Wiener filter raises a division-by-zero error - therefore I added some Gaussian noise to the signal. If you get another error, increase the factor before the noise in transform/__init__.py. Given the minor difference between wiener^1 and wiener^2, I'd rather use wiener^1 - it has no such issues.

pprett commented 11 years ago

Now I've got results on our internal test set:

raw specg - AUC: 0.935533
wiener^1   - AUC: 0.941330

Not as huge an improvement as on the subsample, but still an improvement!

pprett commented 11 years ago

wiener^2 was worse than wiener^1 - probably due to the high noise factor I had to add to the signal: AUC: 0.939481

glouppe commented 11 years ago

Here are the new feature importances. The wiener^1 features appear to be very strong! I'll start a grid-search for GBRT based on that...

imp-rf2

For further reference, the importances were computed using a RandomForestClassifier, 1000 trees, max_features=auto and min_samples_split=1.

pprett commented 11 years ago

cool - looking forward to the grid search results

glouppe commented 11 years ago

The grid is quite busy - I only have 10 jobs running in parallel - but so far, I couldn't get any better results than what we had before (i.e., nothing better than 0.9715)... The best model I have at the moment scored 0.971477307461. I'll keep you updated.

glouppe commented 11 years ago

I have just added a PoolTransformer in transform.pool. I'll play a bit with it to try to find good parameters (block size and step size).

glouppe commented 11 years ago

Results of PoolTransformer used in place of FlattenTransformer in load_data, for a RandomForestClassifier built on 1000 trees with default parameters:

I don't know whether these results are significantly poorer, but they clearly do not seem better.

Too bad :(