keunwoochoi / kapre

kapre: Keras Audio Preprocessors
MIT License

Spectrogram integration issues #36

Closed · jake-g closed this issue 6 years ago

jake-g commented 6 years ago

I'm a bit stumped on this. TL;DR: I'm getting weird behavior using kapre as a replacement for precomputed spectrogram features.

I have a model, and traditionally I have precomputed 64-mel x 128-frame spectrograms and fed them into the first layer. I tried integrating kapre because it seemed like a great idea (and I still think it is). I added the kapre mel spectrogram layer, tuned the same way I was generating my precomputed (librosa-based) ones, with nothing about it trainable. My new input was pickled raw mono 16 kHz wav files (~5 seconds each).

I started training and noticed there was very little learning taking place compared to the original model. I poked around and tried adjusting my input shape, making sure it was (None, 1, 79872), where ~80k is the number of samples per wav.
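
Concretely, the front end looks roughly like this (a minimal sketch, not my exact code; it assumes the kapre 0.1.x Melspectrogram API, and the n_dft/n_hop values are illustrative choices that give about 64 mels x 128 frames for ~5 s of 16 kHz mono audio):

from keras.models import Sequential
from keras.layers import Conv2D
from kapre.time_frequency import Melspectrogram

SR = 16000          # raw wavs are mono 16 kHz
N_SAMPLES = 79872   # ~5 seconds per clip; model input shape is (1, N_SAMPLES)

model = Sequential()
model.add(Melspectrogram(sr=SR, n_mels=64,
                         n_dft=1024, n_hop=624,         # illustrative window/hop sizes
                         power_melgram=2.0,
                         return_decibel_melgram=True,   # log-power output, like the precomputed melgrams
                         trainable_fb=False, trainable_kernel=False,
                         input_shape=(1, N_SAMPLES),
                         name='log-power-mel-spec'))
model.add(Conv2D(64, (7, 7), padding='same'))           # first conv of the existing stack
# ... rest of the conv/dense stack unchanged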

I also did a spectrogram comparison similar to the one in examples/, and the kapre version looked nearly identical to my original. The values were scaled slightly differently, but more or less contained the same information. For example, [ -14.019988 -11.445856] became [ -51.93689 -49.89946 ]; see the attached specs for comparison:

Original version: (attached spectrogram, test_spec_original)

Kapre version: (attached spectrogram, test_spec_kapre)

They basically look the same, which is why I'm confused/surprised the kapre version doesn't train. I tried with and without normalization (frequency-wise), transposing, and scaling differently, and I always have the same issue: training seems to stop improving after a few epochs. My original model trained for > 50 epochs before it stopped improving.
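
For reference, the comparison above was roughly along these lines (a sketch, assuming librosa's melspectrogram/power_to_db API and the same illustrative kapre settings as the sketch above; 'test.wav' is a placeholder path):

import numpy as np
import librosa
from keras.models import Sequential
from kapre.time_frequency import Melspectrogram

sr = 16000
src, _ = librosa.load('test.wav', sr=sr, mono=True)

# "Original" precomputed melgram, librosa-based
mel = librosa.feature.melspectrogram(y=src, sr=sr, n_fft=1024, hop_length=624, n_mels=64)
mel_db = librosa.power_to_db(mel, ref=np.max)

# kapre melgram, using a model that contains only the Melspectrogram layer
mel_only = Sequential([Melspectrogram(sr=sr, n_mels=64, n_dft=1024, n_hop=624,
                                      power_melgram=2.0, return_decibel_melgram=True,
                                      input_shape=(1, len(src)))])
kapre_db = mel_only.predict(src[np.newaxis, np.newaxis, :])[0, :, :, 0]

print(mel_db.shape, kapre_db.shape)                                 # both roughly (64, 128)
print(mel_db.min(), mel_db.max(), kapre_db.min(), kapre_db.max())   # values differ by an offset/scale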

At this point I'm trying to figure out if this is a me problem or something going on with kapre, so I figured it was worth a shot asking. Thanks for any help resolving this!

Lastly, here is a snippet of my model summary, for both the old model and the new kapre one:

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
====================================================================
in_layer (InputLayer)           (None, 64, 128)      0                                            
__________________________________________________________________________________________________
reshape_1 (Reshape)             (None, 64, 128, 1)   0           in_layer[0][0]                   
__________________________________________________________________________________________________
conv2d_1 (Conv2D)               (None, 64, 128, 64)  3200        reshape_1[0][0]                  
__________________________________________________________________________________________________
batch_normalization_1 (BatchNor (None, 64, 128, 64)  256         conv2d_1[0][0]                   
__________________________________________________________________________________________________
elu_1 (ELU)                     (None, 64, 128, 64)  0           batch_normalization_1[0][0]      
__________________________________________________________________________________________________
max_pooling2d_1 (MaxPooling2D)  (None, 32, 64, 64)   0           elu_1[0][0]                      
__________________________________________________________________________________________________
dropout_1 (Dropout)             (None, 32, 64, 64)   0           max_pooling2d_1[0][0]            
__________________________________________________________________________________________________
conv2d_2 (Conv2D)               (None, 32, 64, 64)   200768      dropout_1[0][0]                  
__________________________________________________________________________________________________
batch_normalization_2 (BatchNor (None, 32, 64, 64)   256         conv2d_2[0][0]                   
__________________________________________________________________________________________________
elu_2 (ELU)                     (None, 32, 64, 64)   0           batch_normalization_2[0][0]      
__________________________________________________________________________________________________
.... followed by 2 more similar conv layers and some dense layers.
The output is multiple scalar values,
the cost function uses MSE for each output,
and this has traditionally worked well.
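
For the output side described above, roughly something like this (a sketch; the number of targets and layer sizes are illustrative, not the real ones):

from keras.models import Model
from keras.layers import Input, Flatten, Dense

# Placeholder for the last conv block's output; the real shape isn't shown here
feats = Input(shape=(8, 16, 64), name='conv_features')
x = Flatten()(feats)
x = Dense(128, activation='relu')(x)
outputs = [Dense(1, name='target_%d' % i)(x) for i in range(3)]   # 3 scalar targets is illustrative

head = Model(feats, outputs)
head.compile(optimizer='adam', loss='mse')   # MSE applied to each scalar output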

Similarly, the kapre version looked like this, identical other than the first few layers:

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
===================================================================
in_layer (InputLayer)           (None, 1, 79872)     0                                            
__________________________________________________________________________________________________
log-power-mel-spec (Melspectrog (None, 64, 128, 1)   1083456     in_layer[0][0]                   
__________________________________________________________________________________________________
conv2d_1 (Conv2D)               (None, 64, 128, 64)  3200        in_layer[0][0]                  
__________________________________________________________________________________________________
... and the same other stuff

Finally, here is a plot of a validation metric, where the gray line is the kapre one: (attached image)

keunwoochoi commented 6 years ago

Hi, so it's not even a matter of transferring a model, but of learning a new model after changing the pre-computed melgrams to the on-the-fly kapre ones? Then I think the value range could matter. Have you tried putting BN after the kapre melgram layer? In my experience, non-zero-mean inputs sometimes didn't work very well.
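
Something like this is what I mean (just a sketch; the melgram settings below are illustrative placeholders, keep whatever you already use):

from keras.models import Sequential
from keras.layers import BatchNormalization, Conv2D
from kapre.time_frequency import Melspectrogram

model = Sequential()
model.add(Melspectrogram(sr=16000, n_mels=64, n_dft=1024, n_hop=624,
                         power_melgram=2.0, return_decibel_melgram=True,
                         input_shape=(1, 79872), name='log-power-mel-spec'))
model.add(BatchNormalization(axis=3))   # axis=3 is the channel axis for channels_last melgrams
model.add(Conv2D(64, (7, 7), padding='same'))
# ... rest of the network as before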

jake-g commented 6 years ago

Well, I'm not trying to reuse my weights from the original model; I'm retraining from scratch, but I still expect a model of similar performance to be learned since the input is very similar. I'll try the BN layer and report back, thanks.

UPDATE: I added BN after the melgram layer, before the first conv, with axis=3 (the channel axis). The same issue occurred: pretty much an identical training/validation curve as above. My original model, with and without precomputed-melgram frequency normalization, trains much better.

jake-g commented 6 years ago

Fixed! It was indeed a me problem.

It turns out the way I was creating the pickled raw audio numpy arrays was flawed, so when they were passed to the Melspectrogram layer, the result was pretty much all -inf or -80 dB, which obviously wouldn't train well.

When investigating and plotting the kapre melgram output in my original post, I loaded the wav file directly, so the flaw mentioned above was avoided.
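
In case it helps anyone else, here is the kind of sanity check that would have caught this (a sketch; file names are placeholders, and the kapre settings are the same illustrative ones as above):

import pickle
import numpy as np
import librosa
from keras.models import Sequential
from kapre.time_frequency import Melspectrogram

with open('clip.pkl', 'rb') as f:                            # the pickled "raw audio" array
    pickled = np.asarray(pickle.load(f), dtype=np.float32).squeeze()
direct, _ = librosa.load('clip.wav', sr=16000, mono=True)    # the same clip, loaded directly

print(pickled.dtype, pickled.min(), pickled.max())           # should be float, roughly in [-1, 1]
print(direct.dtype, direct.min(), direct.max())

# Run the pickled array through the mel layer and check it isn't stuck at the dB floor
mel_only = Sequential([Melspectrogram(sr=16000, n_mels=64, n_dft=1024, n_hop=624,
                                      power_melgram=2.0, return_decibel_melgram=True,
                                      input_shape=(1, len(pickled)))])
melgram = mel_only.predict(pickled[np.newaxis, np.newaxis, :])
print(melgram.min(), melgram.max())   # a melgram pinned at the dB floor means the layer sees near-silence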

Anyway, it looks like the kapre version is training very similarly to my original model.

keunwoochoi commented 6 years ago

Glad to hear that :)