keunwoochoi / music-auto_tagging-keras

Music auto-tagging models and trained weights in keras/theano
MIT License

Further elaboration on Permute/Reshape for CNN/RNN stack #9

Closed faroit closed 7 years ago

faroit commented 7 years ago

Finally had some time to dig deeper into your auto-tagging network and compare it to what I am currently using for classifying music. I can confirm that CRNNs work nicely, but I stacked mine a bit differently. My main question is how you actually connect the CNN and RNN layers.

Let me explain it in terms of your layers/code. These are the main ingredients of your network:

input             -> (nb_samples, 1, time, bins)
conv              -> (nb_samples, nb_filters, new_time, new_bins)
permute(2, 1, 3)  -> (nb_samples, new_time, nb_filters, new_bins)
reshape(15, 128)  -> (nb_samples, new_time * new_bins, nb_filters)

So essentially your RNN sees time*bins sequences of the filters.

Another minor question: while comparing your code I wanted to adjust the reshape to my input and thought: "hey, let's use an unknown dimension", so I wouldn't have to calculate the exact shape from the pooling operations. Unfortunately Reshape((-1, 128)) does not work. Do you know a workaround in keras?
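Until Reshape accepts -1, one workaround is to compute the collapsed dimension yourself from the padding and pooling strides. A minimal sketch in plain Python (`pooled_length` is a hypothetical helper, not from the repo; the numbers match the network above):

```python
def pooled_length(n, pool_strides):
    """Integer-divide a length by each pooling stride in turn."""
    for s in pool_strides:
        n //= s
    return n

# time axis: 1366 samples, zero-padded by 37 on each side, then
# pooled by strides 2, 3, 4, 4 as in the repo's code
n_time = pooled_length(1366 + 2 * 37, [2, 3, 4, 4])
print(n_time)  # -> 15
```

The computed value can then be passed to Reshape instead of a hard-coded 15.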

keunwoochoi commented 7 years ago

Hm, I understand it as this:

input : (nb_samples, 1, n_freq, n_time) # it's freq and time... not sure it aligns with others' conventions.
conv's : (nb_samples, nb_filters=128, n_freq=1, n_time=15)
permute((3, 1, 2)) : (nb_samples, n_time, nb_filters, n_freq)
reshape((15, 128)) : (nb_samples, 15, 128)

So I can answer, but once the time/frequency confusion is resolved, probably no question would remain.

faroit commented 7 years ago

Thanks for your reply. Still not clear to me, though, sorry ;-)

conv's : (nb_samples, nb_filters=128, n_freq=1, n_time=15)

does this mean you are actually not using the (3, 3) filters as in this repo's code, but instead filtering only over the time dimension? Have you tried using 1D convolution instead, then?

reshape((15, 128)) : (nb_samples, 15, 128)

so, again: can you confirm that 15 = n_time * n_freq? I guess there is some misunderstanding here. Let's calculate (taken from your code):

input_shape = (1, 96, 1366)
# time: 1366
# mel freq: 96
...
x = ZeroPadding2D(padding=(0, 37))(melgram_input)
# time: 37 + 1366 + 37 = 1440
# freq: 96

...
x = MaxPooling2D(pool_size=(2, 2), strides=(2, 2), name='pool1')(x)
# time: 1440 / 2 = 720
# freq: 96 / 2 = 48 
...
x = MaxPooling2D(pool_size=(3, 3), strides=(3, 3), name='pool2')(x)
# time: 720 / 3 = 240
# freq: 48 / 3 = 16
...
x = MaxPooling2D(pool_size=(4, 4), strides=(4, 4), name='pool3')(x)
# time 240 / 4 = 60
# freq: 16 / 4 = 4
...
x = MaxPooling2D(pool_size=(4, 4), strides=(4, 4), name='pool4')(x)
# time: 60 / 4 = 15
# freq: 4 / 4 = 1
...
x = Permute((3, 1, 2))(x)
x = Reshape((15, 128))(x)

that would actually mean that n_freq in your example is 1. So it's not clear what your reshape operation is meant to do: squash down time or frequency? This is because your pooling/downsampling reduces the frequency dimension to exactly 1. Is this intentional?
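In NumPy terms, the shape bookkeeping above plays out as follows (batch axis included, 'th' dim ordering as in the repo's code; `transpose`/`reshape` stand in for the Keras Permute/Reshape layers):

```python
import numpy as np

# post-pooling conv output: (nb_samples, nb_filters, n_freq, n_time)
x = np.zeros((2, 128, 1, 15))
x = np.transpose(x, (0, 3, 1, 2))   # Permute((3, 1, 2)) -> (2, 15, 128, 1)
x = x.reshape((2, 15, 128))         # Reshape((15, 128)) -> (2, 15, 128)
print(x.shape)                      # -> (2, 15, 128)
```

Since n_freq == 1 here, the reshape just drops the singleton frequency axis, leaving 15 time steps of 128-dimensional filter activations for the RNN.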

keunwoochoi commented 7 years ago

No problem ;)

I used (3, 3) filters in this repo, and also in my papers. I also tried 1D convolutions for comparison, link.

15 = n_time * n_freq = n_time * 1, as n_freq becomes 1 - which is intentional.

Reshape is just to reduce dimensionality.

faroit commented 7 years ago

15 = n_time * n_freq = n_time * 1, as n_freq becomes 1 - which is intentional.

okay, so if the frequency resolution changed, you would adjust the pooling so that it is always squashed to exactly 1? What happens if it's not 1 but higher? My intuition would be to reshape to (n_time, n_freq*nb_filters) instead.

Reshape is just to reduce dimensionality.

hm, I don't agree, since this also changes what the recurrent layer sees as the step/sequence dimension. So if you did not squash frequency down to 1, you would increase the number of time steps for your recurrent layer.

keunwoochoi commented 7 years ago

okay, so if the frequency resolution changed, you would adjust the pooling so that it is always squashed to exactly 1?

I'd try that, but that's not necessary.

What happens if it's not 1 but higher? My intuition would be to reshape to (n_time, n_freq*nb_filters) instead.

I agree with you. It doesn't need to be 1 and yes, that should be the way.

hm, ..

What I meant is the case when n_freq == 1. But as just mentioned, I'd increase the vector dimension rather than the number of time steps.
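As a sketch of that alternative with NumPy (hypothetical shapes, assuming pooling left n_freq = 4 rather than 1):

```python
import numpy as np

# if pooling leaves n_freq > 1, fold frequency into the feature axis
# instead of the time axis, so the RNN still sees n_time steps
nb_samples, nb_filters, n_freq, n_time = 2, 128, 4, 15
x = np.zeros((nb_samples, nb_filters, n_freq, n_time))
x = np.transpose(x, (0, 3, 1, 2))                         # (2, 15, 128, 4)
x = x.reshape((nb_samples, n_time, nb_filters * n_freq))  # (2, 15, 512)
print(x.shape)                                            # -> (2, 15, 512)
```

The time-step count stays at 15; each step's feature vector grows to nb_filters * n_freq = 512.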

faroit commented 7 years ago

I would love to run some more tests, but the Reshape layer not supporting -1 makes that really cumbersome. I just opened an issue: https://github.com/fchollet/keras/issues/4916. Maybe you want to support it 😄

faroit commented 7 years ago

The reshaping is fixed in keras now. You can adjust https://github.com/fchollet/keras/blob/master/keras/applications/music_tagger_crnn.py#L122 to use Reshape((-1, ...)).

keunwoochoi commented 7 years ago

Thanks. But the code still expects a fixed input shape (https://github.com/fchollet/keras/blob/master/keras/applications/music_tagger_crnn.py#L69), so there is more to fix before it is flexible with respect to input shape...

faroit commented 7 years ago

Closing this for now. I will postpone any keras contributions until they have merged with tensorflow.