mravanelli / SincNet

SincNet is a neural architecture for efficiently processing raw audio samples.
MIT License

Taking the whole speech sequence as input without chunking #39

Closed bwang482 closed 5 years ago

bwang482 commented 5 years ago

Hi Mirco, many thanks for the great work and for sharing the code! It's super useful for the work I am doing. So for the speaker identification task, each training sample has a length of 3200 samples, since fs=16000; cw_len=200; wlen=int(fs*cw_len/1000.00)=3200. And for testing, voting over even smaller chunks is performed.

I have a speech classification task for which, I think, it's best to take the whole speech sequence as input without chunking. Currently, is it possible to use variable-length input sequences with SincNet at all? If not, then I would pad each batch with zeros to the max length (within each batch). Would SincNet be affected by padded batches?

Oh, by the way, my speech sequences actually vary in length by a lot. What would be your suggestion, Mirco? Many thanks again!
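For reference, a minimal sketch of the per-batch zero-padding mentioned above, assuming the waveforms are already loaded as 1-D PyTorch tensors (pad_to_max is a hypothetical helper, not part of SincNet):

```python
import torch

def pad_to_max(waveforms):
    # waveforms: list of 1-D tensors of different lengths
    max_len = max(w.shape[0] for w in waveforms)
    batch = torch.zeros(len(waveforms), max_len)
    lengths = torch.zeros(len(waveforms), dtype=torch.long)
    for i, w in enumerate(waveforms):
        batch[i, :w.shape[0]] = w   # zero-pad on the right
        lengths[i] = w.shape[0]     # keep true lengths for masking later
    return batch, lengths
```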

mravanelli commented 5 years ago

Hi, thank you very much for your interest in my work! Given a variable-length input sequence, do you want a fixed-length vector or a variable-length sequence as output? Which classification task are you considering?

Best,

Mirco


bwang482 commented 5 years ago

@mravanelli Speech emotion classification with 5-6 categorical emotion classes, which I think is best not to chunk, right?

Actually, whether the output has a fixed or variable length does not matter much, since I have a layer after SincNet that outputs a fixed-length vector representation over the whole input sequence.

mravanelli commented 5 years ago

OK, chunking is normally a good idea. However, if you would like to avoid chunking, you can just feed the convolutional neural network a sequence of samples and get as output another sequence, whose length and dimensionality depend on the hyperparameters of the convolutional layers (filter length, number of filters, ...). The time context used by each prediction depends on the receptive field of the convolutional neural network (more depth, more context). If you would like an implementation closer to the latter case, you might want to take a look here: https://github.com/santi-pdp/pase

Best,

Mirco
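A minimal sketch of the fully convolutional setup described above, assuming standard PyTorch Conv1d layers (the kernel sizes and strides are illustrative, not the exact SincNet configuration):

```python
import torch
import torch.nn as nn

# Three 1-D convolutional layers: the network accepts any input length,
# and the output length depends only on the layer hyperparameters.
conv_stack = nn.Sequential(
    nn.Conv1d(1, 80, kernel_size=251, stride=10), nn.ReLU(),
    nn.Conv1d(80, 60, kernel_size=5, stride=1), nn.ReLU(),
    nn.Conv1d(60, 60, kernel_size=5, stride=1), nn.ReLU(),
)

wav = torch.randn(1, 1, 48000)   # e.g. a 3-second utterance at 16 kHz
feats = conv_stack(wav)          # shape (1, 60, T'), where T' depends on the input length
print(feats.shape)
```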


bwang482 commented 5 years ago

Thanks @mravanelli !

Can I please ask why chunking is normally a good idea? Reading your code, you extract one cw_len-long chunk from each original input sample. Therefore you are not creating more training data in this case, but rather treating each small chunk as representative of the original speech sample. It seems a bit counter-intuitive, if I may say so (at least for someone like me with little experience in speech processing).

bwang482 commented 5 years ago

Actually, my bad. You are creating more training samples in this case. I still don't understand why this would be a good idea in general.
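For context, a simplified sketch of the random chunking idea discussed here: one cw_len-millisecond window is drawn from a random position in each utterance, so different iterations see different parts of the same recording (this is a paraphrase of the idea, not the exact batching code in the repository):

```python
import numpy as np

fs, cw_len = 16000, 200
wlen = int(fs * cw_len / 1000.0)  # 3200 samples per chunk

def random_chunk(signal):
    # signal: 1-D numpy array holding one utterance (assumed longer than wlen)
    start = np.random.randint(0, signal.shape[0] - wlen)
    return signal[start:start + wlen]
```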

mravanelli commented 5 years ago

Actually, the difference between chunking and non-chunking is not that big. Even if you do not explicitly do chunking, you implicitly do it through the convolution operations and their limited receptive field. In our case, we have input frames of 200 ms and three convolutional layers followed by a couple of fully connected layers. The advantage is that with a relatively shallow neural network (only 5 layers) we can embed a pretty large context. If you would like to embed the same context without chunking and using convolutional layers only, you need many more convolutional layers (because the receptive field increases with the depth of the network). That said, both solutions can work equally well.

Mirco
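A small sketch of how the receptive field grows with depth, using the generic formula for stacked 1-D convolutions (the kernel sizes and strides below are illustrative only):

```python
def receptive_field(layers):
    # layers: list of (kernel_size, stride) tuples, first layer first
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) * jump
        jump *= stride             # strides compound across layers
    return rf

# e.g. three conv layers roughly in the spirit of a raw-waveform front end
print(receptive_field([(251, 10), (5, 1), (5, 1)]))  # receptive field in raw-audio samples
```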


hdubey commented 5 years ago

Hi Mirco, thanks for these suggestions. I want to run speaker ID at the utterance level, just for the sake of comparing results. What changes in the config file and code are needed to run utterance-level speaker ID on TIMIT? Perhaps global pooling before the final layer, or is there a better workaround?

mravanelli commented 5 years ago

Actually, utterance-level speaker ID is what we are doing, because we average the predictions over all the chunks.

Mirco
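In code, the utterance-level decision amounts to averaging the per-chunk outputs, roughly as in this simplified sketch (the actual evaluation code in the repository differs in its details):

```python
import torch

def utterance_prediction(model, chunks):
    # chunks: tensor of shape (num_chunks, wlen) covering the whole utterance
    with torch.no_grad():
        log_probs = model(chunks)       # (num_chunks, num_speakers)
        avg = log_probs.mean(dim=0)     # average the chunk-level predictions
    return avg.argmax().item()          # one speaker label per utterance
```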


hdubey commented 5 years ago

In my case, I want max pooling, as in computer vision papers. How can I do that?

hdubey commented 5 years ago

@mravanelli If I am not mistaken, the network trains at the frame level (10 ms), right? I want to do global max-pooling just before the softmax layer and train with an utterance-level loss function.
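A minimal sketch of what that could look like, assuming the convolutional front end is used as a feature extractor over the whole utterance (the class and attribute names here are hypothetical, not the ones in the repository):

```python
import torch
import torch.nn as nn

class UtteranceClassifier(nn.Module):
    def __init__(self, frontend, feat_dim, num_classes):
        super().__init__()
        self.frontend = frontend                 # convolutional feature extractor
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, wav):
        feats = self.frontend(wav)               # (batch, feat_dim, time)
        pooled = feats.max(dim=-1).values        # global max-pooling over time
        return self.classifier(pooled)           # utterance-level logits for the loss
```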

mravanelli commented 5 years ago

Hi! The version implemented at the link below is closer to what you would like to do: https://github.com/santi-pdp/pase