mravanelli / SincNet

SincNet is a neural architecture for efficiently processing raw audio samples.
MIT License

Frequency response of sinc filters #79

Closed zehaitu closed 4 years ago

zehaitu commented 4 years ago

Hi,

We are very interested in how the frequency response changes for phoneme classification when narrow-band noise is added to the waveform. However, we could not reproduce the curves in the paper.

Here is what we obtained. Although there is a clear valley from 2 kHz to 2.5 kHz, a large dip around 500 Hz always appears, which differs from the figures in your paper. Moreover, this 500 Hz valley is present even when no noise is added to the input, and we could not find the reason behind it.

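[image: frequency response] https://user-images.githubusercontent.com/57004778/70439282-915e3980-1a87-11ea-9a3b-32cb916931e3.png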

Do you have any idea what might cause this problem? Or is it possible for you to release the frequency response code?

Many thanks!

mravanelli commented 4 years ago

Hi, in your case you are doing speech recognition, right? In the first SincNet paper (https://arxiv.org/abs/1808.00158) you can find results for speaker recognition only. In a follow-up study (https://arxiv.org/pdf/1811.09725.pdf) we also ran some speech recognition experiments. In particular, Fig. 5 shows the cumulative frequency response for phoneme classification on TIMIT (English). Note that in the figure there is a hole in the cumulative frequency response caused by the noise added in the band between 2000 and 2500 Hz. For speech recognition we didn't observe as many peaks as for speaker recognition. We just allocate more filters in the lower part of the spectrum. Note also that the filters learned by SincNet may change across tasks, languages, etc., and this is in part the strength of this approach.

Best,

Mirco
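For readers who want to reproduce such a curve, here is a minimal sketch in Python (NumPy only). It is not the authors' plotting code. It assumes that the cumulative frequency response is the normalized sum of the magnitude responses of all filters, and that each filter is a Hamming-windowed band-pass built as the difference of two low-pass sinc filters. The function names, the 251-tap kernel, the 16 kHz sample rate, and the FFT size are illustrative assumptions; in practice the cutoff frequencies would be read from a trained model.

```python
import numpy as np


def sinc_bandpass(f1_hz, f2_hz, kernel_size=251, sample_rate=16000):
    """Hamming-windowed band-pass impulse response with pass-band [f1_hz, f2_hz]."""
    # Symmetric time axis centred on zero (kernel_size is assumed odd).
    n = np.arange(kernel_size) - (kernel_size - 1) / 2
    f1 = f1_hz / sample_rate  # normalized cutoffs (cycles per sample)
    f2 = f2_hz / sample_rate
    # np.sinc(x) = sin(pi x) / (pi x), so 2 f sinc(2 f n) is an ideal
    # low-pass filter with cutoff f; the difference of two is a band-pass.
    h = 2 * f2 * np.sinc(2 * f2 * n) - 2 * f1 * np.sinc(2 * f1 * n)
    return h * np.hamming(kernel_size)


def cumulative_frequency_response(low_hz, high_hz, kernel_size=251,
                                  sample_rate=16000, n_fft=1024):
    """Return (freqs, response): the normalized sum of filter magnitude responses."""
    filters = np.stack([
        sinc_bandpass(lo, hi, kernel_size, sample_rate)
        for lo, hi in zip(low_hz, high_hz)
    ])
    # Magnitude response of each filter (zero-padded to n_fft points).
    magnitudes = np.abs(np.fft.rfft(filters, n=n_fft, axis=1))
    cumulative = magnitudes.sum(axis=0)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    # Normalize so that the largest value is 1.
    return freqs, cumulative / cumulative.max()
```

With a hypothetical filter bank whose pass-bands skip the 2000 to 2500 Hz range, the returned curve shows a hole in that range, which is the kind of effect discussed for Fig. 5. Plotting `response` against `freqs` for several checkpoints is one way to check whether a dip such as the one near 500 Hz persists during training.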

zehaitu commented 4 years ago

Hi,

Thank you so much for your reply. I am actually trying to replicate that curve from https://arxiv.org/pdf/1811.09725.pdf, so I am using TIMIT with noise in the band between 2000 and 2500 Hz. Other settings, such as the filter length and the number of filters, are the same as in the open-source SincNet code for phoneme classification in the pytorch-kaldi project. However, I still see this strange valley around 500 Hz.

By allocating more filters in the lower part of the spectrum, do you mean that this is done manually, or that the filters learn to move toward lower frequencies?

Many thanks!
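As a point of reference for the setup described above, band-limited noise between 2000 and 2500 Hz can be generated along the following lines. This is only a sketch under stated assumptions: the thread does not say how the noise was produced or at what level, so the FFT-masking approach, the function name `add_band_noise`, the SNR parameter, and the fixed seed are all hypothetical choices.

```python
import numpy as np


def add_band_noise(wave, sample_rate=16000, band_hz=(2000.0, 2500.0),
                   snr_db=0.0, seed=0):
    """Add Gaussian noise restricted to band_hz, scaled to the requested SNR."""
    wave = np.asarray(wave, dtype=float)
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(len(wave))

    # Zero every FFT bin outside the band, then go back to the time domain.
    spectrum = np.fft.rfft(noise)
    freqs = np.fft.rfftfreq(len(wave), d=1.0 / sample_rate)
    spectrum[(freqs < band_hz[0]) | (freqs > band_hz[1])] = 0.0
    band_noise = np.fft.irfft(spectrum, n=len(wave))

    noise_power = np.mean(band_noise ** 2)
    if noise_power == 0.0:
        raise ValueError("signal too short: no FFT bin falls inside band_hz")

    # Scale the noise so that signal_power / noise_power matches snr_db.
    signal_power = np.mean(wave ** 2)
    gain = np.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return wave + gain * band_noise
```

Masking in the frequency domain gives a sharp band edge with no filter design step, which makes it easy to confirm that any extra dip in the learned response (for example near 500 Hz) is not caused by noise leaking outside the intended band.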

mravanelli commented 4 years ago

Hi, it is the second one, because the cumulative frequency response is normally higher in the lower frequency range. The valley around 500 Hz is quite weird. We observed this kind of valley or odd peak only at the beginning of training, and they should then gradually disappear. How long have you trained the filters? What happens if you start the training from scratch?

Best,

Mirco

zehaitu commented 4 years ago

Hi Mirco,

This figure is from the model saved after the 20th epoch. I checked the models from the 5th, 10th, 15th, and 20th epochs, and they are pretty much the same. The network tends to overfit after 20 epochs, so I did not check the later models. I have not tried training from scratch yet, but I will give it a try and see what happens. Thanks for your advice!

Kavchch commented 3 years ago

Is it possible to plot the cumulative frequency response of the learnt filters of the first convolutional layer when a Mel spectrogram, rather than the raw waveform, is passed as the input feature to that layer?