mravanelli / SincNet

SincNet is a neural architecture for efficiently processing raw audio samples.
MIT License

Question about paper "Learning Speaker Representations with Mutual Information" #11

Closed aflyingnoob closed 5 years ago

aflyingnoob commented 5 years ago

Hello, I've been reading your paper and I'm a little curious about the calculation of the mutual information, specifically when the MI is rewritten as a KL divergence between the joint distribution and the product of the two marginals:

MI(z_1, z_2) = D_KL( p(z_1, z_2) || p(z_1) p(z_2) )

But how should I understand that, when we train the network, we sample (z_1, z_2) as the joint distribution and (z_1, z_rnd) as the other? What are the true joint distribution and the two marginal distributions? Thanks.

mravanelli commented 5 years ago

Hi, thank you very much for your interest in my latest paper on unsupervised learning with mutual information (https://arxiv.org/abs/1812.00271). You are right: (z_1, z_2) contains speech representations from the same speaker (i.e., it is a sample from the joint distribution), while (z_1, z_rnd) contains two speech representations from different speakers (i.e., it is a sample from the product of the marginal distributions).
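
A minimal sketch of how such pairs can be drawn from a list of single-speaker utterances (the helper name, chunk length, and data layout are assumptions for illustration, not the repo's data loader):

```python
import random
import torch

def sample_pairs(utterances, chunk_len=3200):
    """utterances: list of 1-D waveform tensors, one speaker per utterance.
    Returns a positive pair (a sample from the joint distribution: two chunks of
    the same utterance, hence the same speaker) and a negative pair (a sample
    from the product of marginals: chunks from two random utterances)."""
    def chunk(wav):
        start = random.randint(0, wav.shape[0] - chunk_len)
        return wav[start:start + chunk_len]

    utt = random.choice(utterances)
    pos = (chunk(utt), chunk(utt))       # -> (z_1, z_2) after the encoder
    other = random.choice(utterances)    # very likely a different speaker
    neg = (chunk(utt), chunk(other))     # -> (z_1, z_rnd) after the encoder
    return pos, neg

# toy usage with random "utterances" of 2 s at 16 kHz
pos, neg = sample_pairs([torch.randn(32000) for _ in range(10)])
```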

aflyingnoob commented 5 years ago

I also have a question about implementing this work. Is there anything wrong with my view? When training the network:

  1. the positive and negative samples are sent to D, which gives 2 numbers.
  2. then the JS-based estimate is used to calculate the loss.
  3. the result is multiplied by -1 (in PyTorch) to update D and the task network so that both maximize it simultaneously?

Thanks (^-^
mravanelli commented 5 years ago

Hi, in the following you find my remarks:

  1. The discriminator outputs a single number ranging between 0 and 1 (binary classification only needs one output).
  2. The loss is the standard binary cross-entropy. It can be proved that such a metric corresponds to the JS divergence.
  3. The game we play in the paper is cooperative and not adversarial. You don't have to multiply the gradient by -1 (see the sketch below).

Mirco
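
For concreteness, here is a minimal sketch of such a cooperative update (point 3 above). The tiny encoder and discriminator below are made-up stand-ins, not the architectures from the paper or this repo; the point is only that both are trained by minimizing the same BCE loss, with no gradient reversal:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# made-up stand-ins: an encoder mapping a waveform chunk to an embedding,
# and a discriminator g scoring a concatenated pair of embeddings in (0, 1)
encoder = nn.Sequential(nn.Linear(3200, 256), nn.LeakyReLU(), nn.Linear(256, 128))
g = nn.Sequential(nn.Linear(256, 1), nn.Sigmoid())

opt = torch.optim.RMSprop(list(encoder.parameters()) + list(g.parameters()), lr=1e-3)

def train_step(chunk_a, chunk_b, chunk_rnd):
    # chunk_a/chunk_b: chunks from the same sentence; chunk_rnd: from a random other one
    z1, z2, z_rnd = encoder(chunk_a), encoder(chunk_b), encoder(chunk_rnd)
    pos = g(torch.cat([z1, z2], dim=1)).squeeze(1)       # pushed towards 1
    neg = g(torch.cat([z1, z_rnd], dim=1)).squeeze(1)    # pushed towards 0
    scores = torch.cat([pos, neg])
    targets = torch.cat([torch.ones_like(pos), torch.zeros_like(neg)])
    loss = F.binary_cross_entropy(scores, targets)       # minimizing BCE maximizes the JS bound
    opt.zero_grad()
    loss.backward()   # no sign flip: encoder and discriminator move in the same direction
    opt.step()
    return loss.item()

# toy call with random "waveforms" of 3200 samples (200 ms at 16 kHz)
print(train_step(torch.randn(8, 3200), torch.randn(8, 3200), torch.randn(8, 3200)))
```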

aflyingnoob commented 5 years ago

Oh, I see. The objective is to be maximized, so my first thought was to multiply the loss by -1. But the loss below can be computed directly with the BCE loss (BCE already includes the extra multiplication by -1 compared to the loss function below), so calling backward() and optimizer.step() directly performs the maximization. Thanks for your reply.

L(Θ, Φ) = E_Xp[ log(g(z_1, z_2)) ] + E_Xn[ log(1 − g(z_1, z_rnd)) ]
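
A quick numerical check of that equivalence, with made-up discriminator scores that are assumed to be already squashed into (0, 1):

```python
import torch
import torch.nn.functional as F

# made-up scores g(z_1, z_2) for positive pairs and g(z_1, z_rnd) for negative pairs
g_pos = torch.tensor([0.9, 0.7, 0.8])
g_neg = torch.tensor([0.2, 0.4, 0.1])

# the objective above: L = E_Xp[log g(z_1, z_2)] + E_Xn[log(1 - g(z_1, z_rnd))]
L = torch.log(g_pos).mean() + torch.log(1 - g_neg).mean()

# standard BCE with target 1 for positives and 0 for negatives
bce = F.binary_cross_entropy(torch.cat([g_pos, g_neg]),
                             torch.cat([torch.ones(3), torch.zeros(3)]))

# identical up to the factor 2 from averaging positives and negatives together,
# so maximizing L is the same as minimizing the BCE loss
print(L.item(), (-2 * bce).item())
```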

Lotrea commented 5 years ago

Hi, I'm reading your paper and I'm curious about the experimental results. Have you tried adding the sinc layer to other SOTA networks such as ResNet or SENet? How does it perform? Can I consider the sinc layer a general way to improve CNN performance when dealing with time series? Looking forward to your reply. Thanks~

mravanelli commented 5 years ago

Hi, thank you very much for your interest in my research. To better answer, let me distinguish between the architecture of the encoder (SincNet) and the method for learning speaker-id representations (i.e., the sampling strategy coupled with a discriminator that maximizes the mutual information). The first is a general architecture for learning directly from audio and speech waveforms (it can be used for both supervised and unsupervised learning). In particular, SincNet uses sinc-based filters in the first convolutional layer; after that, the user can employ anything, including other convolutional or fully connected layers (e.g., ResNet blocks) or recurrent layers (e.g., LSTM or GRU). The second is a method that turned out to derive a representation that very clearly highlights speaker-id information. We are now working to extend this work to capture other interesting aspects of the speech signal, including, for instance, phonemes and prosody information.
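
To make the first point concrete, here is a simplified sketch of a sinc-based first convolutional layer with arbitrary layers on top. This is an illustrative re-derivation of the idea; the initialization and dimensions are made up, and it is not the implementation from this repo:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SincConv1d(nn.Module):
    """Band-pass conv layer where only each filter's low and high cutoff
    frequencies are learned (simplified, illustrative version)."""
    def __init__(self, out_channels=80, kernel_size=251, sample_rate=16000):
        super().__init__()
        self.kernel_size, self.sample_rate = kernel_size, sample_rate
        # illustrative init: low cutoffs spread over the spectrum, 100 Hz bands
        self.low_hz = nn.Parameter(torch.linspace(30.0, sample_rate / 2 - 200.0, out_channels))
        self.band_hz = nn.Parameter(torch.full((out_channels,), 100.0))
        n = torch.arange(-(kernel_size // 2), kernel_size // 2 + 1, dtype=torch.float32)
        self.register_buffer("t", n / sample_rate)                 # time axis in seconds
        self.register_buffer("window", torch.hamming_window(kernel_size))

    def forward(self, x):                                          # x: (batch, 1, time)
        low = torch.abs(self.low_hz)
        high = torch.clamp(low + torch.abs(self.band_hz), max=self.sample_rate / 2)
        t = self.t.unsqueeze(0)                                    # (1, kernel_size)
        # ideal low-pass impulse response with cutoff f: 2f * sinc(2f t)
        lp_low = 2 * low.unsqueeze(1) * torch.sinc(2 * low.unsqueeze(1) * t)
        lp_high = 2 * high.unsqueeze(1) * torch.sinc(2 * high.unsqueeze(1) * t)
        filters = (lp_high - lp_low) * self.window                 # band-pass, Hamming-windowed
        return F.conv1d(x, filters.unsqueeze(1), padding=self.kernel_size // 2)

# the rest of the network can be anything, e.g. a small GRU on top of the sinc layer
frontend = SincConv1d()
gru = nn.GRU(input_size=80, hidden_size=128, batch_first=True)
feats = frontend(torch.randn(4, 1, 16000)).transpose(1, 2)        # (batch, time, channels)
out, _ = gru(feats)
```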

Lotrea commented 5 years ago

Thanks for your quick reply. I understand that the sinc-based filters can capture features from raw data with better interpretability, and that this achieved better classification performance on TIMIT and LibriSpeech. But I'm not sure whether it is universal for other time-series classification problems when added to various neural networks (e.g., vanilla GRU vs. vanilla GRU with a sinc filter). Do you have any verification results?

Sorry for my previous question description: when I used the "<>" symbols to quote the "sinc" function, the word went missing. I've edited my question. Thanks again.

mravanelli commented 5 years ago

I actually never tried with time series different from audio and speech. My feeling is that it could work even for other types of time series...

Mirco

Lotrea commented 5 years ago

@mravanelli Thank you, I may try it. Merry Christmas~

mravanelli commented 5 years ago

Sure, please keep me updated!

Best,

Mirco

jiaxuehu commented 5 years ago

In your paper, the NCE loss is written as

L(Θ, Φ) = E_X[ g(z_1, z_2) − log( g(z_1, z_2) + Σ_Xn exp(g(z_1, z_rnd)) ) ]

Should it instead be

L(Θ, Φ) = E_X[ g(z_1, z_2) − log( exp(g(z_1, z_2)) + Σ_Xn exp(g(z_1, z_rnd)) ) ]

Is that right?

mravanelli commented 5 years ago

yes, that's right.
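
In code, the corrected expression can be evaluated with a numerically stable logsumexp; a small sketch (the helper name and tensor shapes are assumptions, not code from the repo):

```python
import torch

def nce_bound(score_pos, scores_neg):
    """score_pos: (batch,) discriminator scores g(z_1, z_2) for positive pairs.
    scores_neg: (batch, K) scores g(z_1, z_rnd) for K negative pairs per anchor.
    Returns the NCE bound to maximize (use its negative as the training loss)."""
    all_scores = torch.cat([score_pos.unsqueeze(1), scores_neg], dim=1)   # (batch, K+1)
    # log(exp(g_pos) + sum_k exp(g_neg_k)), computed stably
    return (score_pos - torch.logsumexp(all_scores, dim=1)).mean()

# toy example: 8 anchors, 5 negatives each
print(nce_bound(torch.randn(8), torch.randn(8, 5)))
```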

aflyingnoob commented 5 years ago

Hello, the JS-bound mutual information estimate lies between [0, 1], which is convenient for maximizing or minimizing without worrying about its numerical value. But with the JS-bound estimate, can we regard 0 as representing low information and 1 as representing high information? And can g be just 2 or 3 fully connected layers fitting a function?

mravanelli commented 5 years ago

"Can we regard the 0 represent low infomation and 1 for high infomation?" Yes, exact. "g can be just 2 or 3 fully connected layer to fitting a function? "In our case, the discriminator g is a single layer MLP. The idea is to employ a very simple discriminator to encourage the encoder to learn more meaningful representations.

Mirco
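
One possible reading of that in code, with made-up dimensions (a sketch, not the discriminator from the repo):

```python
import torch
import torch.nn as nn

class PairDiscriminator(nn.Module):
    """Single-hidden-layer MLP scoring a pair of embeddings with a value in (0, 1)."""
    def __init__(self, emb_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * emb_dim, hidden), nn.LeakyReLU(),   # the single hidden layer
            nn.Linear(hidden, 1), nn.Sigmoid(),               # output in (0, 1)
        )

    def forward(self, z1, z2):
        return self.net(torch.cat([z1, z2], dim=1)).squeeze(1)

g = PairDiscriminator()
print(g(torch.randn(4, 128), torch.randn(4, 128)))   # four scores in (0, 1)
```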

c810armyHuan commented 5 years ago

Hi @mravanelli

In the comment https://github.com/mravanelli/SincNet/issues/11#issuecomment-447008806

(z_1,z_2) contains speech representations from the same speaker (i.e., is a sample from the joint distribution), while (z_1,z_rnd) contains two speech representations from different speakers

While in the paper it says

the positive samples (z_1, z_2) are simply derived by randomly sampling speech chunks from the same sentence.

In the stage of unsupervised learning, are the speaker ids of the utterances used to make sure all utterances in a batch are from different speakers? (so that all z_1 and z_rnd are from different speakers)

It seems that (z_1, z_2) and (z_1, z_rnd) are from the same distribution (both pairs correspond to chunks from one speaker) when all sentences in a batch correspond to the same speaker.

Do I misunderstand the purely unsupervised approach in the experiments from the paper? Does the unsupervised setting use the speaker ids?

mravanelli commented 5 years ago

Hi, thank you very much for giving me the chance to clarify this aspect (even the current version of the paper is not 100% clear on this, and I will have to explain it better in a possible camera-ready version of the work). Our approach is unsupervised because we don't actually use any speaker-id label. We only rely on the following (more than reasonable) assumptions, which often hold in practice:

1- If I randomly sample two chunks from the same speech signal, these chunks belong to the same speaker (i.e., there is only one speaker in each speech signal). This assumption naturally holds for most of the speech datasets we have (e.g., LibriSpeech, TIMIT, ...). It also holds in many practical applications: think, for instance, of devices like Google Home or Amazon Alexa, where a given interaction normally involves a single speaker at a time. The assumption might not hold if the input consists of very long audio sequences where multiple speakers are talking (e.g., a recording of a newscast). We didn't address the latter situation in the current paper, but if you constrain the system to take two random chunks that are reasonably close in time, you likely sample from the same speaker and can still learn something useful.

2- If we sample two random chunks from random utterances, they likely belong to different speakers. This is another assumption that holds in practice if the dataset is large enough. Even with a small dataset like TIMIT (462 speakers), the probability of sampling from different speakers is about 1 − 1/462 ≈ 0.998. We can thus have some negative samples composed of the same speaker, but these are so rare that they do not harm the performance at all.
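
A back-of-the-envelope check of point 2, assuming speakers are sampled roughly uniformly (an assumption made only for illustration):

```python
n_speakers = 462                 # TIMIT training speakers
p_diff = 1 - 1 / n_speakers      # chance that a random negative pair mixes two speakers
print(f"{p_diff:.4f}")           # ~0.9978
```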

I hope my answers are clear enough to clarify this aspect of my work. Please let me know if you have other questions or comments!

Best,

Mirco

c810armyHuan commented 5 years ago

Hi @mravanelli

I've got it.

Thank you for clarifying that it's rare to get noisy training data (z in (z_1, z_rnd) from the same speaker) when the number of speakers (462 in TIMIT) is much greater than the number of utterances per speaker in the training dataset.

gancx commented 1 year ago

Thanks for your work. It's really helpful for understanding the framework. I just wonder where I can find the implementation of this paper. You mentioned that you released the code in the PyTorch-Kaldi toolkit, but I can't find it there. Thank you.