mravanelli / SincNet

SincNet is a neural architecture for efficiently processing raw audio samples.
MIT License

How to divide the test set? #53

Closed Range0122 closed 4 years ago

Range0122 commented 5 years ago

Hello! First of all, thanks for sharing the code of your paper, it's really fantastic work! But I'm quite confused about how to test my own model. When testing the model in speaker identification, we should divide the test set into two parts: some of the data is used for enrollment and the rest is used for testing. For example, if there are ten sentences per speaker, it may not be appropriate to use nine of the sentences for enrollment and one for testing, since the model can learn a lot from the nine sentences and it becomes easy to make a correct prediction on the remaining one at test time. In that setting, the accuracy might be higher than it truly should be, which is not what I want. I read your code carefully but didn't find the answer, sorry about that. :( So could you please tell me how you divide the test set?

mravanelli commented 5 years ago

Hi, I think using 80-90% of the material for training and the rest for test is commonly considered good practice. Actually, in our Librispeech/TIMIT experiments we only used 10-15 seconds of training material for each speaker and the rest of the material for test. These datasets are rather easy for speaker ID, and it is normal to get high accuracy even with a few seconds of speech per speaker. If you want to try your technique on a more realistic dataset, I would suggest using VoxCeleb, where you can already find the standard splits to use.
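If it helps to make the per-speaker split concrete, here is a minimal sketch (not the data-preparation code of this repo; the utterance-list format and the `split_per_speaker` helper are just assumptions for illustration) of keeping roughly 10-15 seconds of audio per speaker for training and leaving the remaining utterances of the same speakers for test:

```python
import random
from collections import defaultdict

def split_per_speaker(utterances, train_seconds=15.0, seed=0):
    """Split (speaker_id, wav_path, duration_s) tuples so that roughly
    `train_seconds` of audio per speaker goes to training and the
    remaining utterances of the same speakers go to test."""
    random.seed(seed)
    by_spk = defaultdict(list)
    for spk, path, dur in utterances:
        by_spk[spk].append((path, dur))

    train, test = [], []
    for spk, utts in by_spk.items():
        random.shuffle(utts)
        budget = train_seconds
        for path, dur in utts:
            if budget > 0:
                train.append((spk, path))
                budget -= dur
            else:
                test.append((spk, path))
    return train, test
```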

Best,

Mirco


Range0122 commented 5 years ago

Thanks for your response, but I think you might have misunderstood me. For example, suppose we have 10 speakers and each of them has 10 sentences. I select 8 speakers for training and 2 speakers for testing. After training, the 2 speakers are unknown to our model, while the 8 speakers used for training are known to it. Now focus on the 2 speakers in the test. In speaker identification there is a process called enrollment (or maybe annotation :) ). We usually select some sentences for enrollment and the rest for testing whether the model can give the right prediction. So what I want to know is: when I test my model with the 20 (10x2=20) sentences of the 2 speakers, how should I divide those 20 sentences into the two parts? The result will be better if I select 9 sentences for enrollment and 1 for testing than if I use 5 for enrollment and 5 for testing, because the model can learn more about a speaker from 9 sentences than from 5. I hope I have described it clearly. :)

And a related question: in speaker verification, how do you divide the test set into target speakers and non-target speakers (or impostors/attackers)? XD

Best, Range

mravanelli commented 4 years ago

Hi, are you talking about speaker identification or speaker verification? In speaker verification you have training, enrollment, and test speakers (the training speakers are different from the enrollment and test ones), while in speaker identification you have training and test data (where the test data comes from the same speakers as the training data).
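Just to make the difference between the two setups concrete, here is a toy sketch (speaker and file names are made up for illustration, this is not code from the repo):

```python
# Hypothetical toy data: 4 speakers, 4 utterances each.
utterances = {f"spk{i}": [f"spk{i}_utt{j}.wav" for j in range(4)] for i in range(4)}

# Speaker identification: same speakers in train and test, different utterances.
id_train = {spk: utts[:3] for spk, utts in utterances.items()}
id_test  = {spk: utts[3:] for spk, utts in utterances.items()}

# Speaker verification: training speakers are disjoint from the enrollment/test
# speakers; enrollment and test use different utterances of the held-out speakers.
train_speakers = ["spk0", "spk1"]
eval_speakers  = ["spk2", "spk3"]
ver_train  = {s: utterances[s] for s in train_speakers}
ver_enroll = {s: utterances[s][:2] for s in eval_speakers}
ver_test   = {s: utterances[s][2:] for s in eval_speakers}
```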

Range0122 commented 4 years ago

Hi, I think I'm talking about speaker identification. So you mean that in speaker identification, after training the model on the training set, we just test on different utterances from the same speakers? That seems... somehow like a trick? 😂 From my point of view, the speakers of the training set and the test set should be different, so there should be an enrollment step when we test the model. For example, suppose there are 10 utterances for each speaker in the test set, and we use 6 for enrollment and 4 for testing. We use utterances from speakers not seen in training because we want to make sure the model can reach high accuracy even when it is given only a little data. If we just use the same speakers, the model might overfit to them, and it is obviously easy to get good accuracy, but that's not what I want.

mravanelli commented 4 years ago

Hi, normally speaker identification is a closed-set problem (i.e., choosing between one of N known speakers), while speaker verification (i.e., checking the identity of a claimed speaker) is open-set. Clearly, any closed-set problem is much easier than an open-set one, and this is why speaker recognition systems are often evaluated in a speaker verification setting (for some SincNet-based speaker verification results on VoxCeleb you can take a look at the paper "Learning Speaker Representations with Mutual Information"). Even if you have enrollment data to adapt your network as you pointed out, in the end the problem is the same (you are still classifying a speaker within a closed set). In practice, I would suggest switching from speaker identification to speaker verification if your goal is to evaluate your technique in a much more challenging scenario.
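As a rough illustration of how a verification-style evaluation works (a minimal sketch with placeholder names, not the code used in the paper), you typically average the enrollment embeddings of the claimed speaker and compare the test embedding against that speaker model with a cosine score and a decision threshold:

```python
import numpy as np

def cosine_score(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def verify(enroll_embeddings, test_embedding, threshold=0.5):
    """Accept the claimed identity if the cosine similarity between the test
    embedding and the mean of the enrollment embeddings exceeds `threshold`.
    `enroll_embeddings` is an (N, D) array of speaker embeddings (e.g. the
    activations of a hidden layer) for the claimed speaker."""
    speaker_model = np.mean(enroll_embeddings, axis=0)
    score = cosine_score(speaker_model, test_embedding)
    return score >= threshold, score
```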

Mirco



Range0122 commented 4 years ago

Thank you very much, I got it!