qinxiaoyi / Simple-Attention-Module-based-Speaker-Verification-with-Iterative-Noisy-Label-Detection


About the ASP in the model #1

Open Hunterhuan opened 3 years ago

Hunterhuan commented 3 years ago

According to the paper, "SIMPLE ATTENTION MODULE BASED SPEAKER VERIFICATION WITH ITERATIVE NOISY LABEL DETECTION", SimAM-ResNet34-ASP has 25.21M parameters, 3.67M more than SimAM-ResNet34-GSP. I have implemented SimAM-ResNet34-GSP correctly, with 21.54M parameters. However, when I replace the GSP with the ASP provided in this repository, the parameter count is 22.16M, which differs from the 25.21M reported in the paper.

```python
self.attention = nn.Sequential(
    nn.Conv1d(2560, 128, kernel_size=1),
    nn.ReLU(),
    nn.BatchNorm1d(128),
    nn.Conv1d(128, 2560, kernel_size=1),
    nn.Softmax(dim=2),
)
```

Is there anything different in my implementation?
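For reference, this is roughly how I counted the parameters (just a small helper I wrote; `model` stands in for the full SimAM-ResNet34 network, which is not shown here):

```python
import torch.nn as nn

def count_params_m(model: nn.Module) -> float:
    """Count the trainable parameters of a model, in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# e.g. print(f"{count_params_m(model):.2f}M")
# -> 22.16 with the ASP above, vs. 25.21 reported in the paper
```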

What's more, I cannot reach the baseline result of ResNet34, 0.851% on the VoxCeleb-O test set. Could you please give me a brief description of the optimizer parameters and training strategy?

Thank you very much!

qinxiaoyi commented 3 years ago

Thank you for your attention. I think the dimensions of the acoustic features and speaker embeddings we used are different. We adopt 80-dimensional Mel-filterbanks, and the speaker embedding dimension is 256. If we feed a [1, 1, 80, 100] acoustic feature into the neural network, the output feature map size is [1, 512, 10, 13]. Therefore, self.attention should be

```python
self.attention = nn.Sequential(
    nn.Conv1d(5120, 128, kernel_size=1),
    nn.ReLU(),
    nn.BatchNorm1d(128),
    nn.Conv1d(128, 5120, kernel_size=1),
    nn.Softmax(dim=2),
)
```
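You can verify the shapes with a quick check like the following (a rough sketch, not our exact pooling code; the weighted mean/std combination is the standard attentive statistics pooling formulation):

```python
import torch
import torch.nn as nn

attention = nn.Sequential(
    nn.Conv1d(5120, 128, kernel_size=1),
    nn.ReLU(),
    nn.BatchNorm1d(128),
    nn.Conv1d(128, 5120, kernel_size=1),
    nn.Softmax(dim=2),
)

feat_map = torch.randn(1, 512, 10, 13)   # [B, C, F, T] from ResNet34 with 80-dim Fbank input
x = feat_map.reshape(1, 512 * 10, 13)    # flatten channel x frequency: [1, 5120, 13]
w = attention(x)                         # attention weights over the time axis
mu = torch.sum(x * w, dim=2)             # attention-weighted mean
sigma = torch.sqrt((torch.sum(x ** 2 * w, dim=2) - mu ** 2).clamp(min=1e-5))
pooled = torch.cat([mu, sigma], dim=1)   # [1, 10240], then projected to the 256-dim embedding
print(pooled.shape)                      # torch.Size([1, 10240])
```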

I guess the dimensions of your acoustic feature and speaker embedding are 40 and 128, respectively.

As for the training strategy, the training details are reported in our paper and the referenced papers. In addition, we adopt the VoxCeleb (cleaned) trial list (as reported in Section 4.1.1).

I hope the above helps you.

Hunterhuan commented 3 years ago

Thank you for your detailed answer. I have implemented the correct model.

However, I still cannot reproduce the results. The optimizer I use is SGD, with the learning rate decaying exponentially from 0.01 to 0.00001 (roughly as sketched below). I guess it is my training strategy that leads to serious over-fitting of the model. o(╥﹏╥)o
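Concretely, my setup looks roughly like this (my own sketch, not from this repository; `model` and `num_epochs` are placeholders, and the momentum/weight-decay values are my guesses):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)   # placeholder for the ResNet34 speaker network
num_epochs = 150            # placeholder epoch count

optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9, weight_decay=1e-4)
# per-epoch factor so the learning rate decays from 1e-2 down to 1e-5 over num_epochs epochs
gamma = (1e-5 / 1e-2) ** (1.0 / num_epochs)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)

for epoch in range(num_epochs):
    # ... one training epoch over the mini-batches ...
    scheduler.step()
```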

According to the paper, "The detail of other training strategy, hyper parameters and models configuration follows [21, 20]." Reference [21] is a technical report about the systems for VoxSRC, where track 1 is speaker verification, but I cannot find any description of the optimizer or learning rate there. Reference [20] is a technical report for SdSVC, and similarly I cannot find the relevant description.

If possible, could you please tell me more about the training details? Thank you very much!

Looking forward to your reply. Kind regards.