pyannote / pyannote-audio

Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding
http://pyannote.github.io
MIT License

Hi, I'm trying to use a newer wespeaker voice model like the one shown in the picture, but I can't adapt pyannote/audio/models/embedding/wespeaker/convert.py to it; it raises the following error. What should I change? #1772

Open LiLiWangzz opened 1 month ago

LiLiWangzz commented 1 month ago
Hi, I'm trying to use a newer wespeaker voice model like the one shown in the picture. I followed pyannote/audio/models/embedding/wespeaker/convert.py but can't adapt it to this model; it raises the following error. What should I change?

@hbredin WESPEAKER raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format( RuntimeError: Error(s) in loading state_dict for ResNet: Missing key(s) in state_dict: "conv1.weight", "bn1.weight", "bn1.bias", "bn1.running_mean", "bn1.running_var", "layer1.0.conv1.weight", "layer1.0.bn1.weight", "layer1.0.bn1.bias", "layer1.0.bn1.running_mean", "layer1.0.bn1.running_var", "layer1.0.conv2.weight", "layer1.0.bn2.weight", "layer1.0.bn2.bias", "layer1.0.bn2.running_mean", "layer1.0.bn2.running_var", "layer1.1.conv1.weight", "layer1.1.bn1.weight", "layer1.1.bn1.bias", "layer1.1.bn1.running_mean", "layer1.1.bn1.running_var", "layer1.1.conv2.weight", "layer1.1.bn2.weight", "layer1.1.bn2.bias", "layer1.1.bn2.running_mean", "layer1.1.bn2.running_var", "layer1.2.conv1.weight", "layer1.2.bn1.weight", "layer1.2.bn1.bias", "layer1.2.bn1.running_mean", "layer1.2.bn1.running_var", "layer1.2.conv2.weight", "layer1.2.bn2.weight", "layer1.2.bn2.bias", "layer1.2.bn2.running_mean", "layer1.2.bn2.running_var", "layer2.0.conv1.weight", "layer2.0.bn1.weight", "layer2.0.bn1.bias", "layer2.0.bn1.running_mean", "layer2.0.bn1.running_var", "layer2.0.conv2.weight", "layer2.0.bn2.weight", "layer2.0.bn2.bias", "layer2.0.bn2.running_mean", "layer2.0.bn2.running_var", "layer2.0.shortcut.0.weight", "layer2.0.shortcut.1.weight", "layer2.0.shortcut.1.bias", "layer2.0.shortcut.1.running_mean", "layer2.0.shortcut.1.running_var", "layer2.1.conv1.weight", "layer2.1.bn1.weight", "layer2.1.bn1.bias", "layer2.1.bn1.running_mean", "layer2.1.bn1.running_var", "layer2.1.conv2.weight", "layer2.1.bn2.weight", "layer2.1.bn2.bias", "layer2.1.bn2.running_mean", "layer2.1.bn2.running_var", "layer2.2.conv1.weight", "layer2.2.bn1.weight", "layer2.2.bn1.bias", "layer2.2.bn1.running_mean", "layer2.2.bn1.running_var", "layer2.2.conv2.weight", "layer2.2.bn2.weight", "layer2.2.bn2.bias", "layer2.2.bn2.running_mean", "layer2.2.bn2.running_var", "layer2.3.conv1.weight", "layer2.3.bn1.weight", 
"layer2.3.bn1.bias", "layer2.3.bn1.running_mean", "layer2.3.bn1.running_var", "layer2.3.conv2.weight", "layer2.3.bn2.weight", "layer2.3.bn2.bias", "layer2.3.bn2.running_mean", "layer2.3.bn2.running_var", "layer3.0.conv1.weight", "layer3.0.bn1.weight", "layer3.0.bn1.bias", "layer3.0.bn1.running_mean", "layer3.0.bn1.running_var", "layer3.0.conv2.weight", "layer3.0.bn2.weight", "layer3.0.bn2.bias", "layer3.0.bn2.running_mean", "layer3.0.bn2.running_var", "layer3.0.shortcut.0.weight", "layer3.0.shortcut.1.weight", "layer3.0.shortcut.1.bias", "layer3.0.shortcut.1.running_mean", "layer3.0.shortcut.1.running_var", "layer3.1.conv1.weight", "layer3.1.bn1.weight", "layer3.1.bn1.bias", "layer3.1.bn1.running_mean", "layer3.1.bn1.running_var", "layer3.1.conv2.weight", "layer3.1.bn2.weight", "layer3.1.bn2.bias", "layer3.1.bn2.running_mean", "layer3.1.bn2.running_var", "layer3.2.conv1.weight", "layer3.2.bn1.weight", "layer3.2.bn1.bias", "layer3.2.bn1.running_mean", "layer3.2.bn1.running_var", "layer3.2.conv2.weight", "layer3.2.bn2.weight", "layer3.2.bn2.bias", "layer3.2.bn2.running_mean", "layer3.2.bn2.running_var", "layer3.3.conv1.weight", "layer3.3.bn1.weight", "layer3.3.bn1.bias", "layer3.3.bn1.running_mean", "layer3.3.bn1.running_var", "layer3.3.conv2.weight", "layer3.3.bn2.weight", "layer3.3.bn2.bias", "layer3.3.bn2.running_mean", "layer3.3.bn2.running_var", "layer3.4.conv1.weight", "layer3.4.bn1.weight", "layer3.4.bn1.bias", "layer3.4.bn1.running_mean", "layer3.4.bn1.running_var", "layer3.4.conv2.weight", "layer3.4.bn2.weight", "layer3.4.bn2.bias", "layer3.4.bn2.running_mean", "layer3.4.bn2.running_var", "layer3.5.conv1.weight", "layer3.5.bn1.weight", "layer3.5.bn1.bias", "layer3.5.bn1.running_mean", "layer3.5.bn1.running_var", "layer3.5.conv2.weight", "layer3.5.bn2.weight", "layer3.5.bn2.bias", "layer3.5.bn2.running_mean", "layer3.5.bn2.running_var", "layer4.0.conv1.weight", "layer4.0.bn1.weight", "layer4.0.bn1.bias", "layer4.0.bn1.running_mean", 
"layer4.0.bn1.running_var", "layer4.0.conv2.weight", "layer4.0.bn2.weight", "layer4.0.bn2.bias", "layer4.0.bn2.running_mean", "layer4.0.bn2.running_var", "layer4.0.shortcut.0.weight", "layer4.0.shortcut.1.weight", "layer4.0.shortcut.1.bias", "layer4.0.shortcut.1.running_mean", "layer4.0.shortcut.1.running_var", "layer4.1.conv1.weight", "layer4.1.bn1.weight", "layer4.1.bn1.bias", "layer4.1.bn1.running_mean", "layer4.1.bn1.running_var", "layer4.1.conv2.weight", "layer4.1.bn2.weight", "layer4.1.bn2.bias", "layer4.1.bn2.running_mean", "layer4.1.bn2.running_var", "layer4.2.conv1.weight", "layer4.2.bn1.weight", "layer4.2.bn1.bias", "layer4.2.bn1.running_mean", "layer4.2.bn1.running_var", "layer4.2.conv2.weight", "layer4.2.bn2.weight", "layer4.2.bn2.bias", "layer4.2.bn2.running_mean", "layer4.2.bn2.running_var", "seg_1.weight", "seg_1.bias". Unexpected key(s) in state_dict: "front.conv1.weight", "front.bn1.weight", "front.bn1.bias", "front.bn1.running_mean", "front.bn1.running_var", "front.bn1.num_batches_tracked", "front.layer1.0.conv1.weight", "front.layer1.0.bn1.weight", "front.layer1.0.bn1.bias", "front.layer1.0.bn1.running_mean", "front.layer1.0.bn1.running_var", "front.layer1.0.bn1.num_batches_tracked", "front.layer1.0.conv2.weight", "front.layer1.0.bn2.weight", "front.layer1.0.bn2.bias", "front.layer1.0.bn2.running_mean", "front.layer1.0.bn2.running_var", "front.layer1.0.bn2.num_batches_tracked", "front.layer1.1.conv1.weight", "front.layer1.1.bn1.weight", "front.layer1.1.bn1.bias", "front.layer1.1.bn1.running_mean", "front.layer1.1.bn1.running_var", "front.layer1.1.bn1.num_batches_tracked", "front.layer1.1.conv2.weight", "front.layer1.1.bn2.weight", "front.layer1.1.bn2.bias", "front.layer1.1.bn2.running_mean", "front.layer1.1.bn2.running_var", "front.layer1.1.bn2.num_batches_tracked", "front.layer1.2.conv1.weight", "front.layer1.2.bn1.weight", "front.layer1.2.bn1.bias", "front.layer1.2.bn1.running_mean", "front.layer1.2.bn1.running_var", 
"front.layer1.2.bn1.num_batches_tracked", "front.layer1.2.conv2.weight", "front.layer1.2.bn2.weight", "front.layer1.2.bn2.bias", "front.layer1.2.bn2.running_mean", "front.layer1.2.bn2.running_var", "front.layer1.2.bn2.num_batches_tracked", "front.layer2.0.conv1.weight", "front.layer2.0.bn1.weight", "front.layer2.0.bn1.bias", "front.layer2.0.bn1.running_mean", "front.layer2.0.bn1.running_var", "front.layer2.0.bn1.num_batches_tracked", "front.layer2.0.conv2.weight", "front.layer2.0.bn2.weight", "front.layer2.0.bn2.bias", "front.layer2.0.bn2.running_mean", "front.layer2.0.bn2.running_var", "front.layer2.0.bn2.num_batches_tracked", "front.layer2.0.downsample.0.weight", "front.layer2.0.downsample.1.weight", "front.layer2.0.downsample.1.bias", "front.layer2.0.downsample.1.running_mean", "front.layer2.0.downsample.1.running_var", "front.layer2.0.downsample.1.num_batches_tracked", "front.layer2.1.conv1.weight", "front.layer2.1.bn1.weight", "front.layer2.1.bn1.bias", "front.layer2.1.bn1.running_mean", "front.layer2.1.bn1.running_var", "front.layer2.1.bn1.num_batches_tracked", "front.layer2.1.conv2.weight", "front.layer2.1.bn2.weight", "front.layer2.1.bn2.bias", "front.layer2.1.bn2.running_mean", "front.layer2.1.bn2.running_var", "front.layer2.1.bn2.num_batches_tracked", "front.layer2.2.conv1.weight", "front.layer2.2.bn1.weight", "front.layer2.2.bn1.bias", "front.layer2.2.bn1.running_mean", "front.layer2.2.bn1.running_var", "front.layer2.2.bn1.num_batches_tracked", "front.layer2.2.conv2.weight", "front.layer2.2.bn2.weight", "front.layer2.2.bn2.bias", "front.layer2.2.bn2.running_mean", "front.layer2.2.bn2.running_var", "front.layer2.2.bn2.num_batches_tracked", "front.layer2.3.conv1.weight", "front.layer2.3.bn1.weight", "front.layer2.3.bn1.bias", "front.layer2.3.bn1.running_mean", "front.layer2.3.bn1.running_var", "front.layer2.3.bn1.num_batches_tracked", "front.layer2.3.conv2.weight", "front.layer2.3.bn2.weight", "front.layer2.3.bn2.bias", "front.layer2.3.bn2.running_mean", 
"front.layer2.3.bn2.running_var", "front.layer2.3.bn2.num_batches_tracked", "front.layer3.0.conv1.weight", "front.layer3.0.bn1.weight", "front.layer3.0.bn1.bias", "front.layer3.0.bn1.running_mean", "front.layer3.0.bn1.running_var", "front.layer3.0.bn1.num_batches_tracked", "front.layer3.0.conv2.weight", "front.layer3.0.bn2.weight", "front.layer3.0.bn2.bias", "front.layer3.0.bn2.running_mean", "front.layer3.0.bn2.running_var", "front.layer3.0.bn2.num_batches_tracked", "front.layer3.0.downsample.0.weight", "front.layer3.0.downsample.1.weight", "front.layer3.0.downsample.1.bias", "front.layer3.0.downsample.1.running_mean", "front.layer3.0.downsample.1.running_var", "front.layer3.0.downsample.1.num_batches_tracked", "front.layer3.1.conv1.weight", "front.layer3.1.bn1.weight", "front.layer3.1.bn1.bias", "front.layer3.1.bn1.running_mean", "front.layer3.1.bn1.running_var", "front.layer3.1.bn1.num_batches_tracked", "front.layer3.1.conv2.weight", "front.layer3.1.bn2.weight", "front.layer3.1.bn2.bias", "front.layer3.1.bn2.running_mean", "front.layer3.1.bn2.running_var", "front.layer3.1.bn2.num_batches_tracked", "front.layer3.2.conv1.weight", "front.layer3.2.bn1.weight", "front.layer3.2.bn1.bias", "front.layer3.2.bn1.running_mean", "front.layer3.2.bn1.running_var", "front.layer3.2.bn1.num_batches_tracked", "front.layer3.2.conv2.weight", "front.layer3.2.bn2.weight", "front.layer3.2.bn2.bias", "front.layer3.2.bn2.running_mean", "front.layer3.2.bn2.running_var", "front.layer3.2.bn2.num_batches_tracked", "front.layer3.3.conv1.weight", "front.layer3.3.bn1.weight", "front.layer3.3.bn1.bias", "front.layer3.3.bn1.running_mean", "front.layer3.3.bn1.running_var", "front.layer3.3.bn1.num_batches_tracked", "front.layer3.3.conv2.weight", "front.layer3.3.bn2.weight", "front.layer3.3.bn2.bias", "front.layer3.3.bn2.running_mean", "front.layer3.3.bn2.running_var", "front.layer3.3.bn2.num_batches_tracked", "front.layer3.4.conv1.weight", "front.layer3.4.bn1.weight", "front.layer3.4.bn1.bias", 
"front.layer3.4.bn1.running_mean", "front.layer3.4.bn1.running_var", "front.layer3.4.bn1.num_batches_tracked", "front.layer3.4.conv2.weight", "front.layer3.4.bn2.weight", "front.layer3.4.bn2.bias", "front.layer3.4.bn2.running_mean", "front.layer3.4.bn2.running_var", "front.layer3.4.bn2.num_batches_tracked", "front.layer3.5.conv1.weight", "front.layer3.5.bn1.weight", "front.layer3.5.bn1.bias", "front.layer3.5.bn1.running_mean", "front.layer3.5.bn1.running_var", "front.layer3.5.bn1.num_batches_tracked", "front.layer3.5.conv2.weight", "front.layer3.5.bn2.weight", "front.layer3.5.bn2.bias", "front.layer3.5.bn2.running_mean", "front.layer3.5.bn2.running_var", "front.layer3.5.bn2.num_batches_tracked", "front.layer4.0.conv1.weight", "front.layer4.0.bn1.weight", "front.layer4.0.bn1.bias", "front.layer4.0.bn1.running_mean", "front.layer4.0.bn1.running_var", "front.layer4.0.bn1.num_batches_tracked", "front.layer4.0.conv2.weight", "front.layer4.0.bn2.weight", "front.layer4.0.bn2.bias", "front.layer4.0.bn2.running_mean", "front.layer4.0.bn2.running_var", "front.layer4.0.bn2.num_batches_tracked", "front.layer4.0.downsample.0.weight", "front.layer4.0.downsample.1.weight", "front.layer4.0.downsample.1.bias", "front.layer4.0.downsample.1.running_mean", "front.layer4.0.downsample.1.running_var", "front.layer4.0.downsample.1.num_batches_tracked", "front.layer4.1.conv1.weight", "front.layer4.1.bn1.weight", "front.layer4.1.bn1.bias", "front.layer4.1.bn1.running_mean", "front.layer4.1.bn1.running_var", "front.layer4.1.bn1.num_batches_tracked", "front.layer4.1.conv2.weight", "front.layer4.1.bn2.weight", "front.layer4.1.bn2.bias", "front.layer4.1.bn2.running_mean", "front.layer4.1.bn2.running_var", "front.layer4.1.bn2.num_batches_tracked", "front.layer4.2.conv1.weight", "front.layer4.2.bn1.weight", "front.layer4.2.bn1.bias", "front.layer4.2.bn1.running_mean", "front.layer4.2.bn1.running_var", "front.layer4.2.bn1.num_batches_tracked", "front.layer4.2.conv2.weight", 
"front.layer4.2.bn2.weight", "front.layer4.2.bn2.bias", "front.layer4.2.bn2.running_mean", "front.layer4.2.bn2.running_var", "front.layer4.2.bn2.num_batches_tracked", "pooling.attention.0.weight", "pooling.attention.0.bias", "pooling.attention.2.weight", "pooling.attention.2.bias", "pooling.attention.2.running_mean", "pooling.attention.2.running_var", "pooling.attention.2.num_batches_tracked", "pooling.attention.3.weight", "pooling.attention.3.bias", "bottleneck.weight", "bottleneck.bias".

Originally posted by @LiLiWangzz in https://github.com/pyannote/pyannote-audio/issues/1590#issuecomment-2406894842

clement-pages commented 1 month ago

Hey @LiLiWangzz, pyannote/audio/models/embedding/wespeaker/convert.py is not meant for that use case. Furthermore, SimAMResNetxx is not currently supported by pyannote, but feel free to open a pull request.
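
For anyone reading the error above, here is a minimal sketch of why renaming keys alone is unlikely to make this checkpoint load. It uses a small excerpt of the key names from the error message and plain Python sets, so no model or checkpoint file is needed. The `rename` helper and its two rules (dropping the `front.` prefix, mapping `downsample` to `shortcut`) are assumptions made up for illustration; they are not part of pyannote or wespeaker. The `num_batches_tracked` entries are dropped here only to keep the example short, since they do not appear among the missing keys.

```python
def diff_keys(expected, found):
    """Return (missing, unexpected) key lists, mirroring the two halves of the error."""
    expected, found = set(expected), set(found)
    return sorted(expected - found), sorted(found - expected)


def rename(key):
    """Hypothetical renaming rules guessed from the error message (an assumption)."""
    if key.startswith("front."):
        key = key[len("front."):]
    return key.replace(".downsample.", ".shortcut.")


# Excerpt of the keys the pyannote ResNet expects ("Missing key(s)" in the error).
model_keys = [
    "conv1.weight",
    "layer2.0.shortcut.0.weight",
    "seg_1.weight",
    "seg_1.bias",
]

# Excerpt of the keys found in the checkpoint ("Unexpected key(s)" in the error).
checkpoint_keys = [
    "front.conv1.weight",
    "front.bn1.num_batches_tracked",
    "front.layer2.0.downsample.0.weight",
    "pooling.attention.0.weight",
    "bottleneck.weight",
    "bottleneck.bias",
]

renamed = [
    rename(k) for k in checkpoint_keys if not k.endswith("num_batches_tracked")
]
missing, unexpected = diff_keys(model_keys, renamed)

print("still missing:   ", missing)
print("still unexpected:", unexpected)
```

Even after renaming, the `seg_1.*` keys remain missing and the `pooling.attention.*` and `bottleneck.*` keys remain unexpected. The key names alone do not say whether any of these layers correspond, which is consistent with the checkpoint coming from an architecture that the existing ResNet class does not implement.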