jymsuper / SpeakerRecognition_tutorial

Simple d-vector based Speaker Recognition (verification and identification) using Pytorch
MIT License

performance #4

Closed ooobsidian closed 4 years ago

ooobsidian commented 4 years ago

@jymsuper I want to know whether it can be used for verification (not identification) on an open set, that is, where the test speakers are not in the training dataset. If that is possible, I would like to know the performance.

jymsuper commented 4 years ago

Yes, it is possible. Actually, the uploaded enrollment and test files are all excluded from the training dataset. The uploaded wav files are all clean data, so the performance is quite good. If you want to test performance in more challenging conditions (noisier or shorter utterances, ...), you have to increase the amount of training data and the model size. A more advanced loss function or pooling method (e.g., attentive pooling) can also be used.
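For reference, attentive pooling replaces the plain temporal average of frame-level features with a learned weighted average. Below is a minimal PyTorch sketch of a self-attentive pooling layer; the class name and dimensions are illustrative and not part of this repository:

import torch
import torch.nn as nn

class SelfAttentivePooling(nn.Module):
    """Pool frame-level features (batch, n_frames, dim) into a single
    utterance-level embedding using learned attention weights."""
    def __init__(self, dim):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(dim, dim),
            nn.Tanh(),
            nn.Linear(dim, 1),
        )

    def forward(self, x):
        # x: (batch, n_frames, dim)
        w = torch.softmax(self.attention(x), dim=1)  # (batch, n_frames, 1)
        return (w * x).sum(dim=1)                    # (batch, dim)

# usage sketch: pooled = SelfAttentivePooling(64)(torch.randn(8, 100, 64))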

ooobsidian commented 4 years ago

Thank you for your reply. I don't know how many speakers ResNet-18 can distinguish. Should I switch to a larger model? My training data has 855 speakers, so what do you suggest?

jymsuper commented 4 years ago

I think ResNet-34 is a good fit for your conditions. You can also make the model wider (increase the number of channels). The best way is to run experiments with all of these options, if possible. In configure.py, NUM_WIN_SIZE (the number of input frames) is set to 100; increase this to 200 or 300. Since the training set in this tutorial is very small, I chose all the settings for a small dataset.
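To make those suggestions concrete, here is a hedged sketch of the edits. NUM_WIN_SIZE is the real setting named above; the model-construction line is an assumption, so check the repository's model definitions for the actual names:

# configure.py
NUM_WIN_SIZE = 200  # number of input frames; was 100, try 200 or 300

# train.py (illustrative): swap ResNet-18 for ResNet-34, and/or widen the
# channels by increasing each block's width; the constructor name and
# signature below are assumptions, not verified against this repo
model = resnet34(num_classes=n_classes)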

ooobsidian commented 4 years ago

Thank you very much for your help!!

ooobsidian commented 4 years ago

Hi @jymsuper, I use .npy as the feature file format, and I changed line 12 in SR_Dataset.py following #3, but I ran into trouble when running train.py.

Traceback (most recent call last):
  File "train.py", line 328, in <module>
    main()
  File "train.py", line 135, in main
    epoch, n_classes)
  File "train.py", line 175, in train
    for batch_idx, (data) in enumerate(train_loader):
  File "/root/miniconda3/envs/3.6.7/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/root/miniconda3/envs/3.6.7/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 385, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/root/miniconda3/envs/3.6.7/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/miniconda3/envs/3.6.7/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/data/source/speaker_recognition_pytorch/SR_Dataset.py", line 221, in __getitem__
    feature, label = self.loader(feat_path)
  File "/data/source/speaker_recognition_pytorch/SR_Dataset.py", line 16, in read_MFB
    feature = feat_and_label['feat']  # size : (n_frames, dim=40)
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

It occurred at feature = feat_and_label['feat'] in read_MFB (SR_Dataset.py, line 16), per the traceback above.

I also sent you an email; please check it, thanks.
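(A note on the error above: np.load on a plain .npy file returns a bare ndarray rather than the pickled dict that read_MFB expects, and indexing a float array with the string 'feat' raises exactly this IndexError. A minimal reproduction, with an illustrative file name:)

import numpy as np

feat = np.random.randn(100, 40).astype(np.float32)
np.save('example.npy', feat)              # saved as a bare array, no dict

feat_and_label = np.load('example.npy')   # ndarray, not a dict
feature = feat_and_label['feat']          # IndexError: only integers, slices ...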

ooobsidian commented 4 years ago

@jymsuper I have fixed the problems above; I changed the feature method during serialization. But now, in train.py:

transform = transforms.Compose([
        TruncatedInputfromMFB(),  # numpy array:(1, n_frames, n_dims)
        ToTensorInput()  # torch tensor:(1, n_dims, n_frames)
    ])

An error occurred in the ToTensorInput() method:

 File "/Users/obsidian/source/voiceprint_pytorch/SR_Dataset.py", line 127, in __call__
    (0, 2, 1))).float()  # output type => torch.FloatTensor, fast
ValueError: axes don't match array

Could you help me solve this problem? I have been debugging it for a long time ☹️
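(For context: the transpose at SR_Dataset.py line 127 permutes three axes, so it fails with "axes don't match array" whenever the feature is still 2-D, i.e. it never received its leading channel axis. A minimal illustration:)

import numpy as np

feat_2d = np.random.randn(100, 40)        # (n_frames, n_dims), 2-D
# np.transpose(feat_2d, (0, 2, 1))        # ValueError: axes don't match array

feat_3d = feat_2d[np.newaxis, ...]        # (1, n_frames, n_dims), 3-D
out = np.transpose(feat_3d, (0, 2, 1))    # (1, n_dims, n_frames) -- works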

jymsuper commented 4 years ago

You have to change the function read_MFB to suit your situation. Lines 12 to 16 load the feature (it is assumed the feature was saved using pickle) and the label. The feature size should be (n_frames, dim), as written in the comment. The label should be the speaker identity as a string.

You can remove lines 20 to 24; they are there because it is assumed that the beginning and end of the utterance are silence.
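Putting that advice together, here is a minimal sketch of a read_MFB replacement for .npy features, with the silence trimming dropped. The speaker-label-from-directory convention is an assumption, so adapt it to your own layout:

import os
import numpy as np

def read_MFB(feat_path):
    # Load a feature saved with np.save; expected size: (n_frames, dim=40)
    feature = np.load(feat_path)
    assert feature.ndim == 2, 'expected a 2-D (n_frames, dim) feature'
    # Assumption: features are stored as <root>/<speaker_id>/<file>.npy,
    # so the parent directory name is the speaker identity (a string)
    label = os.path.basename(os.path.dirname(feat_path))
    return feature, label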