Using different data sets

All the lists and files required to run the main code xvector_NeuralPlda_pytorch.py should be specified in a config file such as https://github.com/iiscleap/NeuralPlda/blob/master/conf/sre_config.cfg

To generate the lists of training and validation trials and the x-vector dictionary, we run dataprep_*.py. This script differs from dataset to dataset, and we have provided some examples. You will need to modify the code for your own datasets. Here are some pointers to consider.

General procedure to create the x-vector dictionary mega_xvector_pkl (last stage of dataprep_*.py): This is the stage that uses the xvector.scp files generated by the Kaldi recipe. If you have a list of x-vectors [x1, x2, ..., xN] (a list of numpy arrays), the associated speaker labels [y1, y2, ..., yN], and the corresponding utterance ids [u1, u2, ..., uN], construct an x-vector dictionary of the form {u1:x1, u2:x2, ..., uN:xN} and save it as a pickle. The config file refers to this pickled file as mega_xvector_pkl. (NOTE: If you don't have utterance ids, you can simply name them u1='1', u2='2', ..., uN='N'.)
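As a rough illustration of the dictionary step, here is a minimal sketch. The function names build_xvector_dict and save_pickle are hypothetical and are not taken from dataprep_*.py. The x-vectors can be any picklable objects (numpy arrays in practice), and the fallback ids follow the NOTE above.

```python
import pickle


def build_xvector_dict(xvectors, utt_ids=None):
    """Return {u1: x1, ..., uN: xN} for a list of x-vectors.

    Hypothetical helper: `xvectors` is a list of numpy arrays (or any
    picklable vectors) and `utt_ids` is the matching list of utterance ids.
    """
    if utt_ids is None:
        # No utterance ids available: name them '1', '2', ..., 'N'.
        utt_ids = [str(i) for i in range(1, len(xvectors) + 1)]
    if len(utt_ids) != len(xvectors):
        raise ValueError("utt_ids and xvectors must have the same length")
    if len(set(utt_ids)) != len(utt_ids):
        raise ValueError("utterance ids must be unique")
    return dict(zip(utt_ids, xvectors))


def save_pickle(obj, path):
    """Write `obj` to `path` as a pickle file."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)
```

The path passed to save_pickle would then be the one set for mega_xvector_pkl in your config file.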

Procedure to generate the training and validation trials (pairs of utterances belonging to the same or different speakers): For this, we use the spk2utt files from the Kaldi toolkit. A spk2utt file is a text file that maps each speaker to all the utterance ids belonging to that speaker. If you need the trials to be gender matched, you must manually create a separate spk2utt file for each gender of each dataset you are using. Running stage 1 of dataprep_*.py with your spk2utt files will then generate the trials. Here, xvector.scp is used only to check whether each x-vector is present and to skip utterances whose x-vector is missing. This check can be skipped.
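To make the idea concrete, here is a minimal sketch of trial generation from spk2utt lines. It is not the sampling scheme used in dataprep_*.py. The function names, the (utt1, utt2, label) tuple format, and the random choice of non-target pairs are all assumptions made for illustration. Each spk2utt line is assumed to have the form "speaker utt1 utt2 ...".

```python
import itertools
import random


def parse_spk2utt(lines, available_utts=None):
    """Parse spk2utt lines into {speaker: [utt ids]}.

    If `available_utts` is given, utterances missing from it are skipped
    (the optional check against xvector.scp).
    """
    spk2utt = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # blank line or speaker without utterances
        spk, utts = parts[0], parts[1:]
        if available_utts is not None:
            utts = [u for u in utts if u in available_utts]
        if utts:
            spk2utt[spk] = utts
    return spk2utt


def make_trials(spk2utt, nontarget_per_utt=1, seed=0):
    """Return a list of (utt1, utt2, label); label 1 = same speaker."""
    rng = random.Random(seed)
    speakers = sorted(spk2utt)
    trials = []
    # Target trials: every pair of utterances from the same speaker.
    for spk in speakers:
        for u1, u2 in itertools.combinations(spk2utt[spk], 2):
            trials.append((u1, u2, 1))
    # Non-target trials: pair each utterance with random other speakers.
    for spk in speakers:
        others = [s for s in speakers if s != spk]
        if not others:
            continue
        for u1 in spk2utt[spk]:
            for _ in range(nontarget_per_utt):
                other = rng.choice(others)
                trials.append((u1, rng.choice(spk2utt[other]), 0))
    return trials
```

For gender-matched trials, you would call make_trials once per gender-specific spk2utt file and concatenate the results.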

Hope this helps.
