luferrer / DCA-PLDA

Discriminative Condition-Aware PLDA

Training with own data #8

Open HimaJyothi17 opened 5 months ago

HimaJyothi17 commented 5 months ago

Hi,

I would like to train the PLDA with embeddings from my own speaker verification model. Can you give some insights on how to create the train, val, and eval data?

Thank You

luferrer commented 5 months ago

Dear HimaJyothi17,

The example data is meant to help with that. Unfortunately, the sftp server where the data is hosted is down at the moment; I have asked the sysadmin to look into it. In the meantime, I have added a description of the inputs needed for training and evaluation at the end of the README file, and I am attaching a tar file with a subset of the contents of the sftp link (excluding the embeddings files and including only a subset of the lines in the training metadata file, so that it fits within the upload size limit here). You can look at the scripts inside examples/speaker_verification for an example of how to run the code using these lists, though you won't be able to run those scripts without the embeddings. Hopefully the sftp server will be back up soon and you'll be able to download the tar file with all the data needed to run those scripts.

I hope that helps and please let me know if you have any further questions.

Luciana

data.tar.gz

HimaJyothi17 commented 4 months ago

While initializing the trial loader, I'm getting this error:

Exception: Not enough session_ids for some combination of ['class_id', 'domain_id'] (there should be at least 2 session_id per ['class_id', 'domain_id'])

Any idea where I might be making a mistake in creating the data?

HimaJyothi17 commented 4 months ago

It worked when I changed min_len to 1 in line 211 of ./DCA-PLDA/dca_plda/data.py:

self.sessions_for_class_dom, self.sessi = self._init_index_and_list(['class_id', dom_col], 'session_id', min_len=2 if check_count_per_sess else 1)

How should I prepare the data to pass this condition?

luferrer commented 4 months ago

That check is there to make sure that a target trial can be created for every training speaker. For a target trial to be created, the two sides need to come from different sessions, so your dataset (the metadata file) should contain at least 2 sessions per speaker. You can also disable the check by setting check_count_per_sess to False in the training config. Still, I would advise you to make sure that most of your speakers have at least 2 sessions, otherwise the model will not be trained properly.
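A quick way to see which speakers would trip that check is to count the distinct sessions per (class_id, domain_id) pair in your metadata before training. Here is a minimal sketch (my own, not part of DCA-PLDA), assuming a whitespace-separated metadata file and the column names used in the error message; adjust the path and columns to your actual layout:

```python
import pandas as pd

# Hypothetical path and column layout; adapt to your own metadata file.
meta = pd.read_csv("train_metadata.tsv", sep=r"\s+",
                   names=["sample_id", "class_id", "session_id", "domain_id"])

# Number of distinct sessions for each (speaker, domain) pair.
sessions_per_pair = meta.groupby(["class_id", "domain_id"])["session_id"].nunique()

# Pairs with fewer than 2 sessions are the ones that trigger the exception.
offenders = sessions_per_pair[sessions_per_pair < 2]
print(f"{len(offenders)} (class_id, domain_id) pairs have fewer than 2 sessions")
print(offenders.head(20))
```

Speakers that show up there either need a second session in the metadata or should be dropped if you keep check_count_per_sess enabled.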

HimaJyothi17 commented 4 months ago

I'm getting this error while validating the first epoch:

RuntimeError: CUDA out of memory. Tried to allocate 18.15 GiB (GPU 0; 47.54 GiB total capacity; 20.60 GiB already allocated; 1.30 GiB free; 36.48 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.

I tried reducing the batch size, but it still fails at validation. I think multi-GPU support is not there. Any suggestions to resolve this?
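For reference, I understand the allocator hint in the message can be tried by setting the option before torch is imported; something like the sketch below (the 128 MiB split size is just a guess on my part), but it did not seem to be the root cause:

```python
import os

# Must be set before torch initializes CUDA, i.e. before "import torch"
# runs anywhere in the process.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # noqa: E402
```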

luferrer commented 4 months ago

How large is the validation set? How many enrollment and test ids? The issue may be the size of the score matrix.
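As a quick sanity check, you can estimate how much memory a dense float32 score matrix would need from the number of enrollment and test ids in your validation lists. A back-of-the-envelope sketch (mine, not taken from the DCA-PLDA code):

```python
# Memory for a dense score matrix: one float32 score per enrollment/test pair.
def score_matrix_gib(num_enroll: int, num_test: int, bytes_per_score: int = 4) -> float:
    return num_enroll * num_test * bytes_per_score / 1024**3

# For example, ~70k enrollment ids x ~70k test ids already needs ~18 GiB,
# which is in the range of the failed allocation reported above.
print(f"{score_matrix_gib(70_000, 70_000):.1f} GiB")
```

If the numbers come out in that range, trimming or splitting the validation trial lists is more likely to help than reducing the training batch size.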