Closed G874713346 closed 3 years ago
This might be due to improper paths of annotations. Your data path looks fine.
What annotations are you using? We use AMI manual annotations v1.6.2.
Can you please share your manual_annot path? What are the sub-directories in your manual_annot/ directory?
Thank you for your reply, it works. I have one more question: can this speaker diarization be carried out online? You mentioned Oracle VAD in experiment.py; I want to know how it works. And since you use spectral clustering to make predictions, is this an offline speaker diarization system?
Great!
- Yes, like many other SOTA systems our current recipe is an offline diarization system. Please feel free to contribute online systems, we would be happy to add them.
- Oracle VAD is a common term that is used when the VAD details are taken from the ground truth.
- Yes, we use spectral clustering in the offline diarization.
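To illustrate the offline clustering step, here is a minimal sketch of clustering per-segment speaker embeddings with spectral clustering. The random embeddings and the fixed number of speakers are placeholders for this example; the actual recipe extracts embeddings with an ECAPA-TDNN model and estimates the speaker count from the affinity eigengap.

```python
# Minimal sketch: offline diarization as spectral clustering of
# per-segment speaker embeddings. Embeddings here are synthetic
# stand-ins; in the recipe they come from an ECAPA-TDNN model.
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
# Fake embeddings: two well-separated "speakers", 10 segments each.
emb = np.vstack([
    rng.normal(0.0, 0.1, size=(10, 8)),
    rng.normal(1.0, 0.1, size=(10, 8)),
])

clusterer = SpectralClustering(
    n_clusters=2,                 # assumed known here; estimated in practice
    affinity="nearest_neighbors",
    n_neighbors=5,
    random_state=0,
)
labels = clusterer.fit_predict(emb)  # one speaker label per segment
print(labels)
```

Because the whole recording's segments are clustered jointly, the method is inherently offline: it needs all embeddings before assigning any speaker labels.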
Can we also assume that silence is a "speaker" and just ignore the VAD or SAD? Then we could treat the audio as an n+1 speaker mixture, where the "+1" is the silence "speaker".
Hi @xxoospring ,
Given the current SOTA approaches, I would prefer a VAD. If I am not wrong, some papers have also tried this earlier. You would need to edit the data_prep accordingly. In fact, I also tried this a couple of years ago (with a different method). As far as I remember, it worked in some cases, while in others the speakers with overlapping sections formed a single cluster, and some cases were essentially random. I did not perform a detailed analysis on it. I think it is challenging (but interesting) to generalize unsupervised clustering here, as the types of "silences" can vary significantly. Not very sure, but you may also want to look at augmentation/supervised approaches where different silences are forced to have the same label :)
Looks like this issue is resolved. I will close this.
Is the segmentation part not included in the diarization? I mean, the reference RTTM and system RTTM files have the same segment start times and durations.
Hi @akishorekr ,
We perform segmentation by creating sub-segments of 3 seconds.
Please see section 3.3: https://arxiv.org/pdf/2104.01466.pdf
See also the variables `max_subseg_dur` and `overlap=1.5` at: https://github.com/speechbrain/speechbrain/blob/6fa5a6b1162fc47e715c4b9753482c9db9e1d03e/recipes/AMI/ami_prepare.py#L35
So you can expect one segment every 1.5 seconds. If the segments in your custom dataset are shorter than 3 seconds, please update the hyperparameters `max_subseg_dur` and `overlap` in case you wish to further reduce the segment size.
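The sliding-window sub-segmentation described above can be sketched as follows. The function name and the exact splitting logic are a simplification for illustration, not the code from ami_prepare.py; only the hyperparameter names `max_subseg_dur` and `overlap` are taken from the recipe.

```python
# Sketch of splitting one VAD/reference segment into 3 s windows
# shifted by 1.5 s, so consecutive windows overlap by 1.5 s and a
# new sub-segment starts every 1.5 s (simplified illustration).
def split_segment(start, end, max_subseg_dur=3.0, overlap=1.5):
    """Return (sub_start, sub_end) pairs covering [start, end]."""
    if end - start <= max_subseg_dur:
        return [(start, end)]  # short segment: keep as-is
    subsegs = []
    cur = start
    while cur + max_subseg_dur < end:
        subsegs.append((cur, cur + max_subseg_dur))
        cur += overlap  # shift by 1.5 s -> one new window per 1.5 s
    subsegs.append((cur, end))  # final, possibly shorter, window
    return subsegs

print(split_segment(0.0, 7.0))
# -> [(0.0, 3.0), (1.5, 4.5), (3.0, 6.0), (4.5, 7.0)]
```

Note how a 7-second segment yields a window starting every 1.5 seconds, matching the "one segment every 1.5 sec" behavior mentioned above.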
Thank you @nauman-daw. What if we don't have reference RTTM labels? Does the code still run on its own to discover the speaker segments and reject the silence segments? Thank you.
I tried to run the AMI example: `python experiment.py hparams/ecapa_tdnn.yaml`, but it reported an error. I downloaded the AMI dataset and modified the paths in the YAML file; this is the data and labels under the corresponding path. I found that the extracted AMI CSV file contains 0 bytes. Is this the reason the training is 0? Or are my data and labels set up incorrectly? I hope you can give me an answer, thank you!