speechbrain / speechbrain

A PyTorch-based Speech Toolkit
http://speechbrain.github.io
Apache License 2.0

AMI-diarization #771

Closed G874713346 closed 3 years ago

G874713346 commented 3 years ago

I tried to run the AMI example: `python experiment.py hparams/ecapa_tdnn.yaml`, but it reported an error [screenshot]. I also downloaded the AMI dataset and modified the paths in the yaml file [screenshot]. These are the data and labels under the corresponding path [screenshot] [screenshot]. I found that the extracted AMI CSV file is 0 bytes. Is this the reason the training fails, or are my data and labels set up incorrectly? I hope you can give me an answer, thank you! [screenshot]

nauman-daw commented 3 years ago

This might be due to improper annotation paths. Your data path looks fine. Which annotations are you using? We use the AMI manual annotations v1.6.2. Can you please share your manual_annot path? What are the sub-directories in your manual_annot/ directory?
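
In case it helps, a quick way to list them (the manual_annot path below is a placeholder; which sub-directories should appear depends on the AMI manual annotations v1.6.2 release):

```python
# Quick sanity check: does the annotation path exist, and what does it contain?
# The path below is a placeholder; adjust it to your local setup.
import os
import sys

manual_annot = "/path/to/ami_manual_annotations_v1.6.2"  # placeholder path

if not os.path.isdir(manual_annot):
    sys.exit(f"manual_annot path does not exist: {manual_annot}")

print("Sub-directories under manual_annot/:")
for name in sorted(os.listdir(manual_annot)):
    if os.path.isdir(os.path.join(manual_annot, name)):
        print("  ", name)
```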

G874713346 commented 3 years ago

Thank you for your reply, it works. I have one more question: can this speaker diarization be carried out online? You mentioned Oracle VAD in experiment.py; I want to know how it works. Also, since you use spectral clustering to make the predictions, is this offline speaker diarization?

nauman-daw commented 3 years ago

Great!

  • Yes, like many other SOTA systems, our current recipe is an offline diarization system. Please feel free to contribute an online system; we would be happy to add it.
  • Oracle VAD is a common term used when the VAD details are taken from the ground truth (see the sketch after this list).
  • Yes, we use spectral clustering for the offline diarization.
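
Not part of the recipe itself, just a minimal sketch of what "oracle VAD" means in practice, assuming a standard NIST-style RTTM reference: the speech regions are taken from the ground-truth speaker turns instead of being predicted by a VAD model. The function name and path are illustrative.

```python
# Minimal sketch of "oracle VAD": read speech regions from the ground-truth
# RTTM instead of predicting them with a VAD model. Path is a placeholder.
from collections import defaultdict

def oracle_vad_from_rttm(rttm_path):
    """Return {recording_id: [(start, end), ...]} of merged speech regions."""
    turns = defaultdict(list)
    with open(rttm_path) as f:
        for line in f:
            parts = line.split()
            if not parts or parts[0] != "SPEAKER":
                continue
            rec_id, onset, dur = parts[1], float(parts[3]), float(parts[4])
            turns[rec_id].append((onset, onset + dur))

    # Merge overlapping/adjacent speaker turns into plain speech regions.
    speech = {}
    for rec_id, segs in turns.items():
        segs.sort()
        merged = [list(segs[0])]
        for start, end in segs[1:]:
            if start <= merged[-1][1]:
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        speech[rec_id] = [tuple(s) for s in merged]
    return speech

# speech_regions = oracle_vad_from_rttm("/path/to/reference.rttm")  # placeholder
```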

xxoospring commented 3 years ago

Can we also assume that silence is a "speaker" and simply ignore the VAD/SAD? Then we could treat the audio as an n+1-speaker mixture, where the extra "1" is silence.

nauman-daw commented 3 years ago

Hi @xxoospring ,

Given the current SOTA approaches, I would prefer a VAD. If I am not wrong, some papers have also tried this earlier. You would need to edit the data_prep accordingly. In fact, I also tried this a couple of years ago (but with some other method). As far as I remember, it worked in some cases, while in other cases speakers with overlapping sections formed a single cluster, and some cases went random. I did not perform a detailed analysis of it. I think it is challenging (but interesting) to generalize unsupervised clustering here, as the types of "silence" can vary significantly. Not very sure, but you may also want to look at some augmentation/supervised approaches where different silences are forced to have the same label :)
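
For what it's worth, here is a rough, hypothetical sketch of the idea discussed above (not something the AMI recipe does): skip the VAD, cluster all sub-segment embeddings into n+1 clusters, and use a simple energy heuristic to decide which cluster is "silence". The embedding extraction is left abstract, and the energy heuristic is only one possible way to pick the silence cluster.

```python
# Rough sketch of the "silence as an extra speaker" idea: cluster every
# sub-segment (no VAD) into n+1 clusters and discard the one that looks
# most like silence. Everything here is illustrative, not the recipe code.
import numpy as np
from sklearn.cluster import SpectralClustering

def diarize_without_vad(embeddings, energies, n_speakers):
    """embeddings: (num_subsegs, dim); energies: (num_subsegs,) mean frame energy."""
    labels = SpectralClustering(
        n_clusters=n_speakers + 1,        # one extra cluster reserved for silence
        affinity="nearest_neighbors",
    ).fit_predict(embeddings)

    # Heuristic: the cluster with the lowest average energy is treated as silence.
    mean_energy = [energies[labels == k].mean() for k in range(n_speakers + 1)]
    silence_cluster = int(np.argmin(mean_energy))

    # Mark silence sub-segments with -1; the rest keep their speaker cluster id.
    return np.where(labels == silence_cluster, -1, labels)
```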

nauman-daw commented 3 years ago

Looks like this issue is resolved. I will close this.

akishorekr commented 1 year ago

Is the segmentation part not included in the diarization? I mean, the reference RTTM and system RTTM files have the same segment start times and durations.

nauman-daw commented 1 year ago

Hi @akishorekr , We perform segmentation by creating sub-segments of 3 seconds. Please see Section 3.3 of https://arxiv.org/pdf/2104.01466.pdf and also the variables max_subseg_dur and overlap=1.5 at: https://github.com/speechbrain/speechbrain/blob/6fa5a6b1162fc47e715c4b9753482c9db9e1d03e/recipes/AMI/ami_prepare.py#L35

So you can expect one sub-segment every 1.5 seconds. If the segments in your custom dataset are shorter than 3 seconds, please update the hyperparameters max_subseg_dur and overlap in case you wish to further reduce the sub-segment size.
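
A minimal sketch of the sliding-window sub-segmentation described above (3-second windows shifted by 1.5 seconds, i.e. 1.5 s overlap); the function name and segment representation are illustrative, not a copy of the ami_prepare.py code.

```python
# Illustrative sliding-window sub-segmentation: 3 s windows, 1.5 s shift.
def split_into_subsegments(seg_start, seg_end, max_subseg_dur=3.0, overlap=1.5):
    shift = max_subseg_dur - overlap
    subsegs = []
    start = seg_start
    while start + max_subseg_dur < seg_end:
        subsegs.append((start, start + max_subseg_dur))
        start += shift
    subsegs.append((start, seg_end))  # last (possibly shorter) sub-segment
    return subsegs

# A 7-second segment yields a sub-segment every 1.5 seconds:
print(split_into_subsegments(0.0, 7.0))
# [(0.0, 3.0), (1.5, 4.5), (3.0, 6.0), (4.5, 7.0)]
```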

akishorekr commented 1 year ago

Thank you, @nauman-daw. What if we don't have reference RTTM labels? Does the code still run on its own to discover the speaker segments and reject the silence segments? Thank you.