Open hmehdi515 opened 1 month ago
Hi @hmehdi515,
First of all, your sample rate should be 16000. The example in the README must be old. I removed the sample_rate attribute from the model to make it easier to integrate custom models. Would you mind creating a PR to fix the example? It would be greatly appreciated!
Concerning the 3-speaker output, this is normal: it depends on the maximum number of speakers that the segmentation model can predict. In this case, the segmentation output is a matrix of shape (num_speakers=3, num_frames). To get the embeddings corresponding to "active" speakers, you should filter based on the segmentation activation. For example, in the diarization pipeline we use the tau_active threshold, which applies the following rule: if any predicted speaker S has at least 1 frame where the probability of speech p(S) satisfies p(S) >= tau_active, then S is considered "active" and we keep its embedding.
Bear in mind that this is not necessarily the best rule for every use case, so I encourage you to try different alternatives.
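To make the rule above concrete, here is a minimal sketch in plain Python. It is not the pipeline's actual code: the function name select_active_speakers, the toy values, and the default threshold of 0.5 are assumptions made for illustration. It assumes the segmentation is laid out as (num_speakers, num_frames), as described above, with one embedding per speaker row.

```python
def select_active_speakers(segmentation, embeddings, tau_active=0.5):
    """Keep the embeddings of speakers with at least one frame where p(S) >= tau_active.

    segmentation: list of num_speakers rows, each a list of per-frame speech probabilities
    embeddings:   list of num_speakers embedding vectors, in the same speaker order
    Both the name and the default threshold are hypothetical, chosen for this sketch.
    """
    active = [
        idx
        for idx, probs in enumerate(segmentation)
        if any(p >= tau_active for p in probs)
    ]
    return active, [embeddings[idx] for idx in active]


# Toy example: 3 predicted speakers, 4 frames, only the first one actually talks
segmentation = [
    [0.1, 0.7, 0.9, 0.8],
    [0.0, 0.1, 0.2, 0.1],
    [0.3, 0.2, 0.1, 0.0],
]
embeddings = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],
]

active, active_embeddings = select_active_speakers(segmentation, embeddings)
print(active)             # [0]
print(active_embeddings)  # [[1.0, 0.0]]
```

With real model outputs the same check would be applied to the tensors returned by the segmentation and embedding models, after confirming which axis holds the speakers. A stricter alternative would be to require a minimum number of frames above the threshold instead of a single one.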
Thanks for your help. I submitted a PR with some changes.
Do you know how to change num_speakers on SpeakerSegmentation? I know that we could create a config for SpeakerDiarization, but I'm not sure if we can do something similar for SpeakerSegmentation.
Changing the number of speakers would require retraining or fine-tuning the segmentation model so that it produces a matrix of a different size (adding or removing speaker rows).
Hi,
I am trying to run a pipeline to extract embeddings.
The pipeline I am running is the one in the README:
However, SegmentationModel has no attribute sample_rate.
So I tried replacing it with
and all I get from output is :
Not sure why it is detecting 3 speakers when I am the only one talking. The entire output confuses me.
Any help is appreciated.
Edit: I did come across https://github.com/juanmc2005/diart/issues/214 but I am still not sure how to actually perform the embedding extraction.
Edit 2: Taking out .shape does print out values: