nttcslab-sp / mamba-diarization

Official repository for Mamba-based Segmentation Model for Speaker Diarization

How does the performance compare with CAM++? #3

Closed: MonolithFoundation closed this issue 4 weeks ago

MonolithFoundation commented 1 month ago

The model comes from FunASR.

jeremy110 commented 1 month ago

Cam++ is a feature extraction model, and mamba-diarization is a local EEND model, making direct comparison impossible.

MonolithFoundation commented 1 month ago

Hi, CAM++ can also be used for speaker diarization. Which one would yield better performance?

(Also, I noticed that this method seems to only support 2 speakers?)

How about making mamba-diarization usable as a feature extractor too?

jeremy110 commented 1 month ago

Hi, speaker diarization methods fall into several categories (see the sketch below this list):

  1. Local EEND + global aggregation (e.g., pyannote)
  2. Fixed-window feature extraction (e.g., CAM++), which extracts an embedding per window and applies a clustering algorithm (likely what FunASR does)
  3. Encoder-decoder based attractors

So the architectures are entirely different.
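
To make the difference concrete, here is a minimal shape-level sketch of what the first two kinds of models produce. The arrays are random stand-ins for real model outputs; only the shapes and their interpretation matter:

```python
import numpy as np

# Approach 1: a local EEND model (what mamba-diarization is) emits
# per-frame, per-speaker activity probabilities, so overlapped speech
# is directly representable.
num_frames, max_speakers = 500, 3
eend_output = np.random.rand(num_frames, max_speakers)  # stand-in for model output
active = eend_output > 0.5  # several speakers can be active in the same frame

# Approach 2: an embedding model (CAM++-style) emits one vector per
# fixed window; a separate clustering step then groups windows by speaker.
num_windows, embedding_dim = 120, 192
embeddings = np.random.rand(num_windows, embedding_dim)  # stand-in for CAM++ output
```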

MonolithFoundation commented 1 month ago

Can mamba-diarization handle more than 2 speakers?

jeremy110 commented 1 month ago

Yes, you can set a different number of speakers; the corresponding window size for each speaker count can be found in the paper (a configuration sketch follows).
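
For example, with a pyannote-style training setup, it might look like the sketch below. The protocol name and the numeric values are illustrative assumptions; the window size matching each speaker count should be taken from the paper:

```python
from pyannote.database import registry
from pyannote.audio.tasks import SpeakerDiarization

# Load your dataset protocol (name is a placeholder, not a real dataset).
protocol = registry.get_protocol("MyDatabase.SpeakerDiarization.MyProtocol")

task = SpeakerDiarization(
    protocol,
    duration=10.0,             # window size in seconds (illustrative value)
    max_speakers_per_chunk=4,  # local speaker count the model predicts (illustrative value)
)
```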

MonolithFoundation commented 1 month ago

The paper's conclusion seemed to say that a limitation is that mamba-diarization only supports 2 speakers. Does that mean that, for example, 12 speakers would give worse results?

jeremy110 commented 1 month ago

Where does the conclusion mention the limitation of two speakers?

MonolithFoundation commented 1 month ago

Oh, it appears I mixed it up with another paper. Have you compared the performance against pyannote? Is it superior?

FrenchKrab commented 1 month ago

Yes, at least all results obtained in the paper suggest that :) pyannote's publicly available pipelines use (LSTM+SincNet) models (where SincNet is the feature extractor and the LSTM the "processing module").

And we found that for DER: (Mamba+WavLM) > (LSTM+WavLM) > (LSTM+SincNet)

I'm not sure I understand the feature extractor question. If you want features from our architecture, you'll need the ones generated by the WavLM module (a huge pretrained SSL model). If you want to use CAM++ as the feature extractor instead of WavLM, that's probably possible, but you will need to implement it yourself.
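
If the WavLM features themselves are what you're after, one way to get frame-level features is through the public Hugging Face checkpoint; this is a generic sketch, not necessarily how this repository wires WavLM in:

```python
import torch
from transformers import AutoFeatureExtractor, WavLMModel

# Pretrained WavLM (checkpoint name is one of the public Microsoft releases).
extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus")
model = WavLMModel.from_pretrained("microsoft/wavlm-base-plus").eval()

waveform = torch.randn(16000 * 5)  # 5 s of dummy 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

frame_features = outputs.last_hidden_state  # (1, num_frames, hidden_dim)
all_layers = outputs.hidden_states          # per-layer features; SSL-based models often combine these
```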

MonolithFoundation commented 1 month ago

Yes, that's exactly where my question lies. I haven't seen anyone compare CAM++ against SincNet or WavLM, regardless of the processing module.

For feature extraction performance, do you have any thoughts on which could be better?

FrenchKrab commented 1 month ago

I would guess that since WavLM is trained on overlapped speech (a tricky problem in diarization) and CAM++ does not seem to be, WavLM would still be better. For inference speed, CAM++ may be better. But that's just a blind guess; I may be wrong.

MonolithFoundation commented 1 month ago

@FrenchKrab thank you. One last question.

Comparing the EEND method against CAM++ in FunASR (which uses k-means clustering), which approach is better?

FrenchKrab commented 1 month ago

I'm not sure they are comparable. From what I've read, CAM++ is designed for speaker identification / extracting embeddings, which can be used in the process of speaker diarization (depending on the pipeline/method), but it can't really do speaker diarization by itself.
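
To illustrate where an embedding model fits: in a clustering-based pipeline it only fills the middle step, between speech segmentation and clustering. A minimal sketch, where `embed_fn` is a hypothetical stand-in for the embedding model (CAM++ would go there):

```python
import numpy as np
from sklearn.cluster import KMeans

def diarize_by_clustering(segments, embed_fn, num_speakers):
    """Clustering-based diarization: one speaker label per speech segment.

    segments:  list of (start, end) speech regions, e.g. from a VAD step
    embed_fn:  maps a segment to a speaker embedding vector
    """
    embeddings = np.stack([embed_fn(seg) for seg in segments])
    labels = KMeans(n_clusters=num_speakers, n_init=10).fit_predict(embeddings)
    # Each segment gets exactly one label, so overlapped speech cannot be
    # attributed to two speakers at once, unlike EEND-style models.
    return [(start, end, f"spk{label}") for (start, end), label in zip(segments, labels)]

# Toy usage with random vectors in place of a real embedding model:
segments = [(0.0, 1.5), (1.5, 3.0), (3.0, 4.2)]
result = diarize_by_clustering(segments, lambda seg: np.random.rand(192), num_speakers=2)
```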

MonolithFoundation commented 1 month ago

Yes, the comparison is not between CAM++ and EEND themselves, but between the final speaker diarization results: EEND versus CAM++ plus clustering.

Specifically: between EEND and clustering, which one is more powerful?

FrenchKrab commented 1 month ago

Not sure I understand, and I don't think I can answer any question about CAM++ performance for diarization; you should test it yourself.