the accuracy of speaker diarization without cluster numbers

k2-fsa / sherpa-onnx

Speech-to-text, text-to-speech, speaker diarization, and VAD using next-gen Kaldi with onnxruntime without Internet connection. Support embedded systems, Android, iOS, Raspberry Pi, RISC-V, x86_64 servers, websocket server/client, C/C++, Python, Kotlin, C#, Go, NodeJS, Java, Swift, Dart, JavaScript, Flutter, Object Pascal, Lazarus, Rust

https://k2-fsa.github.io/sherpa/onnx/index.html

Apache License 2.0

the accuracy of speaker diarization without cluster numbers #1466

Open Xiaobx-lab opened 1 month ago

Xiaobx-lab commented 1 month ago

I tried processing a long audio file using '3dspeaker+segmentation.onnx', which you provided, but speaker diarization accuracy dropped sharply when the number of clusters was not specified. I also compared it with the speaker diarization 3.1 model provided by Pyannote on HuggingFace, which seems to perform better. However, I'm unsure how to deploy that model on Android. Could you please advise me on how to solve this issue?

csukuangfj commented 1 month ago

You need to tune the threshold by yourself.
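
For context: when the number of clusters is not given, the clustering step has to decide on its own when to stop merging speaker embeddings, and the threshold is what controls that. The following is a minimal sketch of that general idea in plain Python, not sherpa-onnx's actual implementation. The toy 2-D "embeddings", the function name `cluster_by_threshold`, and the choice of average linkage with cosine distance are all assumptions made for illustration.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def cluster_by_threshold(embeddings, threshold):
    """Hypothetical helper: average-linkage agglomerative clustering.

    Merging stops as soon as the two closest clusters are farther
    apart than `threshold`, so the threshold (not a preset count)
    determines how many clusters remain.
    """
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        # Average pairwise distance between members of two clusters.
        total = sum(
            cosine_distance(embeddings[i], embeddings[j])
            for i in c1
            for j in c2
        )
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > threshold:
            break  # closest pair is too far apart: stop merging
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy embeddings (made up): two tight pairs plus one point in between.
embeddings = [
    (1.0, 0.0), (0.98, 0.2),
    (0.0, 1.0), (0.2, 0.98),
    (0.7, 0.7),
]

for threshold in (0.01, 0.1, 0.5, 1.0):
    n = len(cluster_by_threshold(embeddings, threshold))
    print(f"threshold={threshold}: {n} clusters")
```

With these toy vectors the four thresholds yield 5, 3, 2, and 1 clusters respectively: a smaller threshold gives more speakers, a larger one gives fewer. Real speaker embeddings are high-dimensional and noisier, so a value that suits one recording may not suit another.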

Xiaobx-lab commented 1 month ago

> You need to tune the threshold by yourself.

I'm working on implementing a function for recording meeting transcripts, but I don’t know the exact number of clusters beforehand, and a single cluster threshold doesn’t seem suitable for different audio files. What approach should I take?