thomasmol / cog-whisper-diarization

Cog implementation of transcribing + diarization pipeline with Whisper & Pyannote
https://replicate.com/thomasmol/whisper-diarization
165 stars, 51 forks

Diarisation seems very inaccurate #6

Closed: callmephilip closed this issue 7 months ago

callmephilip commented 7 months ago

Hey Thomas. I am seeing very inaccurate speaker assignment on some test audio files (2 speakers per file, both male with fairly distinctive voices). What has your experience been overall?

thomasmol commented 7 months ago

Hi there. My experience is that it performs quite well. Can you give more details and examples?

callmephilip commented 7 months ago

here's an example https://replicate.com/p/j7yjql3bm6ofhkdh7rwyfd43ky

callmephilip commented 7 months ago

i reran this with an older version and it's looking much better - https://replicate.com/p/tqylf3tbokbpa7qhu4jo4p2p34. based on input from https://github.com/FanaHOVA/smol-podcaster/blob/main/smol_podcaster.py#L61

thomasmol commented 7 months ago

Could you try again with the latest version (b9fd8313c0d492bf1ce501b3d188f945389327730773ec1deb6ef233df6ea119)?

callmephilip commented 7 months ago

not working properly still https://replicate.com/p/lmtszi3bxadfoewofvteloh3ya

i am gonna stick to 7e5dafea13d80265ea436e51a310ae5103b9f16e2039f54de4eede3060a61617 for now, i think

callmephilip commented 7 months ago

Hey @thomasmol. Coming back to this, as I am trying to understand why I am seeing such a stark contrast in diarization quality. I have noticed that 7e5dafea13d80265ea436e51a310ae5103b9f16e2039f54de4eede3060a61617 uses the speechbrain/spkrec-ecapa-voxceleb model to get speaker embeddings, which are then manually clustered (?) to do speaker attribution.

what i am wondering is why you moved to pyannote/speaker-diarization-3.1 later? did you get better results with this new setup?
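For readers following along, the embed-then-cluster idea mentioned above can be sketched roughly as follows. This is not the code from either version of the model. It is a minimal illustration that assumes one embedding vector per transcript segment (in the older version these would come from speechbrain/spkrec-ecapa-voxceleb) and a known number of speakers. The function names and the choice of average-linkage clustering with cosine distance are assumptions made for the example.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity. Assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def cluster_speakers(embeddings: List[Sequence[float]], num_speakers: int) -> List[int]:
    """Hypothetical helper: average-linkage agglomerative clustering.

    Returns one speaker label per segment. Labels are numbered by the
    order in which each speaker first appears.
    """
    if num_speakers < 1:
        raise ValueError("num_speakers must be at least 1")

    # Start with every segment in its own cluster.
    clusters = [[i] for i in range(len(embeddings))]

    # Repeatedly merge the two closest clusters until num_speakers remain.
    while len(clusters) > num_speakers:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dists = [
                    cosine_distance(embeddings[p], embeddings[q])
                    for p in clusters[i]
                    for q in clusters[j]
                ]
                avg = sum(dists) / len(dists)
                if best is None or avg < best[0]:
                    best = (avg, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]

    # Assign labels so that the earliest segment always gets speaker 0.
    labels = [0] * len(embeddings)
    for label, members in enumerate(sorted(clusters, key=min)):
        for idx in members:
            labels[idx] = label
    return labels


# Toy 2-D "embeddings" standing in for real speaker embeddings.
toy_embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [1.0, 0.0]]
print(cluster_speakers(toy_embeddings, num_speakers=2))
```

With real embeddings the vectors have far more dimensions, and the result depends heavily on segment length and audio quality, so this only shows the shape of the approach.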

thomasmol commented 7 months ago

Hi Philip, I switched to pyannote 3.1 because it's a much improved model with more accurate diarization in all benchmarks. It has also been working better for me than the model I used earlier. Could you try the latest version of the Replicate model, but with group_segments set to true and a prompt containing names and other relevant words, with punctuation (e.g. "Thomas, Philip, diarization.")? This should improve the transcript quality and might help create better speaker segmentation.
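A call following this suggestion might look roughly like the sketch below, using the Replicate Python client. The model name and version hash are taken from this thread, and group_segments is the option named above. The input keys "file_url" and "prompt" and both helper functions are assumptions made for illustration, so check the model's input schema on Replicate before relying on them.

```python
from typing import Any, Dict, List

MODEL = "thomasmol/whisper-diarization"
VERSION = "b9fd8313c0d492bf1ce501b3d188f945389327730773ec1deb6ef233df6ea119"


def build_input(audio_url: str, names: List[str], extra_terms: List[str]) -> Dict[str, Any]:
    """Hypothetical helper: build the input payload for the model.

    The prompt is a comma-separated list of names and domain terms,
    ending with a period, as suggested in the comment above.
    """
    prompt = ", ".join(names + extra_terms) + "."
    return {
        "file_url": audio_url,   # assumed key name for the audio input
        "group_segments": True,  # option named in the thread
        "prompt": prompt,        # assumed key name for the prompt
    }


def run_diarization(audio_url: str, names: List[str], extra_terms: List[str]) -> Any:
    """Hypothetical helper: run the model on Replicate.

    Requires the third-party `replicate` package and a
    REPLICATE_API_TOKEN environment variable.
    """
    import replicate  # imported here so build_input works without it

    return replicate.run(
        f"{MODEL}:{VERSION}",
        input=build_input(audio_url, names, extra_terms),
    )


# Building the payload needs no network access or API token.
payload = build_input("https://example.com/audio.wav", ["Thomas", "Philip"], ["diarization"])
print(payload["prompt"])
```

Keeping the payload construction separate from the network call makes it easy to inspect the prompt before spending a prediction on it.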

MaximeDde commented 6 months ago

Hi @callmephilip, thank you for the insights! I've been seeing the same issue as you, and I get much better results after reverting to the version you mentioned. I have to say I do not know much about the benchmarks or how they are run, but that older version does seem to make a difference.