speaker diarization - Githubissues

mkiol / dsnote

Speech Note Linux app. Note taking, reading and translating with offline Speech to Text, Text to Speech and Machine translation.

Mozilla Public License 2.0

587 stars, 20 forks

speaker diarization #84

Open devSJR opened 11 months ago

devSJR commented 11 months ago

Dsnote is great for STT using Whisper. For audio with several people speaking (e.g. podcasts, movies …), one ends up with messy text because Whisper doesn’t do what’s called ‘speaker diarization’, that is, telling one voice from another. It seems there is a solution to this. Maybe you want to check the following:

https://ultracrepidarian.phfactor.net/

mkiol commented 11 months ago

Definitely, looks very interesting.

Processing pipeline seems to be as follows:

Audio transcription => "words" + timestamps
Audio segmentation => "speaker-id" + timestamps
Matching "words" to "speaker-id" based on timestamps
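
A minimal Python sketch of the third step, assuming the first two steps have already produced word and speaker timestamps. The `Word` and `SpeakerTurn` types, the function names, and the "largest time overlap wins" rule are illustrative assumptions, not the approach used by Speech Note or by the linked project:

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the audio
    end: float


@dataclass
class SpeakerTurn:
    speaker: str  # hypothetical label such as "SPEAKER_00"
    start: float
    end: float


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words, turns):
    """Label each word with the speaker whose turn overlaps it the most."""
    labeled = []
    for w in words:
        best = max(
            turns,
            key=lambda t: overlap(w.start, w.end, t.start, t.end),
            default=None,
        )
        if best is None or overlap(w.start, w.end, best.start, best.end) == 0.0:
            labeled.append(("unknown", w.text))  # no turn covers this word
        else:
            labeled.append((best.speaker, w.text))
    return labeled


def to_transcript(labeled):
    """Merge consecutive words from the same speaker into one line."""
    groups = []
    for speaker, text in labeled:
        if groups and groups[-1][0] == speaker:
            groups[-1][1].append(text)
        else:
            groups.append((speaker, [text]))
    return [f"{speaker}: {' '.join(ws)}" for speaker, ws in groups]


# Made-up example data, only to show the shape of the inputs.
words = [Word("Hello", 0.0, 0.4), Word("there", 0.5, 0.9), Word("Hi", 1.2, 1.5)]
turns = [SpeakerTurn("SPEAKER_00", 0.0, 1.0), SpeakerTurn("SPEAKER_01", 1.0, 2.0)]
transcript = to_transcript(assign_speakers(words, turns))
```

Words that fall into a gap between speaker turns end up labeled "unknown" here; a real implementation would probably snap them to the nearest turn instead.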

Currently Speech Note does not recognize timestamps, so this has to be implemented, but I already need timestamps for subtitle support anyway.

"Audio segmentation" is done with an extra model, pyannote/segmentation-3.0. To download this model you need a Hugging Face account, which might be problematic.

I will investigate what can be done.

devSJR commented 11 months ago

I am not really in need of this, but I can imagine that other users might want it. I guess eventually there will be an easily accessible open-source audio segmentation model. Actually, it is quite surprising that there is so much available for users free of charge (including giving some personal information like an e-mail address).

thob commented 6 months ago

I'm actually hoping for diarization. My use case (discourse analytics) benefits from a stable differentiation of speakers.

mkiol commented 6 months ago

I did some research to find out what is possible. It looks as follows:

Almost everyone uses the pyannote segmentation models for diarization. The models work well... but not perfectly. The main problem is that, although the models are released under the MIT license, you cannot simply download them: to get them from Hugging Face you need an account and have to agree to the models' access conditions.

I must say, I don't like it, especially this "You need to agree to share your contact". Speech Note is a privacy-focused application, and "sharing your contact information" doesn't fit well.

There is also “experimental” support for diarization in whisper.cpp. I love whisper.cpp and it's already integrated into Speech Note. The problem is that to use diarization you need to download a special model that combines STT and diarization. Currently this single model is only for English :/

I'll keep looking for a better solution...

devSJR commented 6 months ago

You are right. Privacy is an important asset. Regarding the “experimental” support for diarization in whisper.cpp, I think you should give it a go, if it is not too difficult to implement, and even if it is only for English at the moment.

thob commented 6 months ago

I'd need it for German … I checked the thread above, from what I understand it's still rather experimental. As a user I would be OK to accept the terms for being able to diarize – as long as there's no viable alternative.

thob commented 6 months ago

I found this repo, maybe it could be leveraged somehow https://github.com/m-bain/whisperX but it seems to rely on pyannote as well.

mkiol commented 6 months ago

Yes, WhisperX uses the same pyannote models. Therefore you have to pass a Hugging Face (HF) token to use diarization :(

devSJR commented 4 months ago

Just learned about this

https://joss.theoj.org/papers/10.21105/joss.05266

Diart: A Python Library for Real-Time Speaker Diarization

mkiol commented 4 months ago

@devSJR unfortunately the same pyannote models are needed to make it work :(

https://github.com/juanmc2005/diart?tab=readme-ov-file#get-access-to--pyannote-models

devSJR commented 4 months ago

OK, will keep looking

machiav3lli commented 4 months ago

Possibly relevant sources (some are more directly usable than others):