mkiol / dsnote

Speech Note Linux app. Note taking, reading and translating with offline Speech to Text, Text to Speech and Machine translation.
Mozilla Public License 2.0

speaker diarization #84

Open devSJR opened 9 months ago

devSJR commented 9 months ago

Dsnote is great for STT by using whisper. For audio samples with different persons speaking, e.g. podcasts, movies …, one ends up with a messy text because Whisper doesn’t do what’s called ‘speaker diarization’. That is, identifying one voice or another. It seems there is a solution to this. Maybe you want to check the following:

https://ultracrepidarian.phfactor.net/

mkiol commented 9 months ago

Definitely, looks very interesting.

Processing pipeline seems to be as follows:

  1. Audio transcription => "words" + timestamps
  2. Audio segmentation => "speaker-id" + timestamps
  3. Matching "words" to "speaker-id" based on timestamps
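Step 3 of the pipeline above can be sketched roughly as follows. This is only an illustration, not Speech Note code: the `Word` and `SpeakerTurn` types, the `UNKNOWN` label, and the "largest time overlap wins" rule are all assumptions made for the example. A real implementation would take its input from whatever the transcription and segmentation models actually return.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical output of step 1: a transcribed word with timestamps (seconds)."""
    text: str
    start: float
    end: float


@dataclass
class SpeakerTurn:
    """Hypothetical output of step 2: a speaker label with timestamps (seconds)."""
    speaker: str
    start: float
    end: float


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0.0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words, turns, unknown="UNKNOWN"):
    """Label each word with the speaker whose turn overlaps it the most.

    Assumption: words that overlap no turn at all get the `unknown` label.
    """
    labeled = []
    for w in words:
        best = max(
            turns,
            key=lambda t: overlap(w.start, w.end, t.start, t.end),
            default=None,
        )
        if best is None or overlap(w.start, w.end, best.start, best.end) == 0.0:
            labeled.append((unknown, w.text))
        else:
            labeled.append((best.speaker, w.text))
    return labeled


def group_by_speaker(labeled):
    """Merge consecutive words from the same speaker into one line of text."""
    lines = []
    for speaker, text in labeled:
        if lines and lines[-1][0] == speaker:
            lines[-1][1].append(text)
        else:
            lines.append((speaker, [text]))
    return [(speaker, " ".join(parts)) for speaker, parts in lines]


if __name__ == "__main__":
    words = [Word("hello", 0.0, 0.4), Word("there", 0.5, 0.9), Word("hi", 1.1, 1.3)]
    turns = [SpeakerTurn("SPEAKER_00", 0.0, 1.0), SpeakerTurn("SPEAKER_01", 1.0, 2.0)]
    for speaker, text in group_by_speaker(assign_speakers(words, turns)):
        print(f"{speaker}: {text}")
```

Matching per word rather than per sentence keeps the sketch simple, but it presupposes word-level timestamps. With only segment-level timestamps the same overlap rule could be applied to whole segments instead.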

Currently Speech Note does not recognize timestamps, so this has to be implemented, but I already need timestamps for subtitle support anyway.

"Audio segmentation" is done with an extra model, pyannote/segmentation-3.0. To download this model you need a Hugging Face account, which might be problematic.

I will investigate what can be done.

devSJR commented 9 months ago

I am not really in need of this, but can imagine that other users might want this. I guess eventually there will be an easily accessible open-source audio segmentation model. Actually, it is quite surprising that there is so much available for users free of charge (including giving some personal information like an e-mail-address).

thob commented 4 months ago

I'm actually hoping for diarization. My use case (discourse analytics) benefits from a stable differentiation between speakers.

mkiol commented 4 months ago

I did some research to find out what is possible. It looks as follows:

image

I must say, I don't like it, especially the "You need to agree to share your contact" part. Speech Note is a privacy-focused application, and "sharing your contact information" doesn't fit well.

I'll keep looking for a better solution...

devSJR commented 4 months ago

You are right. Privacy is an important asset. Regarding the “experimental” support for diarization in whisper.cpp, I think you should give it a go, if it is not too difficult to implement, and even if it is only for English at the moment.

thob commented 4 months ago

I'd need it for German … I checked the thread above, from what I understand it's still rather experimental. As a user I would be OK to accept the terms for being able to diarize – as long as there's no viable alternative.

thob commented 4 months ago

I found this repo, maybe it could be leveraged somehow https://github.com/m-bain/whisperX but it seems to rely on pyannote as well.

mkiol commented 4 months ago

Yes, WhisperX uses the same pyannote models, so you have to pass an HF token to use diarization :(

devSJR commented 2 months ago

Just learned about this

https://joss.theoj.org/papers/10.21105/joss.05266

Diart: A Python Library for Real-Time Speaker Diarization

mkiol commented 2 months ago

@devSJR unfortunately the same pyannote models are needed to make it work :(

https://github.com/juanmc2005/diart?tab=readme-ov-file#get-access-to--pyannote-models

devSJR commented 2 months ago

OK, will keep looking

machiav3lli commented 2 months ago

Possibly relevant sources (some are more directly usable than others):