devSJR opened 11 months ago
Definitely, looks very interesting.
Processing pipeline seems to be as follows:
Currently Speech Note does not recognize timestamps, so this would have to be implemented, but I already need timestamps for subtitles support anyway.
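For context, once timestamped transcript segments and diarization speaker turns are both available, combining them is essentially an interval-overlap problem. A minimal sketch of that merging step (the data shapes and function name here are hypothetical illustrations, not Speech Note's actual API):

```python
def assign_speakers(transcript, turns):
    """Label each transcript segment with the speaker whose turn overlaps it most."""
    labeled = []
    for seg in transcript:
        best, best_overlap = "unknown", 0.0
        for turn in turns:
            # overlap of [seg.start, seg.end] with [turn.start, turn.end]
            overlap = min(seg["end"], turn["end"]) - max(seg["start"], turn["start"])
            if overlap > best_overlap:
                best, best_overlap = turn["speaker"], overlap
        labeled.append({**seg, "speaker": best})
    return labeled

# Hypothetical STT output with timestamps (seconds):
transcript = [
    {"start": 0.0, "end": 2.5, "text": "Hello everyone."},
    {"start": 2.5, "end": 5.0, "text": "Hi, thanks for having me."},
]
# Hypothetical diarization output:
turns = [
    {"start": 0.0, "end": 2.4, "speaker": "SPEAKER_00"},
    {"start": 2.4, "end": 5.0, "speaker": "SPEAKER_01"},
]

labeled = assign_speakers(transcript, turns)
```

This is why timestamps are a prerequisite: without per-segment times on the STT side, there is nothing to align the speaker turns against.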
"Audio segmentation" is done with an extra model, pyannote/segmentation-3.0. To download this model you need a Hugging Face account, which might be problematic.
I will investigate what can be done.
I am not really in need of this myself, but I can imagine that other users might want it. I guess an easily accessible open-source audio segmentation model will appear eventually. Actually, it is quite surprising how much is available to users free of charge (even if it sometimes requires giving personal information such as an e-mail address).
I'm actually hoping for diarization. My use case (discourse analytics) benefits from a stable differentiation of speakers.
I did some research to find out what is possible. It looks as follows:
I must say, I don't like it, especially the "You need to agree to share your contact" part. Speech Note is a privacy-focused application, and "sharing your contact information" doesn't fit well.
whisper.cpp
and it's already integrated into Speech Note. The problem is that to use diarization you need to download a special model that combines STT and diarization. Currently this combined model is only available for English :/ I'll keep looking for a better solution...
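In this combined-model approach, the speaker change is emitted inline in the transcript text rather than as separate speaker tracks, so a client still has to post-process the output into per-speaker segments. A sketch of that post-processing, assuming a `[SPEAKER_TURN]` marker string and a two-party conversation (both assumptions for illustration, not verified whisper.cpp output):

```python
def split_speaker_turns(text, marker="[SPEAKER_TURN]"):
    """Split a transcript at speaker-turn markers and assign alternating labels."""
    parts = [p.strip() for p in text.split(marker)]
    return [
        # Alternating labels only hold for two speakers; real diarization
        # would need clustering to handle three or more.
        (f"Speaker {i % 2}", part)
        for i, part in enumerate(parts)
        if part
    ]

out = split_speaker_turns(
    "Hello there. [SPEAKER_TURN] Hi, how are you? [SPEAKER_TURN] Fine, thanks."
)
```

Note that this only says *that* the speaker changed, not *who* is speaking, which is a weaker guarantee than the stable speaker identities pyannote-style diarization provides.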
You are right. Privacy is an important asset. Regarding the “experimental” support for diarization in whisper.cpp, I think you should give it a go, if it is not too difficult to implement, and even if it is only for English at the moment.
I'd need it for German … I checked the thread above; from what I understand, it's still rather experimental. As a user I would be OK with accepting the terms to be able to diarize, as long as there's no viable alternative.
I found this repo, maybe it could be leveraged somehow https://github.com/m-bain/whisperX but it seems to rely on pyannote as well.
Yes, WhisperX uses the same pyannote models. Therefore you have to pass an HF token to use diarization :(
Just learned about this
https://joss.theoj.org/papers/10.21105/joss.05266
Diart: A Python Library for Real-Time Speaker Diarization
@devSJR unfortunately the same pyannote models are needed to make it work :(
https://github.com/juanmc2005/diart?tab=readme-ov-file#get-access-to--pyannote-models
OK, will keep looking
Possibly relevant sources (some are more directly usable than others):
Dsnote is great for STT using Whisper. For audio samples with several people speaking, e.g. podcasts, movies, etc., one ends up with messy text because Whisper doesn't do what's called "speaker diarization", i.e. identifying one voice or another. It seems there is a solution to this. Maybe you want to check the following:
https://ultracrepidarian.phfactor.net/