Difference between stable-ts and whisperx

bryanyzhu commented 1 year ago

Hi, I noticed that you use the Whisper model from stable-ts for transcription instead of the WhisperX approach. I am curious about the difference and why you picked stable-ts, since WhisperX also claims to use a cut/merge/phoneme model to improve the correctness of timestamps. Does stable-ts show better performance? I ask because WhisperX is super fast compared to the vanilla Whisper model.

Also, I see in the README that the evaluation data usually contains 1~3 speakers. If the audio contains more speakers, sometimes even 10, will the current pipeline work out of the box? Thank you.

mirix commented 1 year ago

Hi @bryanyzhu,

When this procedure was developed, stable-ts seemed to provide more consistent splitting. This may have changed, as the WhisperX roadmap mentioned that improvements in this area were to be expected in upcoming versions. So perhaps it has already happened; I do not know. However, what I like about stable-ts is that it is easily configurable, which may come in handy for multilingual projects such as ours.

Yes, WhisperX is faster than vanilla Whisper, but perhaps stable-ts can be hacked to use something like Faster Whisper or even WhisperX (but just for the raw transcription).

Regarding the number of speakers, please try it and let me know. The results may still be reasonable, but they are unlikely to be optimal. My guess is that you will need to adjust the UMAP and HDBSCAN parameters for your specific dataset in order to obtain optimal results.

If you are familiar with these tools, you will probably have some idea of what may work in order to obtain more clusters. Otherwise, it will involve a learning curve and a considerable amount of trial and error.
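To give an idea of what that trial and error might look like, here is a minimal sketch of a parameter sweep. It is not the code from this repository. The grid values, the helper names (`expand`, `count_speakers`, `sweep`) and the idea of ranking runs by how close they come to a known speaker count are all assumptions made for illustration. Only the `umap.UMAP` and `hdbscan.HDBSCAN` parameter names are taken from those libraries. As a rough rule of thumb, smaller `n_neighbors` and `min_cluster_size` values tend to produce more, finer clusters, but that depends on the data.

```python
from itertools import product

# Hypothetical candidate values, chosen only for illustration.
UMAP_GRID = {"n_neighbors": [5, 10, 15], "min_dist": [0.0], "n_components": [5, 10]}
HDBSCAN_GRID = {"min_cluster_size": [5, 10, 20], "min_samples": [1, 5]}


def expand(grid):
    """Yield one dict per combination of the values in `grid`."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def count_speakers(labels):
    """Count distinct clusters, ignoring HDBSCAN's noise label (-1)."""
    return len({int(label) for label in labels if int(label) != -1})


def sweep(embeddings, expected_speakers):
    """Try every combination and rank runs by distance to the expected count.

    `embeddings` is assumed to be a 2-D array with one speaker embedding
    per audio segment; `expected_speakers` is assumed to be known for a
    small validation recording.
    """
    # Imported here so the helpers above work without these packages.
    import umap      # provided by the umap-learn package
    import hdbscan

    results = []
    for umap_params in expand(UMAP_GRID):
        reduced = umap.UMAP(random_state=0, **umap_params).fit_transform(embeddings)
        for hdbscan_params in expand(HDBSCAN_GRID):
            labels = hdbscan.HDBSCAN(**hdbscan_params).fit_predict(reduced)
            results.append((umap_params, hdbscan_params, count_speakers(labels)))
    # Runs whose cluster count is closest to the expected one come first.
    return sorted(results, key=lambda run: abs(run[2] - expected_speakers))
```

Calling `sweep(embeddings, expected_speakers=10)` on a recording where you know the number of speakers would give you a ranked list of settings to inspect by hand. The cluster count alone does not tell you whether segments were assigned to the right speaker, so treat the ranking as a starting point.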

The idea for the future would be to have some sort of ML or heuristics that adjust those parameters automatically. But that is a lot of work and I have moved on to something else.

bryanyzhu commented 1 year ago

Thanks a lot for your reply. Let me play with these tools for a bit and see whether the chosen parameters generalize. I'm new to this domain, so it will probably take some time.

mirix / approaches-to-diarisation