m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)

diarization different whisperx vs pyannote #386

Open MyraBaba opened 1 year ago

MyraBaba commented 1 year ago

Hi,

When I diarize with pyannote alone, the diarization is better, but whisperX assumes the speakers are the same person in some audio files.

But whisperX also uses pyannote, so how can this be?
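
For reference, a minimal sketch of the two paths being compared, pyannote on its own versus whisperX's diarization wrapper (assuming a valid Hugging Face token; the model name and the whisperx.DiarizationPipeline interface may differ across versions):

```python
import whisperx
from pyannote.audio import Pipeline

HF_TOKEN = "hf_..."   # your Hugging Face access token (placeholder)
AUDIO = "audio.wav"   # hypothetical path
DEVICE = "cuda"

# Path 1: pyannote's stock speaker-diarization pipeline
pyannote_pipe = Pipeline.from_pretrained(
    "pyannote/speaker-diarization", use_auth_token=HF_TOKEN
)
pyannote_result = pyannote_pipe(AUDIO)

# Path 2: whisperX's wrapper, which loads a pyannote pipeline under the hood
diarize_model = whisperx.DiarizationPipeline(use_auth_token=HF_TOKEN, device=DEVICE)
whisperx_segments = diarize_model(whisperx.load_audio(AUDIO))

print(pyannote_result)     # pyannote Annotation of speaker turns
print(whisperx_segments)   # DataFrame of start/end/speaker segments
```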

MyraBaba commented 1 year ago

@diasks2

@mirix What is interesting is that the wav file is 1 hour long and is not diarized correctly. But when I cut a 10-minute clip from the beginning, it is diarized perfectly. Strange...

Any idea?

mirix commented 1 year ago

If I remember correctly, this is a known "feature" of Whisper.

MyraBaba commented 1 year ago

@mirix Would you mind elaborating so I can understand better?

mirix commented 1 year ago

Check:

https://github.com/openai/whisper/discussions/136

Whisper processes your audio in 30-second chunks. There are some heuristics to glue things back together, but it may struggle in certain situations.

Another possibility is that your original audio file does not have the right sample rate (16000 Hz) but the converted chunks do.
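
A quick way to check what the files actually contain (a minimal sketch using the standard-library wave module; the paths are placeholders):

```python
import wave

# Placeholder paths: the original recording and a converted chunk
for path in ("original.wav", "chunk_16k.wav"):
    with wave.open(path, "rb") as w:
        print(path, w.getframerate(), "Hz,", w.getnchannels(), "channel(s)")
```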

Have a look here:

https://huggingface.co/docs/transformers/v4.24.0/en/model_doc/whisper

and perhaps also here:

https://huggingface.co/docs/transformers/v4.24.0/en/model_doc/whisper#transformers.WhisperProcessor
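
For the transformers side, the processor expects 16 kHz input explicitly; a rough sketch (openai/whisper-tiny is just an example checkpoint, and the audio array is a stand-in):

```python
import numpy as np
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")

# Stand-in: one second of silence, already 16 kHz mono float32
audio = np.zeros(16000, dtype=np.float32)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
print(inputs.input_features.shape)  # log-Mel features padded to a 30-second window
```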

MyraBaba commented 1 year ago

The sample rates:

Big Wav:

Input #0, wav, from 'Big.wav':
  Metadata:
    encoder : Lavf58.29.100
  Duration: 01:43:53.61, bitrate: 1536 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 48000 Hz, 2 channels, s16, 1536 kb/s

8-minute cut from the beginning:

Input #0, wav, from '8min_big.wav':
  Metadata:
    encoder : Lavf58.29.100
  Duration: 01:43:53.61, bitrate: 1536 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 48000 Hz, 2 channels, s16, 1536 kb/s

The sample rate is the same.

I faced this strange issue with that audio.

What would be the difference compared to pyannote's default diarization?


mirix commented 1 year ago

I believe that for Whisper you need to convert it to 16000 Hz, 1 channel (mono).

Perhaps pyannote does the pre-processing automatically.
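
If the pre-processing has to be done manually, something like this should do it (a minimal sketch that just shells out to ffmpeg; the paths are placeholders):

```python
import subprocess

# Resample a 48 kHz stereo WAV to the 16 kHz mono layout Whisper expects
subprocess.run(
    [
        "ffmpeg", "-y",
        "-i", "original.wav",  # placeholder input path
        "-ar", "16000",        # resample to 16 kHz
        "-ac", "1",            # downmix to mono
        "original_16k_mono.wav",
    ],
    check=True,
)
```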

MyraBaba commented 1 year ago

@mirix

I am attaching a Drive link to the audio here.

The small one is diarized correctly (you can check the first few minutes); the big one assumes there is only one speaker.

Maybe you can spot a bug/problem, and it could help improve whisperX.

https://drive.google.com/drive/folders/1dejZrhnuX-SE9u5laxzqhEuvEIOFHiSF?usp=sharing

Best