WhisperX - Speaker Diarization

mjtechguy commented 3 weeks ago

I think it would be great to be able to leverage WhisperX and speaker diarization. Any plans to do this?

https://github.com/m-bain/whisperX

jhj0517 commented 3 weeks ago

Hi, I've made a TODO list in the README and added this to it. I'll work on it later!

jhj0517 commented 1 week ago

I'm testing whisperX and listing some issues here:

Incompatible torch version
- whisperX models were trained on torch 1.10.0+cu102 and this WebUI uses torch 2.3.1+cu121
Slow transcription
- This may be due to an incompatible torch, but it was much slower than other implementations.
  16.5 sec for 30 secs of audio input with large-v2

moda20 commented 1 week ago

@jhj0517 Looking at the speaker diarization, it seems to use a separate model from HF, so it could be integrated without the whisperX model. @mjtechguy

jhj0517 commented 1 week ago

Yes, it seems that whisperX runs diarization as a post-processing step on the faster-whisper result. So I think I should modularize the diarization and integrate it with faster-whisper.
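To illustrate the post-processing idea, here is a minimal sketch that labels each transcript segment with the speaker whose diarization turns overlap it the most. The `Segment` and `Turn` types and the `assign_speakers` helper are hypothetical names for illustration only; this is not the code from whisperX or this WebUI, and the max-overlap rule is an assumption about how the two results could be merged.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Segment:
    """One transcribed segment (hypothetical shape, times in seconds)."""
    start: float
    end: float
    text: str
    speaker: Optional[str] = None


@dataclass
class Turn:
    """One diarization turn: a time span attributed to a speaker label."""
    start: float
    end: float
    speaker: str


def assign_speakers(segments: List[Segment], turns: List[Turn]) -> List[Segment]:
    """Label each segment with the speaker that overlaps it the longest."""
    for seg in segments:
        overlap_by_speaker: Dict[str, float] = {}
        for turn in turns:
            # Length of the intersection of the two time spans.
            overlap = min(seg.end, turn.end) - max(seg.start, turn.start)
            if overlap > 0:
                overlap_by_speaker[turn.speaker] = (
                    overlap_by_speaker.get(turn.speaker, 0.0) + overlap
                )
        if overlap_by_speaker:
            seg.speaker = max(overlap_by_speaker, key=overlap_by_speaker.get)
        # Segments with no overlapping turn keep speaker=None.
    return segments
```

Because the function only needs start and end times, the transcription backend and the diarization model stay independent of each other.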

jhj0517 commented 1 week ago

Speaker diarization is now enabled in #181.

The speaker label is embedded into the text with a | divider. For example,

w/ diarization:

1
00:00:00,000 --> 00:00:04,879
SPEAKER_00|Now, as all books not primarily intended as picture books

2
00:00:04,879 --> 00:00:08,880
SPEAKER_00|consist principally of types composed to form letterpress,

w/o diarization:

1
00:00:00,000 --> 00:00:04,879
Now, as all books not primarily intended as picture books

2
00:00:04,879 --> 00:00:08,880
consist principally of types composed to form letterpress,
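The output format above can be sketched roughly as follows. The helper names `srt_timestamp` and `to_srt` and the tuple layout are assumptions made for illustration, not the implementation in #181; only the `SPEAKER_xx|text` convention comes from the examples above.

```python
from typing import Iterable, Optional, Tuple

# (start_seconds, end_seconds, text, speaker or None) -- hypothetical layout
SegmentTuple = Tuple[float, float, str, Optional[str]]


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: Iterable[SegmentTuple], divider: str = "|") -> str:
    """Build SRT text, prefixing each line with the speaker when one is set."""
    blocks = []
    for index, (start, end, text, speaker) in enumerate(segments, start=1):
        # The full text is kept after the divider; nothing is trimmed.
        line = f"{speaker}{divider}{text}" if speaker else text
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{line}\n"
        )
    return "\n".join(blocks)
```

Passing `None` as the speaker yields the plain output shown in the example without diarization.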

Note: To download the diarization model for the first time, you need a Hugging Face token, and you must manually go to https://huggingface.co/pyannote/speaker-diarization-3.1 and agree to their terms.
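For reference, loading the gated model directly with pyannote.audio might look roughly like the sketch below. It assumes pyannote.audio 3.x, where `Pipeline.from_pretrained` accepts `use_auth_token` (newer releases may name this parameter differently) and, as far as I know, can return `None` when access to a gated model is denied. Reading the token from an `HF_TOKEN` environment variable and the `load_diarization_pipeline` name are assumptions for this example; the WebUI takes the token through its own interface.

```python
import os


def load_diarization_pipeline(device: str = "cpu"):
    """Load pyannote/speaker-diarization-3.1 using a Hugging Face token."""
    token = os.environ.get("HF_TOKEN")  # assumption: token supplied via env var
    if not token:
        raise RuntimeError(
            "Set HF_TOKEN to a Hugging Face access token with read permission."
        )

    # Imported lazily so the token check above works without the heavy deps.
    import torch
    from pyannote.audio import Pipeline

    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token=token,
    )
    if pipeline is None:
        # Assumed 3.x behavior when the model terms were not accepted.
        raise RuntimeError(
            "Could not load the pipeline. Check that the model terms were "
            "accepted on Hugging Face and that the token is valid."
        )

    # Use "cpu" when GPU memory is tight, "cuda" otherwise.
    pipeline.to(torch.device(device))
    return pipeline
```

Failing loudly here makes a missing token or unaccepted terms easier to tell apart from other errors.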

moda20 commented 1 week ago

@jhj0517 I'm trying the latest version with diarization, but I am getting this error. It seems it downloaded the model but didn't finish the diarization.

2024-06-26T19:36:55.526316618Z Traceback (most recent call last):
2024-06-26T19:36:55.526636537Z   File "/Whisper-WebUI/venv/lib/python3.11/site-packages/gradio/queueing.py", line 527, in process_events
2024-06-26T19:36:55.526654992Z     response = await route_utils.call_process_api(
2024-06-26T19:36:55.526661835Z                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-06-26T19:36:55.526667215Z   File "/Whisper-WebUI/venv/lib/python3.11/site-packages/gradio/route_utils.py", line 270, in call_process_api
2024-06-26T19:36:55.526672605Z     output = await app.get_blocks().process_api(
2024-06-26T19:36:55.526677936Z              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-06-26T19:36:55.526685630Z   File "/Whisper-WebUI/venv/lib/python3.11/site-packages/gradio/blocks.py", line 1856, in process_api
2024-06-26T19:36:55.526693645Z     data = await self.postprocess_data(fn_index, result["prediction"], state)
2024-06-26T19:36:55.526700999Z            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-06-26T19:36:55.526709536Z   File "/Whisper-WebUI/venv/lib/python3.11/site-packages/gradio/blocks.py", line 1634, in postprocess_data
2024-06-26T19:36:55.526717781Z     self.validate_outputs(fn_index, predictions)  # type: ignore
2024-06-26T19:36:55.526725736Z     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-06-26T19:36:55.526734253Z   File "/Whisper-WebUI/venv/lib/python3.11/site-packages/gradio/blocks.py", line 1610, in validate_outputs
2024-06-26T19:36:55.526743881Z     raise ValueError(
2024-06-26T19:36:55.526752507Z ValueError: An event handler (transcribe_file) didn't receive enough output values (needed: 2, received: 1).
2024-06-26T19:36:55.526760563Z Wanted outputs:
2024-06-26T19:36:55.526767927Z     [<gradio.components.textbox.Textbox object at 0x78caea9c2590>, <gradio.templates.Files object at 0x78caea231350>]
2024-06-26T19:36:55.526795169Z Received outputs:
2024-06-26T19:36:55.526800238Z     [None]

jhj0517 commented 1 week ago

@moda20 Can you show the full log before the Traceback? This could happen if the model failed to load.
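The `ValueError` in the traceback says the handler returned one value where Gradio expected two (a textbox and a file list). As a minimal sketch of one way to guard against that, a wrapper can always return a pair and surface the underlying failure as text. The `always_two_outputs` decorator is a hypothetical helper, not part of Gradio or this WebUI, and how `transcribe_file` actually handles errors is not shown in this thread.

```python
import functools
from typing import Any, Callable, Tuple


def always_two_outputs(handler: Callable[..., Any]) -> Callable[..., Tuple[Any, Any]]:
    """Make a handler always return (message, files) for two output components."""

    @functools.wraps(handler)
    def wrapper(*args: Any, **kwargs: Any) -> Tuple[Any, Any]:
        try:
            result = handler(*args, **kwargs)
        except Exception as exc:
            # Report the real cause instead of an output-count error.
            return f"Transcription failed: {exc}", None
        if isinstance(result, (tuple, list)) and len(result) == 2:
            return result[0], result[1]
        # The handler returned None or a value of the wrong shape.
        return "Transcription failed: handler returned no result", None

    return wrapper
```

With a guard like this, a model that fails to load would show its own error message in the textbox rather than the output-count `ValueError`.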

To use the pyannote models, you need to go to their model pages on Hugging Face, manually accept the terms, and enter your Hugging Face token.

It may be inconvenient, but it's their requirement for now. I hope there is a better way than this.

moda20 commented 1 week ago

@jhj0517 Yes, accepting the conditions of the second HF model (segmentation) did the trick. I didn't see it in the README, which is why I missed it.

~EDIT: I am able to transcribe using small and small.en only. I run into the same error message as before for anything beyond those. Also, I don't get any logs before that error, although I am using the Docker version of the web UI, so that might be the reason why.~ False alarm, it was a VRAM issue.

jhj0517 commented 1 week ago

@moda20 Running the diarization models on the CPU may help in that case. You can change the device in the dropdown.

Tom-Neverwinter commented 1 week ago

I accepted the terms of service for both of the stated models and added a read token, but then it gives an error.

cookiexND commented 1 week ago

When the file format is TXT, the first character of each output line is hidden by the speaker delimiter. This may be difficult to understand in Japanese, but it is as follows.

w/ diarization:

SPEAKER_04|部科学省の数理データサイエンスAI教育プログラム認定制度に
SPEAKER_04|ータサイエンス教育プログラムの所持申請を行ったという報告がありまして、

w/o diarization:

文部科学省の数理データサイエンスAI教育プログラム認定制度に
データサイエンス教育プログラムの所持申請を行ったという報告がありまして、

jhj0517 commented 1 week ago

@cookiexND Thanks for reporting this. It's fixed in #183.

@Tom-Neverwinter Can you provide more information about the error you received?

jhj0517 / Whisper-WebUI

WhisperX - Speaker Diarization #168