Open mjtechguy opened 3 weeks ago
Hi, I've made a TODO list in the README and added it. I'll work on it later!
I'm testing whisperX and listing some issues here:
- whisperX depends on torch 1.10.0+cu102, while this WebUI uses torch 2.3.1+cu121
- transcription took 16.5 sec for 30 secs of audio input with large-v2
@jhj0517 Looking at the speaker diarization, it seems that it uses a different model from HF, so it can be integrated without the whisperX model. @mjtechguy
Yes, it seems that whisperX post-processes diarization on top of the faster-whisper result. So I think I should modularize the diarization and integrate it with faster-whisper.
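That post-processing step — giving each transcribed segment the speaker whose diarization turn overlaps it most — can be sketched in plain Python. This is only an illustration of the idea; the dict shapes and function names here are mine, not whisperX's or faster-whisper's actual types:

```python
def overlap(a_start, a_end, b_start, b_end):
    # Length (in seconds) of the intersection of two time intervals.
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(segments, turns):
    """Attach a speaker label to each transcription segment.

    segments: [{"start": float, "end": float, "text": str}, ...]  (from ASR)
    turns:    [{"start": float, "end": float, "speaker": str}, ...]  (from diarization)
    """
    for seg in segments:
        best_speaker, best_ov = None, 0.0
        for turn in turns:
            ov = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if ov > best_ov:
                best_speaker, best_ov = turn["speaker"], ov
        # Stays None if no diarization turn overlaps this segment at all.
        seg["speaker"] = best_speaker
    return segments
```

Because the matching only needs timestamps, the diarization module stays independent of which Whisper backend produced the segments.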
Speaker diarization is now enabled in #181.
Diarization is embedded into the text with a | divider. For example,
w/ diarization:
1
00:00:00,000 --> 00:00:04,879
SPEAKER_00|Now, as all books not primarily intended as picture books
2
00:00:04,879 --> 00:00:08,880
SPEAKER_00|consist principally of types composed to form letterpress,
w/o diarization:
1
00:00:00,000 --> 00:00:04,879
Now, as all books not primarily intended as picture books
2
00:00:04,879 --> 00:00:08,880
consist principally of types composed to form letterpress,
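Given that divider convention, downstream tools can split the speaker label back out of each subtitle line. A minimal sketch (the helper name is mine, not part of the WebUI):

```python
def split_speaker(line, divider="|"):
    # "SPEAKER_00|some text" -> ("SPEAKER_00", "some text")
    # A line without the divider is treated as having no speaker label.
    # maxsplit=1 keeps any later "|" characters inside the text intact.
    if divider in line:
        speaker, text = line.split(divider, 1)
        return speaker, text
    return None, line
```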
Note: To download the diarization model for the first time, you need a Huggingface token, and you must manually go to https://huggingface.co/pyannote/speaker-diarization-3.1 and agree to their terms.
@jhj0517 Trying the latest version with diarization, but I am getting this error. It seems it downloaded the model, but it didn't finish the diarization.
Traceback (most recent call last):
  File "/Whisper-WebUI/venv/lib/python3.11/site-packages/gradio/queueing.py", line 527, in process_events
    response = await route_utils.call_process_api(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Whisper-WebUI/venv/lib/python3.11/site-packages/gradio/route_utils.py", line 270, in call_process_api
    output = await app.get_blocks().process_api(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Whisper-WebUI/venv/lib/python3.11/site-packages/gradio/blocks.py", line 1856, in process_api
    data = await self.postprocess_data(fn_index, result["prediction"], state)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Whisper-WebUI/venv/lib/python3.11/site-packages/gradio/blocks.py", line 1634, in postprocess_data
    self.validate_outputs(fn_index, predictions)  # type: ignore
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Whisper-WebUI/venv/lib/python3.11/site-packages/gradio/blocks.py", line 1610, in validate_outputs
    raise ValueError(
ValueError: An event handler (transcribe_file) didn't receive enough output values (needed: 2, received: 1).
Wanted outputs:
    [<gradio.components.textbox.Textbox object at 0x78caea9c2590>, <gradio.templates.Files object at 0x78caea231350>]
Received outputs:
    [None]
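For what it's worth, the ValueError itself is just Gradio's output check firing: the handler is bound to two output components, but when the pipeline fails early it returns a single None. A simplified, Gradio-free illustration of that check (this is my sketch, not Gradio's actual code):

```python
def validate_outputs(predictions, needed):
    # Gradio-style check: a handler bound to N output components must
    # return N values; a bare None counts as one received value.
    if not isinstance(predictions, (tuple, list)):
        predictions = [predictions]
    if len(predictions) < needed:
        raise ValueError(
            f"didn't receive enough output values "
            f"(needed: {needed}, received: {len(predictions)})"
        )
    return predictions

def transcribe_file_failing():
    # If model loading raises and the handler swallows the exception,
    # it falls through to returning None instead of (result_text, files).
    return None
```

So the traceback is a symptom; the real failure (here, the gated model not loading) happens earlier and gets masked.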
@moda20 Can you show the full log before the Traceback? This could happen if the model failed to load.
To use the pyannote model, you need to go to its model page, manually accept its terms, and enter your Huggingface token.
It may be inconvenient, but it's their requirement for now. I hope there is a better way than this.
@jhj0517 Yes, accepting the conditions of the second segmentation HF model did the trick. I didn't see it in the README, that's why.
~EDIT: I am able to transcribe using small and small.en only; I run into the same error message as before for anything beyond those. Also, I don't get any logs before that error, although I am using the Docker version of the WebUI, so that might be why.~ Wrong alert, it was a VRAM issue.
@moda20 Running the diarization model on CPU may help in that case. You can change the device in the dropdown.
I accepted both terms of service for the stated models and added a read token, but it still gives an error.
When the file format is TXT, the first character of the output is hidden by the speaker delimiter. This may be difficult to understand in Japanese, but as shown below, the leading 文 is dropped when diarization is enabled.
w/ diarization: SPEAKER_04|部科学省の数理データサイエンスAI教育プログラム認定制度に SPEAKER_04|ータサイエンス教育プログラムの所持申請を行ったという報告がありまして、
w/o diarization: 文部科学省の数理データサイエンスAI教育プログラム認定制度に データサイエンス教育プログラムの所持申請を行ったという報告がありまして、
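A defensive way to write the TXT lines is to only ever prepend the speaker prefix and never slice the transcript text itself, so multibyte first characters like 文 survive intact. This formatter is hypothetical, not the WebUI's actual code:

```python
def format_txt_line(text, speaker=None, divider="|"):
    # Prepend the speaker label only when one exists; leave `text`
    # untouched so no leading character can be swallowed.
    return f"{speaker}{divider}{text}" if speaker else text
```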
@cookiexND Thanks for reporting this. It's fixed in #183
@Tom-Neverwinter Can you provide more information about the error you received?
I think it would be great to be able to leverage WhisperX and speaker diarization. Any plans to do this?
https://github.com/m-bain/whisperX