Open seset opened 1 year ago
Add the following
--max_line_width 42 --max_line_count 2
Add the following
--max_line_width 42 --max_line_count 2
Thanks, it helps somewhat. But when I try a Japanese or Chinese source file, there are a lot of repetitive words/letters/characters in the SRT and VTT output, while the JSON and TXT content are correct. Below is a Chinese example mixed with English to make it easy to see what went wrong. Since the JSON file is correct, hopefully this minor error can be fixed soon :)
Correct content (JSON and TXT):
[JSON] [segments] [0] [start : 0.6180930656934307] [end : 22.254860401459855] [text : "如何巧妙的去使用多个不同的ControlNet并用到Stable Diffusion的一些最新插件去精准的控制画面 生成一些这样的图片我研究了一整个星期 尝试了MultiControlNet不同的排列组合调整了很多不同的参数 把坑都踩过了一遍给你们整理了12种高阶用法的合计和所需要的所有参数"] [words]
Wrong and repetitive (SRT, VTT):
1
00:00:00,618 --> 00:00:22,235
如何 何巧 巧妙 妙的 的去 去使 使用 用多 多个 个不 不同 同的 的C Co on nt tr ro ol lN Ne et t并 并用 用到 到S St ta ab bl le e Di if ff fu us si io on n的 的一 一些 些最 最新 新插 插件 件去 去精 精准 准的 的控 控制 制画 画面 面 生成 成一 一些 些这 这样 样的 的图 图片 片我 我研 研究 究了 了一 一整 整个 个星 星期 期 尝试 试了 了M Mu ul lt ti iC Co on nt tr ro ol lN Ne et t不 不同 同的 的排 排列 列组 组合 合调 调整 整了 了很 很多 多不 不同 同的 的参 参数 数 把坑 坑都 都踩 踩过 过了 了一 一遍 遍给 给你 你们 们整 整理 理了 了12 2种高 高阶 阶用 用法 法的 的合 合计 计和 和所 所需 需要 要的 的所 所有 有参 参数 数
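For anyone hitting this before a proper fix lands: the broken output above is a mechanical overlapping-bigram pattern (each token repeats the last character of the previous one), so it can be undone after the fact. This is a hypothetical post-processing sketch, not part of whisperX:

```python
def collapse_bigrams(line):
    """Collapse the overlapping-bigram duplication seen in the broken
    SRT/VTT output: each space-separated token starts with the previous
    token's last character, so keep only the non-overlapping tail."""
    tokens = line.split()
    if not tokens:
        return ""
    out = tokens[0]
    for tok in tokens[1:]:
        if tok.startswith(out[-1]):
            out += tok[1:]  # drop the duplicated leading character
        else:
            out += tok      # no overlap, just append
    return out
```

You would run this over each cue's text line; the JSON/TXT output is already correct, so this only matters for the subtitle writers.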
Add the following
--max_line_width 42 --max_line_count 2
this works for me, thanks :pray:
Although these options help somewhat, I'd definitely say v3 produces worse subtitle formatting than the older version: it often breaks single words off sentences (there's no obvious way to find the perfect max line width) and joins unrelated chunks that just worked before. Do you see any way to improve this? Here's an example:
Old version:
~~ Transcribing VAD chunk: (00:05.324 --> 00:34.045) ~~
[00:00.000 --> 00:01.840] Ik heb jullie zeer. Spreek voor jezelf, hè.
[00:01.920 --> 00:03.880] Attention, please. This is Lancelot.
[00:03.960 --> 00:06.160] Clap, clap. Switch seats.
[00:06.240 --> 00:07.920] Keep quets. Perfect.
[00:09.800 --> 00:12.680] Zoek ik al een boek over België. Ik ben de weg kwijt.
[00:17.080 --> 00:19.480] Ik ben het de man.
[00:19.560 --> 00:21.080] Je ziet het rare, jongen.
[00:23.360 --> 00:24.480] Hallo. Hallo.
[00:24.560 --> 00:25.840] Jullie zijn met een bus aan het rijden.
[00:25.920 --> 00:26.640] Kijk, ja.
[00:26.720 --> 00:28.120] Op die bus in een boom.
[00:28.200 --> 00:29.200] Wat?
New version:
0:00:05.40,0:00:06.82: Ik heb jullie zeer. Spreek voor jezelf,
0:00:06.84,0:00:09.28: hè. Attention, please. This is Lancelot.
0:00:09.34,0:00:12.66: Clap, clap. Switch seats. Keep quets.
0:00:12.70,0:00:16.73: Perfect. Zoek ik al een boek over België.
0:00:16.81,0:00:22.51: Ik ben de weg kwijt.
0:00:22.55,0:00:25.71: Ik ben het de man. Je ziet het raar,
0:00:25.75,0:00:29.36: jongen. Hallo.
0:00:29.40,0:00:31.30: Hallo. Jullie zijn met een bus aan het
0:00:31.44,0:00:34.00: rijden? Ja. Op die bus in een boom? Wat?
Yes, thanks for reporting — I found similar. Unfortunately the natural segments from Whisper cannot be extracted in the current batched method.
Definitely the logic for post-processing the 30s chunks into segments needs to be improved. I would suggest the following:
Using the nltk toolbox's nltk.sent_tokenize, tokenize the text into sentences (create a segment for each sentence). Unfortunately my spare time is going into improving diarization right now, but feel free to send a pull request with improvements to this logic.
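A minimal sketch of that suggestion (a hypothetical helper, not whisperX API). A naive regex splitter stands in for nltk.sent_tokenize, which would be a drop-in swap; timestamps are apportioned by character share, which is only a rough heuristic:

```python
import re

def split_segment(segment):
    """Split one long segment into per-sentence segments.
    `segment` is assumed to be {"text", "start", "end"}; timestamps are
    distributed across sentences proportionally to sentence length."""
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', segment["text"].strip()) if s]
    total = sum(len(s) for s in sentences) or 1
    t, dur = segment["start"], segment["end"] - segment["start"]
    out = []
    for s in sentences:
        step = dur * len(s) / total
        out.append({"text": s, "start": round(t, 3), "end": round(t + step, 3)})
        t += step
    return out
```

With word-level timestamps available (as after whisperX alignment), snapping sentence boundaries to actual word times would be more accurate than this proportional split.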
Anyone working on a better segmentation right now? Otherwise I'd take a look at it.
So actually I'm on it at the moment, as I need sentence segments for diarization (the alignment logic also needed cleaning up); will push in an hour or so.
Should be hopefully fixed here https://github.com/m-bain/whisperX/commit/24008aa1ed67c4f75c90107b4937178a1452519d
Sometimes nltk.sent_tokenize can create segments that are too short, but I found it's good overall. It also improves the diarization.
Thanks a lot @m-bain, it's much improved. No broken sentences, though I do see the short segments appearing. That could probably be fixed by combining those segments in a post-processing step; it doesn't necessarily need to be in WhisperX.
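Such a merging pass could be sketched like this (illustrative thresholds, not part of WhisperX):

```python
def merge_short_segments(segments, min_dur=1.0, max_gap=2.0, max_chars=84):
    """Fold segments shorter than `min_dur` seconds into the previous
    segment, but only when they start within `max_gap` seconds of it and
    the merged text stays under `max_chars` (all thresholds are guesses
    to tune, not values from WhisperX)."""
    merged = []
    for seg in segments:
        short = seg["end"] - seg["start"] < min_dur
        if (merged and short
                and seg["start"] - merged[-1]["end"] <= max_gap
                and len(merged[-1]["text"]) + len(seg["text"]) + 1 <= max_chars):
            merged[-1]["text"] += " " + seg["text"]
            merged[-1]["end"] = seg["end"]
        else:
            merged.append(dict(seg))
    return merged
```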
Here are the results of the same file on the new version:
00:00:05,404 --> 00:00:06,165 Ik heb jullie zeer.
00:00:06,165 --> 00:00:06,985 Spreek voor jezelf, hè.
00:00:06,985 --> 00:00:08,246 Attention, please.
00:00:08,246 --> 00:00:09,347 This is Lancelot.
00:00:09,347 --> 00:00:10,228 Clap, clap.
00:00:10,228 --> 00:00:11,689 Switch seats.
00:00:11,689 --> 00:00:12,709 Keep quets.
00:00:12,709 --> 00:00:15,291 Perfect.
00:00:15,291 --> 00:00:16,812 Zoek ik al een boek over België.
00:00:16,812 --> 00:00:22,557 Ik ben de weg kwijt.
00:00:22,557 --> 00:00:24,978 Ik ben het de man.
00:00:24,978 --> 00:00:26,139 Je ziet het raar, jongen.
00:00:26,139 --> 00:00:29,402 Hallo.
00:00:29,402 --> 00:00:29,782 Hallo.
00:00:29,782 --> 00:00:31,803 Jullie zijn met een bus aan het rijden?
00:00:31,803 --> 00:00:32,064 Ja.
00:00:32,064 --> 00:00:33,685 Op die bus in een boom?
00:00:33,685 --> 00:00:34,005 Wat?
The biggest issue I see now is that each subtitle's end time appears to be the start of the next one even when this isn't accurate.
eg. The "Perfect" line above on the old version ended up at 13.244s which is accurate, and on v3 it stays on-screen over 2 seconds longer until 15.291s.
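One possible workaround for the lingering end times is to clamp each subtitle's end to a reading-speed estimate. This is purely a post-processing heuristic (the `chars_per_second` value is an assumption), not whisperX behavior:

```python
def cap_segment_ends(segments, chars_per_second=15.0, min_dur=1.0):
    """Clamp each subtitle's end time so a short line doesn't stay
    on-screen until the next segment starts. The cap is a crude
    reading-speed estimate: at most len(text)/chars_per_second seconds,
    but never less than min_dur."""
    capped = []
    for seg in segments:
        est = max(min_dur, len(seg["text"]) / chars_per_second)
        end = min(seg["end"], seg["start"] + est)
        capped.append({**seg, "end": round(end, 3)})
    return capped
```

On the "Perfect." example above (start 12.709, reported end 15.291), this caps the end at about 13.7 s, close to the old version's 13.244 s.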
I still have the same problem even in the newest version (tried on ja and zh). Everything comes out in huge chunks.
Python 3.10.11
whisperx --model medium --language ja --compute_type int8 filename.ext
The process goes from Performing transcription... right into Performing alignment... without displaying any timestamps.
Same here. V3.1. Sentences are too long...
Should be hopefully fixed here 24008aa
Sometimes nltk.sent_tokenize can create segments that are too short, but I found it's good overall. It also improves the diarization.
Thanks for the massive code improvement! After updating, I found the following:
File "D:\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "D:\Python\Python310\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "D:\Python\Python310\Scripts\whisperx.exe\__main__.py", line 7, in <module>
I am having the same problem with English as well.
Hi @seset
- For English and German, the segments are greatly improved, almost like the natural segments from original Whisper.
Did you fix it? We are facing the same issue but are not able to fix it...
You can also just write your own script for merging word-level timestamps into sentence-level timestamps; if you want, I can provide my script.
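For reference, a script along those lines might look like this (the word-dict keys are an assumption about the aligned output shape — adjust them to your whisperX version):

```python
import re

# Sentence-final punctuation, including CJK, optionally followed by closers.
SENT_END = re.compile(r'[.!?。！？]["\')\]]*$')

def words_to_sentences(words):
    """Merge word-level timestamps into sentence-level segments.
    Each word is assumed to be {"word", "start", "end"}."""
    sentences, current = [], []

    def flush():
        sentences.append({
            "text": " ".join(w["word"].strip() for w in current),
            "start": current[0]["start"],
            "end": current[-1]["end"],
        })

    for w in words:
        current.append(w)
        if SENT_END.search(w["word"].strip()):
            flush()
            current = []
    if current:  # trailing words without sentence-final punctuation
        flush()
    return sentences
```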
Add the following
--max_line_width 42 --max_line_count 2
How do I use this in code?
audio = whisperx.load_audio(audio_file)
print("end load audio")
result = model.transcribe(audio, batch_size=batch_size, max_line_width=42, max_line_count=2)
Is there any solution to this problem?
AttributeError: 'SpeakerDiarization' object has no attribute 'to'
I got a failure in here:
diarize_model = DiarizationPipeline(use_auth_token=HF_TOKEN, device=DEVICE)
Thanks!
@Omer-ler are you using the correct pyannote and whisperX versions? Try it in a clean environment maybe.
I've been tearing my hair out trying to figure out why this is happening. I'm getting the no attribute "to" error as well.
I've tried just using pyannote, I've tried with whisperx, and I've tried both in clean environments and without.
How do you install whisperx? Using a clean environment and running
pip install git+https://github.com/m-bain/whisperx.git
should work.
I'm still getting long segments in the latest, consisting of several sentences per segment rather than one sentence per segment. Is this expected? I've seen this with both English and French transcriptions.
Should be hopefully fixed here 24008aa
Sometimes nltk.sent_tokenize can create too short segments, but I found its good. Also improves the diarization
@sw5813 Hmmm if there is a full-stop then this isn't expected (the sentence tokenization should split up multiple sentences). Can you print some examples?
Overall there is a trade-off here: batch inference provides a big speedup but loses Whisper's shorter segment timestamps.
For the next big update I will try to add functionality to support both ASR backends, since for some people shorter segments matter more than speed, and 1 might be the better ASR backend for them.
Sure, here's one of the outputs I got as a result of the transcription (before the alignment step):
{'segments': [
  {'text': " qui permettent d'avoir un échantillon beaucoup plus vaste de patientes et d'être éventuellement plus représentatif de ce qu'on va pouvoir avoir finalement dans la vraie vie et avec les patientes qu'on va traiter. Donc c'est des choses qui sont parfaitement, qui peuvent se combiner, c'est deux types d'études totalement différentes.", 'start': 0.008, 'end': 15.483},
  {'text': " mais qui vont donner aussi des informations différentes. Donc les deux avec leurs avantages, leurs inconvénients. Donc les résultats pour one hundred and fifty three thousand six hundred femmes qui ont réalisé two hundred and forty five thousand five hundred and thirty four ovarian stimulation. Donc c'est vraiment représentatif de l'AMP française. Donc entre le premier janvier deux mille treize et le trente et un décembre deux mille dix-huit. Et l'âge moyen de ces femmes était de trente-quatre virgule zéro sept ans. Donc ce qui est tout à fait en rapport avec les pratiques.", 'start': 15.483, 'end': 45.47},
  {'text': " Le Système National des Données de Santé, ce fameux SNDS, est constitué des données de l'assurance-maladie, en fait, et exhaustif puisqu'on couvre, pardon, de la population au niveau de la France. Donc c'est quelque chose.", 'start': 45.47, 'end': 59.948}
], 'language': 'fr'}
FWIW I used the "suppress_numerals" setting which is why the numbers are written out, although I wonder if that may also be why there's some English that made its way into this French transcription...
After a few months of waiting, WhisperX is still the fastest and best!
My temp solution for the verbose-segment issue is below:
Step 1: install whisperx in editable mode:
$ git clone https://github.com/m-bain/whisperX.git
$ cd whisperX
$ pip install -e .
Step 2: fix the segment duration problem. Edit the line below in asr.py, changing 30 to 8 (I tried 5-10 seconds and the subtitle lengths were all acceptable). I suggest @m-bain add an argument for this...
vad_segments = merge_chunks(vad_segments, 30)
Step 3: use --no_align to fix the extra empty spaces when transcribing zh, ja or other languages, or edit transcribe.py to set it as the default, because I don't see that many line breaks when not using alignment; totally acceptable...
Hi @seset
- For English and German, the segments are greatly improved, almost like the natural segments from original Whisper.
Did you fix it? We are facing the same issue but are not able to fix it...
refer above...
After a few months of waiting, WhisperX is still the fastest and best!
My temp solution for the verbose-segment issue is below:
Step 1: install whisperx in editable mode:
$ git clone https://github.com/m-bain/whisperX.git
$ cd whisperX
$ pip install -e .
Step 2: fix the segment duration problem. Edit the line below in asr.py, changing 30 to 8 (I tried 5-10 seconds and the subtitle lengths were all acceptable). I suggest @m-bain add an argument for this...
vad_segments = merge_chunks(vad_segments, 30)
Step 3: use --no_align to fix the extra empty spaces when transcribing zh, ja or other languages, or edit transcribe.py to set it as the default, because I don't see that many line breaks when not using alignment; totally acceptable...
@seset The --chunk_size argument was added in #445. Please check whether this resolves the issue.
And the ja, zh space issue is also resolved by #248.
V3 is now incredibly fast, maybe dozens of times faster,
but now the subtitles of each paragraph are too long. Examples below:

1
00:00:00,730 --> 00:00:26,190
Are you nervous? It's a good nervous, a happy nervous. Yeah, Matt's an incredible man, and it's obvious he's very much in love with you. I know. He's a little bit nervous too, you know, but he's holding up great. I'm glad. Yeah. Okay, well, I'm gonna go downstairs and work on those decorations. If you need anything, let me know. Love you, Mom. Love you, sweetheart. Oh, thank you.

2
00:00:34,973 --> 00:01:02,883
Oh, Jesus Christ. I thought you'd never leave. You need to get out of here, Ben. Like, right now. She could have caught us. But she didn't. And we weren't doing anything anyway. And who cares even if we were? I care? Okay, I'm not falling for one of your I wanna be with you routines. You only want me because you can't have me. Yeah, and you're only resisting because you're too preoccupied with being a petty rule follower. I am not.