Open seset opened 1 year ago
Add the following
--max_line_width 42 --max_line_count 2
Add the following
--max_line_width 42 --max_line_count 2
Thanks, it helps somewhat. But when I try a Japanese or Chinese source file, there are a lot of repetitive words/letters/characters in the SRT and VTT output, while the JSON and TXT content are correct. Below is a Chinese example mixed with English to make it easy to see what went wrong. Since the JSON file is correct, hopefully this minor error can be fixed soon :)
Correct content (JSON and TXT):
[JSON] [segments] [0] [start : 0.6180930656934307] [end : 22.254860401459855] [text : "如何巧妙的去使用多个不同的ControlNet并用到Stable Diffusion的一些最新插件去精准的控制画面 生成一些这样的图片我研究了一整个星期 尝试了MultiControlNet不同的排列组合调整了很多不同的参数 把坑都踩过了一遍给你们整理了12种高阶用法的合计和所需要的所有参数"] [words]
Wrong and repetitive (SRT, VTT):
1
00:00:00,618 --> 00:00:22,235
如何 何巧 巧妙 妙的 的去 去使 使用 用多 多个 个不 不同 同的 的C Co on nt tr ro ol lN Ne et t并 并用 用到 到S St ta ab bl le e Di if ff fu us si io on n的 的一 一些 些最 最新 新插 插件 件去 去精 精准 准的 的控 控制 制画 画面 面 生成 成一 一些 些这 这样 样的 的图 图片 片我 我研 研究 究了 了一 一整 整个 个星 星期 期 尝试 试了 了M Mu ul lt ti iC Co on nt tr ro ol lN Ne et t不 不同 同的 的排 排列 列组 组合 合调 调整 整了 了很 很多 多不 不同 同的 的参 参数 数 把坑 坑都 都踩 踩过 过了 了一 一遍 遍给 给你 你们 们整 整理 理了 了12 2种高 高阶 阶用 用法 法的 的合 合计 计和 和所 所需 需要 要的 的所 所有 有参 参数 数
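For anyone hitting this before a proper fix lands: the broken output above is a mechanical overlapping-bigram pattern (each token repeats the last character of the previous one), so it can be undone after the fact. This is a hypothetical post-processing sketch, not part of whisperX:

```python
def collapse_bigrams(line):
    """Collapse the overlapping-bigram duplication seen in the broken
    SRT/VTT output: each space-separated token starts with the previous
    token's last character, so keep only the non-overlapping tail."""
    tokens = line.split()
    if not tokens:
        return ""
    out = tokens[0]
    for tok in tokens[1:]:
        if tok.startswith(out[-1]):
            out += tok[1:]  # drop the duplicated leading character
        else:
            out += tok      # no overlap, just append
    return out
```

You would run this over each cue's text line; the JSON/TXT output is already correct, so this only matters for the subtitle writers.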
Add the following
--max_line_width 42 --max_line_count 2
this works for me, thanks :pray:
Although these options help somewhat, I'd definitely say v3 produces worse subtitle formatting than the older version: it often breaks single words off sentences (there's no obvious way to find the perfect max line width) and joins unrelated chunks that just worked before. Do you see any way to improve this? Here's an example:
Old version:
~~ Transcribing VAD chunk: (00:05.324 --> 00:34.045) ~~
[00:00.000 --> 00:01.840] Ik heb jullie zeer. Spreek voor jezelf, hè.
[00:01.920 --> 00:03.880] Attention, please. This is Lancelot.
[00:03.960 --> 00:06.160] Clap, clap. Switch seats.
[00:06.240 --> 00:07.920] Keep quets. Perfect.
[00:09.800 --> 00:12.680] Zoek ik al een boek over België. Ik ben de weg kwijt.
[00:17.080 --> 00:19.480] Ik ben het de man.
[00:19.560 --> 00:21.080] Je ziet het rare, jongen.
[00:23.360 --> 00:24.480] Hallo. Hallo.
[00:24.560 --> 00:25.840] Jullie zijn met een bus aan het rijden.
[00:25.920 --> 00:26.640] Kijk, ja.
[00:26.720 --> 00:28.120] Op die bus in een boom.
[00:28.200 --> 00:29.200] Wat?
New version:
0:00:05.40,0:00:06.82: Ik heb jullie zeer. Spreek voor jezelf,
0:00:06.84,0:00:09.28: hè. Attention, please. This is Lancelot.
0:00:09.34,0:00:12.66: Clap, clap. Switch seats. Keep quets.
0:00:12.70,0:00:16.73: Perfect. Zoek ik al een boek over België.
0:00:16.81,0:00:22.51: Ik ben de weg kwijt.
0:00:22.55,0:00:25.71: Ik ben het de man. Je ziet het raar,
0:00:25.75,0:00:29.36: jongen. Hallo.
0:00:29.40,0:00:31.30: Hallo. Jullie zijn met een bus aan het
0:00:31.44,0:00:34.00: rijden? Ja. Op die bus in een boom? Wat?
Yes, thanks for reporting — I found similar. Unfortunately the natural segments from Whisper cannot be extracted in the current batched method.
Definitely the logic for post-processing the 30s chunks into segments needs to be improved. I would suggest the following:
Using the nltk toolbox's nltk.sent_tokenize, tokenize the text into sentences (create a segment for each sentence). Unfortunately my spare time is going into improving diarization right now, but feel free to send a pull request with improvements to this logic.
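A minimal sketch of that suggestion (a hypothetical helper, not whisperX API). A naive regex splitter stands in for nltk.sent_tokenize, which would be a drop-in swap; timestamps are apportioned by character share, which is only a rough heuristic:

```python
import re

def split_segment(segment):
    """Split one long segment into per-sentence segments.
    `segment` is assumed to be {"text", "start", "end"}; timestamps are
    distributed across sentences proportionally to sentence length."""
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', segment["text"].strip()) if s]
    total = sum(len(s) for s in sentences) or 1
    t, dur = segment["start"], segment["end"] - segment["start"]
    out = []
    for s in sentences:
        step = dur * len(s) / total
        out.append({"text": s, "start": round(t, 3), "end": round(t + step, 3)})
        t += step
    return out
```

With word-level timestamps available (as after whisperX alignment), snapping sentence boundaries to actual word times would be more accurate than this proportional split.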
Anyone working on a better segmentation right now? Otherwise I'd take a look at it.
So actually I'm on it at the moment, as I need sentence segments for diarization (the alignment logic also needed cleaning up); will push in an hour or so.
Should be hopefully fixed here https://github.com/m-bain/whisperX/commit/24008aa1ed67c4f75c90107b4937178a1452519d
Sometimes nltk.sent_tokenize can create segments that are too short, but I found it's good overall. It also improves the diarization.
Thanks a lot @m-bain, it's much improved. No broken sentences, though I do see the short segments appearing. That could probably be fixed by combining those segments in a post-processing step; it doesn't necessarily need to be in WhisperX.
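Such a merging pass could be sketched like this (illustrative thresholds, not part of WhisperX):

```python
def merge_short_segments(segments, min_dur=1.0, max_gap=2.0, max_chars=84):
    """Fold segments shorter than `min_dur` seconds into the previous
    segment, but only when they start within `max_gap` seconds of it and
    the merged text stays under `max_chars` (all thresholds are guesses
    to tune, not values from WhisperX)."""
    merged = []
    for seg in segments:
        short = seg["end"] - seg["start"] < min_dur
        if (merged and short
                and seg["start"] - merged[-1]["end"] <= max_gap
                and len(merged[-1]["text"]) + len(seg["text"]) + 1 <= max_chars):
            merged[-1]["text"] += " " + seg["text"]
            merged[-1]["end"] = seg["end"]
        else:
            merged.append(dict(seg))
    return merged
```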
Here are the results of the same file on the new version:
00:00:05,404 --> 00:00:06,165 Ik heb jullie zeer.
00:00:06,165 --> 00:00:06,985 Spreek voor jezelf, hè.
00:00:06,985 --> 00:00:08,246 Attention, please.
00:00:08,246 --> 00:00:09,347 This is Lancelot.
00:00:09,347 --> 00:00:10,228 Clap, clap.
00:00:10,228 --> 00:00:11,689 Switch seats.
00:00:11,689 --> 00:00:12,709 Keep quets.
00:00:12,709 --> 00:00:15,291 Perfect.
00:00:15,291 --> 00:00:16,812 Zoek ik al een boek over België.
00:00:16,812 --> 00:00:22,557 Ik ben de weg kwijt.
00:00:22,557 --> 00:00:24,978 Ik ben het de man.
00:00:24,978 --> 00:00:26,139 Je ziet het raar, jongen.
00:00:26,139 --> 00:00:29,402 Hallo.
00:00:29,402 --> 00:00:29,782 Hallo.
00:00:29,782 --> 00:00:31,803 Jullie zijn met een bus aan het rijden?
00:00:31,803 --> 00:00:32,064 Ja.
00:00:32,064 --> 00:00:33,685 Op die bus in een boom?
00:00:33,685 --> 00:00:34,005 Wat?
The biggest issue I see now is that each subtitle's end time appears to be the start of the next one even when this isn't accurate.
eg. The "Perfect" line above on the old version ended up at 13.244s which is accurate, and on v3 it stays on-screen over 2 seconds longer until 15.291s.
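One possible workaround for the lingering end times is to clamp each subtitle's end to a reading-speed estimate. This is purely a post-processing heuristic (the `chars_per_second` value is an assumption), not whisperX behavior:

```python
def cap_segment_ends(segments, chars_per_second=15.0, min_dur=1.0):
    """Clamp each subtitle's end time so a short line doesn't stay
    on-screen until the next segment starts. The cap is a crude
    reading-speed estimate: at most len(text)/chars_per_second seconds,
    but never less than min_dur."""
    capped = []
    for seg in segments:
        est = max(min_dur, len(seg["text"]) / chars_per_second)
        end = min(seg["end"], seg["start"] + est)
        capped.append({**seg, "end": round(end, 3)})
    return capped
```

On the "Perfect." example above (start 12.709, reported end 15.291), this caps the end at about 13.7 s, close to the old version's 13.244 s.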
I still have the same problem even in the newest version (tried on ja and zh). Everything comes out in huge chunks.
Python 3.10.11
whisperx --model medium --language ja --compute_type int8 filename.ext
The process goes from Performing transcription... right into Performing alignment... without displaying any timestamps.
Same here. V3.1. Sentences are too long...
Should be hopefully fixed here 24008aa
Sometimes nltk.sent_tokenize can create segments that are too short, but I found it's good overall. It also improves the diarization.
Thanks for the massive code improvement! After updating, I found the following:
File "D:\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "D:\Python\Python310\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "D:\Python\Python310\Scripts\whisperx.exe\__main__.py", line 7, in <module>
I am having the same problem with English as well.
Hi @seset
- For English and German, the segments are greatly improved, almost like the natural segments from original Whisper.
Did you fix it? We are facing the same issue but are not able to fix it...
You can also just write your own script for merging word-level timestamps into sentence-level timestamps; if you want, I can provide my script.
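For reference, a script along those lines might look like this (the word-dict keys are an assumption about the aligned output shape — adjust them to your whisperX version):

```python
import re

# Sentence-final punctuation, including CJK, optionally followed by closers.
SENT_END = re.compile(r'[.!?。！？]["\')\]]*$')

def words_to_sentences(words):
    """Merge word-level timestamps into sentence-level segments.
    Each word is assumed to be {"word", "start", "end"}."""
    sentences, current = [], []

    def flush():
        sentences.append({
            "text": " ".join(w["word"].strip() for w in current),
            "start": current[0]["start"],
            "end": current[-1]["end"],
        })

    for w in words:
        current.append(w)
        if SENT_END.search(w["word"].strip()):
            flush()
            current = []
    if current:  # trailing words without sentence-final punctuation
        flush()
    return sentences
```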
Add the following
--max_line_width 42 --max_line_count 2
How do I use this in code?
audio = whisperx.load_audio(audio_file)
print("end load audio")
result = model.transcribe(audio, batch_size=batch_size, max_line_width=42, max_line_count=2)
Is there any solution to this problem?
AttributeError: 'SpeakerDiarization' object has no attribute 'to'
I got a failure in here:
diarize_model = DiarizationPipeline(use_auth_token=HF_TOKEN, device=DEVICE)
Thanks!
@Omer-ler are you using the correct pyannote and whisperX versions? Try it in a clean environment maybe.
I've been tearing my hair out trying to figure out why this is happening. I'm getting the no attribute "to" error as well.
I've tried just using pyannote, I've tried with whisperx, and I've tried both in clean environments and without.
How do you install whisperx? Using a clean environment and running
pip install git+https://github.com/m-bain/whisperx.git
should work.
I'm still getting long segments in the latest, consisting of several sentences per segment rather than one sentence per segment. Is this expected? I've seen this with both English and French transcriptions.
Should be hopefully fixed here 24008aa
Sometimes nltk.sent_tokenize can create too short segments, but I found its good. Also improves the diarization
@sw5813 Hmmm if there is a full-stop then this isn't expected (the sentence tokenization should split up multiple sentences). Can you print some examples?
Overall there is a trade-off here: batch inference provides a big speedup but loses Whisper's shorter segment timestamps.
For the next big update I will try to add functionality to support both ASR backends, since for some people shorter segments matter more than speed, and 1 might be the better ASR backend for them.
Sure, here's one of the outputs I got as a result of the transcription (before the alignment step):
{'segments': [
  {'text': " qui permettent d'avoir un échantillon beaucoup plus vaste de patientes et d'être éventuellement plus représentatif de ce qu'on va pouvoir avoir finalement dans la vraie vie et avec les patientes qu'on va traiter. Donc c'est des choses qui sont parfaitement, qui peuvent se combiner, c'est deux types d'études totalement différentes.", 'start': 0.008, 'end': 15.483},
  {'text': " mais qui vont donner aussi des informations différentes. Donc les deux avec leurs avantages, leurs inconvénients. Donc les résultats pour one hundred and fifty three thousand six hundred femmes qui ont réalisé two hundred and forty five thousand five hundred and thirty four ovarian stimulation. Donc c'est vraiment représentatif de l'AMP française. Donc entre le premier janvier deux mille treize et le trente et un décembre deux mille dix-huit. Et l'âge moyen de ces femmes était de trente-quatre virgule zéro sept ans. Donc ce qui est tout à fait en rapport avec les pratiques.", 'start': 15.483, 'end': 45.47},
  {'text': " Le Système National des Données de Santé, ce fameux SNDS, est constitué des données de l'assurance-maladie, en fait, et exhaustif puisqu'on couvre, pardon, de la population au niveau de la France. Donc c'est quelque chose.", 'start': 45.47, 'end': 59.948}
], 'language': 'fr'}
FWIW I used the "suppress_numerals" setting which is why the numbers are written out, although I wonder if that may also be why there's some English that made its way into this French transcription...
After a few months of waiting, WhisperX is still the fastest and best!
My temp solution for the verbose-segment issue is below:
Step 1: install whisperx in editable mode:
$ git clone https://github.com/m-bain/whisperX.git
$ cd whisperX
$ pip install -e .
Step 2: fix the segment duration problem. Edit the line below in asr.py, changing 30 to 8 (I tried 5-10 seconds and the subtitle lengths were all acceptable). I suggest @m-bain add an argument for this...
vad_segments = merge_chunks(vad_segments, 30)
Step 3: use --no_align to fix the extra empty spaces when transcribing zh, ja or other languages, or edit transcribe.py to set it as the default, because I don't see that many line breaks when not using alignment; totally acceptable...
Hi @seset
- For English and German, the segments are greatly improved, almost like the natural segments from original Whisper.
Did you fix it? We are facing the same issue but are not able to fix it...
refer above...
After a few months of waiting, WhisperX is still the fastest and best!
My temp solution for the verbose-segment issue is below:
Step 1: install whisperx in editable mode:
$ git clone https://github.com/m-bain/whisperX.git
$ cd whisperX
$ pip install -e .
Step 2: fix the segment duration problem. Edit the line below in asr.py, changing 30 to 8 (I tried 5-10 seconds and the subtitle lengths were all acceptable). I suggest @m-bain add an argument for this...
vad_segments = merge_chunks(vad_segments, 30)
Step 3: use --no_align to fix the extra empty spaces when transcribing zh, ja or other languages, or edit transcribe.py to set it as the default, because I don't see that many line breaks when not using alignment; totally acceptable...
@seset The --chunk_size argument was added in #445. Please check whether this resolves the issue.
And the ja, zh space issue is also resolved by #248.
V3 is now incredibly fast, maybe dozens of times faster,
but now the subtitles of each paragraph are too long. Examples below:

1
00:00:00,730 --> 00:00:26,190
Are you nervous? It's a good nervous, a happy nervous. Yeah, Matt's an incredible man, and it's obvious he's very much in love with you. I know. He's a little bit nervous too, you know, but he's holding up great. I'm glad. Yeah. Okay, well, I'm gonna go downstairs and work on those decorations. If you need anything, let me know. Love you, Mom. Love you, sweetheart. Oh, thank you.

2
00:00:34,973 --> 00:01:02,883
Oh, Jesus Christ. I thought you'd never leave. You need to get out of here, Ben. Like, right now. She could have caught us. But she didn't. And we weren't doing anything anyway. And who cares even if we were? I care? Okay, I'm not falling for one of your I wanna be with you routines. You only want me because you can't have me. Yeah, and you're only resisting because you're too preoccupied with being a petty rule follower. I am not.