m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
BSD 2-Clause "Simplified" License

V3.1 is great, but sentences run on too long to actually use... #232

Open shruru opened 1 year ago

shruru commented 1 year ago

Accuracy is great and speed is super great. However, the SRT/txt output file is hardly usable as subtitles, since the generated sentences run far too long... Hope this gets fixed soon.

Thanks.

sorgfresser commented 1 year ago

Did you specify max_line_width and max_line_count? Try --max_line_width 42 --max_line_count 2

shruru commented 1 year ago

example: Test mp4 - 00 38 333

shruru commented 1 year ago

Did you specify max_line_width and max_line_count? Try --max_line_width 42 --max_line_count 2

Thanks, I will try it, but from what I assume, it's like a "hard break"?... Won't it make a mess of the punctuation, with the line broken in unwanted places? Just my thoughts.

sorgfresser commented 1 year ago

Give it a shot, @m-bain did some good work. Shouldn't be too bad.

shruru commented 1 year ago

Give it a shot, @m-bain did some good work. Shouldn't be too bad.

Hi @sorgfresser, I tried, but the result is as I said: it's just a "hard break". What I got is something like this; you can see the line breaks are in totally wrong places, so the subtitles are completely unreadable.

[screenshot: subtitles with line breaks in the wrong places]
sorgfresser commented 1 year ago

That's odd. There has been a commit recently (24008aa) that should address this. Are you using the newest version?

shruru commented 1 year ago

That's odd. There has been a commit recently (24008aa) that should address this. Are you using the newest version?

Yes, it's the latest v3.1, and it's not an actual solution yet: it limits the length of the line, but it also makes the sentences unreadable.

I know it's tough to get the line breaks "right". Perhaps we can borrow from old-school Adobe Premiere Pro's approach in its Speech-to-Text function. It isn't even AI-powered, yet it uses these 3 options to get the job done. Hope it helps.

https://helpx.adobe.com/premiere-pro/using/speech-to-text.html

[screenshot: Premiere Pro's caption line-break options]
chenlung commented 1 year ago

See guidelines at Netflix.

shruru commented 1 year ago

See guidelines at Netflix.

I don't get it. We know the rules, but we're not going to apply them manually.

chenlung commented 1 year ago

I didn't know whether you knew the rules (and it can also be for the benefit of others, since they weren't mentioned here). One thing that would help is not breaking up linguistic units.

m-bain commented 1 year ago

Give it a shot, @m-bain did some good work. Shouldn't be too bad.

Hi @sorgfresser, I tried, but the result is as I said: it's just a "hard break". What I got is something like this; you can see the line breaks are in totally wrong places, so the subtitles are completely unreadable.

Yes, the current setup is not perfect, because the VAD segments can cut up sentences, especially if there is a pause in the middle of a really long sentence.

E.g.

"This is an example of a really really long sentence whereby..."
"Someone takes a pause. Then keeps on speaking"

Currently the sentence tokenization assumes the entirety of a sentence lies within a single segment, but this is not always the case. We could try performing sentence tokenization across neighbouring segments (e.g. within some maximum gap of, say, 2 seconds).
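
A minimal sketch of that idea, assuming segments come as {"start", "end", "text"} dicts in the whisperX transcription style (merge_segments and MAX_GAP are illustrative names, not existing whisperX code):

import nltk  # requires the punkt data: nltk.download('punkt')

MAX_GAP = 2.0  # maximum pause (seconds) still treated as mid-sentence

def merge_segments(segments, max_gap=MAX_GAP):
    # Join neighbouring VAD segments separated by less than max_gap,
    # so a sentence that was cut by a pause becomes one text span again.
    merged = []
    for seg in segments:
        if merged and seg["start"] - merged[-1]["end"] <= max_gap:
            merged[-1]["end"] = seg["end"]
            merged[-1]["text"] = merged[-1]["text"].rstrip() + " " + seg["text"].lstrip()
        else:
            merged.append(dict(seg))
    return merged

segments = [
    {"start": 0.0, "end": 4.0, "text": "This is an example of a really really long sentence whereby"},
    {"start": 5.2, "end": 8.5, "text": "someone takes a pause. Then keeps on speaking."},
]

# Tokenize sentences across the merged text instead of per VAD segment.
for seg in merge_segments(segments):
    for sent in nltk.sent_tokenize(seg["text"]):
        print(f"[{seg['start']:.1f}-{seg['end']:.1f}] {sent}")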

gvonkreisler commented 1 year ago

Dear m-bain,

I can't understand why everything that worked well in v2 is now broken, just to make everything faster... For translation (subtitles), precision is everything! And on top of that, the speed-up only works if you have a very new GPU; my K80s and my 1080 can't work with int8 or 16-bit. So there's no speed-up for the average user at all.

Sorry. George

oep42 commented 1 year ago

Is it still possible to install v2?

If so, how?

rockmor commented 1 year ago

@m-bain I agree that right now v3 is pretty much unusable and needs to be worked on a bit more. I suggest adding an option to README.md to install the old v2 as it was before the update, for now. (Btw, when I try to install it with pip install git+https://github.com/m-bain/whisperx.git@v2.0.1, it installs with errors and doesn't work.)

Sorry for all the complaining and thank you for your hard work! :) I think what you've done so far is incredible and hope all the problems will be solved.

rockmor commented 1 year ago

For those interested, I somehow managed to make v2 work. First, I uninstalled the existing whisperx:

pip uninstall whisperx

Then:

pip install git+https://github.com/m-bain/whisperx.git@v2.0.1

At this point it didn't work, and I was getting this error:

ModuleNotFoundError: No module named 'pyannote.audio'

So I ran pip uninstall torchaudio, then pip install torchaudio==0.13.1. After that, whisperx started to work, but only on the CPU (so, extremely slowly). I then uninstalled torch and reinstalled it with:

pip3 install torch==1.13.1 --index-url https://download.pytorch.org/whl/cu117

rockmor commented 1 year ago

The issue persists in v3.1.1.

[screenshot: the issue reproduced in v3.1.1]

ardha27 commented 1 year ago

For those interested, I somehow managed to make v2 work. [...]

Thank you, I already implemented it in my Colab: https://github.com/ardha27/WhisperX-Youtube-SRT

shruru commented 1 year ago

For those interested, I somehow managed to make v2 work. [...]

Is it possible to contact you for more help with the downgrade? Thank you.

rockmor commented 1 year ago

@shruru Added email to the profile.

azhitian commented 1 year ago

I'm still dealing with this issue, and with the spaces-between-every-character issue (for Chinese; mentioned for Japanese in #248). With the current version, lines in the SRT file are way too long, and the nltk sentence tokenizer doesn't seem great at breaking up Chinese (or some information from the original transcription is lost somehow, as utterances from separate speakers are often treated as part of the same sentence).

The --max_line_width/--max_line_count arguments seem to just produce additional entries in the SRT file with the same timestamp, which is not good enough for my use case. I need shorter lines where the beginning and end of each line are actually aligned.
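
To illustrate with made-up entries (not actual whisperX output): the two wrapped halves of one long line come out as consecutive SRT entries sharing the identical time span, so neither half is individually timed:

12
00:01:05,000 --> 00:01:09,000
first half of the long sentence

13
00:01:05,000 --> 00:01:09,000
second half of the long sentence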

My current ugly workaround is to just split sentences over 30 characters right after sentences are tokenized:

import nltk  # Punkt tokenizer; requires nltk.download('punkt')

# `text` is the segment text, as in whisperX's alignment code.
sentence_spans = list(nltk.tokenize.punkt.PunktSentenceTokenizer().span_tokenize(text))

MAX_SENTENCE_LENGTH = 30

# Spans are half-open (start, end) character offsets into `text`; chop any
# span longer than MAX_SENTENCE_LENGTH into fixed-size chunks.
new_spans = []
for left, right in sentence_spans:
    while right - left > MAX_SENTENCE_LENGTH:
        new_spans.append((left, left + MAX_SENTENCE_LENGTH))
        left += MAX_SENTENCE_LENGTH
    new_spans.append((left, right))
sentence_spans = new_spans

A better approach to making shorter lines would probably be splitting at gaps between characters over a fixed threshold (maybe 2 seconds), but I haven't dug into the code enough to figure out how to do that yet, and I think it still wouldn't reliably produce line breaks that make sense semantically, as v2 nicely did (I assume by using the timecodes from the actual Whisper run).
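
For what it's worth, here is a rough sketch of that gap-based split, assuming the aligned segment carries per-word timings in the whisperX style ({"word", "start", "end"} dicts); GAP_THRESHOLD and split_on_gaps are my own illustrative names:

GAP_THRESHOLD = 2.0  # a pause longer than this starts a new line

def split_on_gaps(words, gap=GAP_THRESHOLD):
    # words: [{"word": str, "start": float, "end": float}, ...] as produced
    # by alignment; assumes every word has timings (some may lack them).
    chunks, current = [], []
    for w in words:
        if current and w["start"] - current[-1]["end"] > gap:
            chunks.append(current)
            current = []
        current.append(w)
    if current:
        chunks.append(current)
    return chunks

segment = {"words": [
    {"word": "你", "start": 0.00, "end": 0.20},
    {"word": "好", "start": 0.22, "end": 0.40},
    {"word": "再", "start": 3.10, "end": 3.30},
    {"word": "见", "start": 3.32, "end": 3.50},
]}

# Join without spaces, which also sidesteps the Chinese spacing issue.
for chunk in split_on_gaps(segment["words"]):
    line = "".join(w["word"] for w in chunk)
    print(f"{chunk[0]['start']:.2f} --> {chunk[-1]['end']:.2f}  {line}")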

patdelphi commented 1 year ago

Please refer to the project https://github.com/Softcatala/whisper-ctranslate2; its segments are perfect.

sorgfresser commented 1 year ago

@m-bain I'm pretty sure whisper-ctranslate2 violates your BSD license, btw. Their writers.py matches your utils.py in every way, and I was unable to find the required notice:

This product includes software developed by Max Bain.

sorgfresser commented 1 year ago

@patdelphi what do you mean by perfect? The length of the segments could be better, that's true. Regarding the alignment, it should be more off than whisperX's, since it uses Whisper's internal timestamps rather than Wav2Vec2 to force-align the transcription afterwards.

rohitkrishna094 commented 1 year ago

Did you specify max_line_width and max_line_count? Try --max_line_width 42 --max_line_count 2

@sorgfresser just curious, how do I find out all the command-line options I can pass? Like, where/how did you know that whisperx takes command-line arguments called max_line_width or max_line_count?

sorgfresser commented 1 year ago

@rohitkrishna094 take a look at the transcribe.py in the whisperx directory. They are listed there.

bjcodereview3 commented 1 year ago

Did you specify max_line_width and max_line_count? Try --max_line_width 42 --max_line_count 2

@sorgfresser just curious, how do I find out all the command-line options I can pass? Like, where/how did you know that whisperx takes command-line arguments called max_line_width or max_line_count?

whisperx -h

jim60105 commented 1 year ago

@shruru A new argument --chunk_size was added in #445. Please check whether it resolves your issue.

ipeevski commented 7 months ago

@jim60105 It helps (when reducing the chunk_size value), but it doesn't resolve the problem.

And I assume that if you make the chunk_size value too small, it will make the transcription worse too?

It seems like it's not processing some parts properly: punctuation and capitalization are inconsistent within the same file (sometimes i/I is capitalized, sometimes there are full stops and sometimes not, etc.).

jimi202008 commented 4 months ago

Use --chunk_size 10. The default settings for whisperx are terrible; it's best to change them. See whisperx -h for instructions.
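
For reference, a full command combining the options discussed in this thread might look like the following (the audio file name and model choice are placeholders; tune the numbers to your material):

whisperx audio.mp3 --model large-v3 --chunk_size 10 --max_line_width 42 --max_line_count 2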

sempiternoiddqd commented 3 months ago

I'm working with --chunk_size 5 and IT'S PERFECT! I updated to the latest version with large-v3: no more 5-line subtitles.

foolishgrunt commented 4 days ago

I agree that the best workaround at this time is providing an appropriate value for --chunk_size. However, in my case this often leaves "orphaned" words (or short phrases) in their own segment when they clearly belong to either the segment before or after. I have to manually comb through the subtitles afterwards and correct these instances.

Also, relying on --chunk_size does not adequately account for times when the speaker speeds up, making a previously sane chunk size suddenly too long. In a few instances I've ended up with 80-100 characters in a single segment.

@sorgfresser previously suggested --max_line_width 42 --max_line_count 2. However, when I use those options, pieces of a segment that would otherwise exceed the character limit are sometimes shifted to a shorter neighbouring segment... without adjusting the timestamps of either segment. The results are unusable (unlike the results with --chunk_size, which are merely messy at times).

I would love to see the capabilities of --max_line_width and --max_line_count further refined, as that seems like it would be the ideal solution.