Open xrishox opened 1 year ago
I always use --max_words to control the length of each caption.
On Thu., Nov. 9, 2023, 01:12, xrishox wrote:
Using stable-ts with large-v3 and demucs is fairly regularly producing minute-long, 4-5 line blocks of speech
[image: image] https://user-images.githubusercontent.com/23132250/281613286-2e9b413e-d966-46c0-a4a7-40a0c5371c37.png
whereas v2 produced text strings with far more reasonable timing and length.
[image: image] https://user-images.githubusercontent.com/23132250/281612641-44d2b546-55c7-4e87-b651-a08fd998016b.png
I've noticed this across lots of different videos and clips. Does anyone know what I can do to fix this?
This might just be a v3 thing or a Japanese thing, but --max_words doesn't work as you'd expect: it doesn't appear to distinguish a word from a character, so it assigns breaks in the middle of words. The other problem is that when I set --max_words or --max_chars, it completely breaks the subtitle timing. For example, if a 5-line block of text covers 30 seconds, the first half might cover 10 seconds, then there's a 10-second gap, and then the second half covers the last 10 seconds. The subtitles randomly get broken up, so they display when nobody is talking, or linger over audio that has already been spoken.
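To illustrate the mid-word breaks (this is a hypothetical sketch, not stable-ts's actual splitting logic): Japanese has no spaces to mark word boundaries, so any splitter that falls back to counting characters will happily cut inside a word:

```python
# Hypothetical illustration of why a character-count splitter breaks
# Japanese mid-word. Japanese text has no spaces, so code that falls
# back to "every N characters" when it can't find word boundaries
# will cut words apart. This is NOT stable-ts's implementation.

def naive_split(text: str, max_chars: int) -> list[str]:
    """Split text into chunks of at most max_chars characters."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

jp = "私は日本語を勉強しています"  # "I am studying Japanese"
print(naive_split(jp, 5))  # ['私は日本語', 'を勉強して', 'います']
# The verb 勉強しています gets split across two captions: 勉強して / います.
```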
Try large-v3 with model.transcribe_minimal(). If it behaves the same way, it is likely a performance issue with the model itself; otherwise it might be a bug. If you can share the audio clip along with the results you got for it as JSON, it will make it easier to pinpoint the cause.
Wow, setting transcribe_minimal actually produces WAY better results. default: [image] minimal: [image] output.zip
I uploaded a zip file. It has the SRT and the JSON for a video file under both the default settings and transcribe_minimal, and transcribe_minimal is far better. The default has multiple 3+ line outputs, whereas all of the minimal output is 2 lines or fewer. Both are using large-v3. I can't tell if the JSON is supposed to look like that, or if that's a consequence of it encoding the Japanese as Unicode escapes or something.
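On the JSON question: if the Japanese shows up as \uXXXX sequences, that's just standard JSON escaping of non-ASCII text (Python's json module does it by default), and the text round-trips losslessly either way. A quick stdlib check:

```python
# Shows why Japanese text in a JSON file can appear as \uXXXX escape
# sequences: json.dumps escapes non-ASCII by default (ensure_ascii=True),
# and both forms decode back to the identical original string.
import json

text = "日本語"
escaped = json.dumps(text)                  # default: ensure_ascii=True
raw = json.dumps(text, ensure_ascii=False)  # keep the characters as-is

print(escaped)  # "\u65e5\u672c\u8a9e"
print(raw)      # "日本語"

assert json.loads(escaped) == text
assert json.loads(raw) == text
```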
You can see an example of the default output being incredibly long, exactly 1 minute, and then compare it to minimal, which looks normal.
After testing it a bit more, it seems like minimal v3 is ignoring/missing a lot of lines that regular v2 would pick up. I imagine that's because minimal v3 skips all of the demucs etc. preprocessing.
@xrishox, curious what you're working on?
Trying to learn Japanese. Having subtitles to read helps quite a bit.
I've basically decided to just make multiple sets of subs for each file: a minimal v2, a minimal v3, and then a non-minimal v2 with all the VAD and demucs stuff. That way, between the three of them, at least one will usually have a reasonably high degree of accuracy.
I've found minimal v2 to be the most accurate so far.
@xrishox would you be willing to share the settings you use to get your results?
I'm messing around with the settings myself at the moment (for Japanese also) and have come up with the following:
stable-ts audio.mp3 --model large-v2 --language Japanese --fp16=False --word_level=False --verbose 2 --output_format srt --transcribe_method transcribe_minimal
Does this look something like what you are doing too? The results are the best I've had yet, but I'm just curious in case I'm missing something!
find . -type f -name "*.mkv" -exec sh -c 'if [ ! -f "${1%.mkv}.ko.json" ]; then stable-ts "$1" -o "${1%.mkv}.ko.json" --model large-v2 --language ja --transcribe_method transcribe_minimal; fi' _ {} \; &&
find . -type f -name "*.mkv" -exec sh -c 'if [ ! -f "${1%.mkv}.es.json" ]; then stable-ts "$1" -o "${1%.mkv}.es.json" --model large-v3 --language ja --transcribe_method transcribe_minimal; fi' _ {} \; &&
find . -type f -name "*.ko.json" -exec sh -c 'if [ ! -f "${1%.ko.json}.ko.srt" ]; then stable-ts "$1" -o "${1%.ko.json}.ko.srt" --word_level false ; fi' _ {} \; &&
find . -type f -name "*.ko.json" -exec sh -c 'if [ ! -f "${1%.ko.json}.zh.srt" ]; then stable-ts "$1" -o "${1%.ko.json}.zh.srt" --word_level true ; fi' _ {} \; &&
find . -type f -name "*.es.json" -exec sh -c 'if [ ! -f "${1%.es.json}.pt.srt" ]; then stable-ts "$1" -o "${1%.es.json}.pt.srt" --word_level true ; fi' _ {} \; &&
find . -type f -name "*.es.json" -exec sh -c 'if [ ! -f "${1%.es.json}.es.srt" ]; then stable-ts "$1" -o "${1%.es.json}.es.srt" --word_level false ; fi' _ {} \; &&
find . -type f -name "*.mkv" -exec sh -c 'if [ ! -f "${1%.mkv}.it.srt" ]; then stable-ts "$1" -o "${1%.mkv}.it.srt" --model large-v2 --refine --language ja --demucs true --word_level false; fi' _ {} \;
is what I run. It searches for MKV files, does a transcribe_minimal with both large-v2 and large-v3, and outputs the subtitles with word_level true and word_level false for each (only if the subtitles don't already exist; if they do, it skips that file). The 5th command then does a large-v2 pass with refine and demucs. You should be able to modify this from mkv to mp3 or whatever else.
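For anyone who would rather not chain find commands, the JSON-generating stage of the pipeline above can be sketched in Python. This is a hypothetical rewrite, not the original script: it assumes the stable-ts CLI is installed and on PATH, and it reuses the same suffix-labelling scheme as the shell version.

```python
# Hypothetical Python rewrite of the first stage of the find/stable-ts
# pipeline above (JSON generation only). Assumes the `stable-ts` CLI is
# on PATH; the suffix labels (.ko, .es) mirror the shell version.
import subprocess
from pathlib import Path

def output_path(src: Path, label: str, ext: str) -> Path:
    """show.mkv + ('ko', 'json') -> show.ko.json, like ${1%.mkv}.ko.json."""
    return src.with_suffix(f".{label}.{ext}")

def run_if_missing(src: Path, out: Path, extra_args: list[str]) -> None:
    """Invoke stable-ts only when the output file doesn't exist yet."""
    if out.exists():
        return
    subprocess.run(["stable-ts", str(src), "-o", str(out), *extra_args],
                   check=True)

def process(root: Path) -> None:
    """Transcribe every .mkv under root with both models, skipping done files."""
    for mkv in root.rglob("*.mkv"):
        for label, model in (("ko", "large-v2"), ("es", "large-v3")):
            run_if_missing(mkv, output_path(mkv, label, "json"),
                           ["--model", model, "--language", "ja",
                            "--transcribe_method", "transcribe_minimal"])

# Usage: process(Path("."))
```

The SRT-conversion commands could be added the same way with further run_if_missing calls on the generated JSON files.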
Wow, that's way more than I'm capable of doing! :O Is that all a command-line prompt? I'm kind of scared to even touch that. The main commands are understandable though. Thanks for sharing!
Chinese generation has the same problem: large-v2 produces better results than large-v3 on the long-text issue.
What is "minimal v2"?
Another example. v3: [image] v2: [image]