jianfch / stable-ts

Transcription, forced alignment, and audio indexing with OpenAI's Whisper
MIT License

Real world test #288

Closed ls-milkyway closed 10 months ago

ls-milkyway commented 10 months ago

I have tested the video https://www.youtube.com/watch?v=1NfFIpZocWs with the following models and commands:

1) Large-v3

2) Medium model with the following command, since the dialogue and music are somewhat mixed together (a rough Python equivalent is sketched after this list): stable-ts test.mp4 --model medium --model_dir D:***els\ --output testmedium.srt --language Japanese --vad True --vad_threshold 0.35 --demucs True --refine

3) Medium model using just: stable-ts test.mp4 --model medium --model_dir D:***els\ --output testmedium.srt --language Japanese
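
For reference, command no. 2 maps roughly onto the Python API like this (a minimal sketch assuming a current stable-ts release; the file names are placeholders, and the redacted model directory is left out):

```python
# Rough Python equivalent of command no. 2 (a sketch, not the exact CLI internals).
import stable_whisper

model = stable_whisper.load_model('medium')

# vad/vad_threshold use Silero VAD to suppress timestamps in non-speech regions;
# demucs separates vocals from the background music before transcription.
result = model.transcribe(
    'test.mp4',
    language='ja',
    vad=True,
    vad_threshold=0.35,
    demucs=True,
)

# --refine corresponds to a second pass that refines the word timestamps.
result = model.refine('test.mp4', result)

result.to_srt_vtt('testmedium.srt')
```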

The results: A) Large-v3 wins (not only in quality, it also has the best sync). B) Second place, strangely, goes to command no. 3 (plain Medium gives the better output). C) Last is option 2, even though that command takes the most time.

The transcriptions were done in the native language, i.e. Japanese, and then translated to English for comparison using Google API Version 1.
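
The translation step was comparable to the following (an illustrative sketch only: the post says "Google API Version 1", so the exact client is an assumption; this uses the google-cloud-translate v2 client and simplifies the SRT handling):

```python
# Hypothetical sketch of the translate-for-comparison step; the actual API
# version used in the test ("Google API Version 1") is not specified, so the
# google-cloud-translate v2 client here is an assumption.
from google.cloud import translate_v2 as translate

client = translate.Client()  # requires GOOGLE_APPLICATION_CREDENTIALS to be set

def translate_subtitle_lines(lines):
    """Translate a list of Japanese subtitle lines to English."""
    results = client.translate(lines, source_language='ja', target_language='en')
    return [r['translatedText'] for r in results]

print(translate_subtitle_lines(['こんにちは、世界']))
```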