Ganya-A opened this issue 1 year ago
Did you try without the --vad_filter to compare the results? It would be interesting to know whether it comes from the alignment part or the VAD segmentation part.
I just tested it and it seems to be a problem with the VAD: whisper and VAD used together don't have this problem. whisperx aligns by default, and I don't know the command to turn off the alignment.
Performing VAD...
~~ Transcribing VAD chunk: (00:00.008 --> 00:27.650) ~~
[00:00.000 --> 00:02.000] Детское радио!
[00:03.280 --> 00:07.560] Сегодня в эфире Детского радио мы отмечаем День зимних видов спорта.
[00:07.560 --> 00:13.280] Уже вспомнили Олимпиаду в Сочи, поговорили об Олимпийских видах спорта.
[00:13.280 --> 00:16.600] Даня вот сказал, что действительно Олимпиады — это такая гордость.
[00:16.600 --> 00:20.880] И, конечно, любой спортсмен мечтает участвовать в Олимпиаде.
[00:20.880 --> 00:27.760] Ну, а получить какую-то награду на Олимпиаде — это, мне кажется, вообще успех.
~~ Transcribing VAD chunk: (00:27.650 --> 00:57.637) ~~
[00:00.000 --> 00:06.840] достижения и наслаждения просто для спортсменов. Сегодня в эфир детского
[00:06.840 --> 00:12.120] радио мы позвали олимпийскую чемпионку по фигурному катанию, чемпионку мира и
[00:12.120 --> 00:18.120] трехкратную чемпионку России Анну Щербакову. Всем привет! Всем привет! Очень
[00:18.120 --> 00:22.760] рада сегодня быть здесь. Для меня это правда очень приятно, потому что я все
[00:22.760 --> 00:26.880] детство слушала детское радио, даже дозванивалась сюда в эфир, поэтому
[00:26.880 --> 00:29.880] Поэтому оказаться здесь, в студии, для меня особенно классно.
I will investigate this one more, but it feels like it might come from these lines (line 330 to line 333):
mel = mel[:, local_f_start:] # seek forward
prev = seg_f_start
local_mel = mel[:, :local_f_end-local_f_start]
More specifically, the index slicing might not correctly select the right part of the mel spectrogram.
transcribe() can deal with long-form audio (more than 30 s), but if the audio is not correctly sliced this might produce abnormal results.
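To make the suspected indexing issue concrete, here is a minimal, self-contained reconstruction of that slicing loop. The mel spectrogram is a numpy stand-in, the seconds-to-frames conversion assumes whisper's 160-sample hop at 16 kHz, and `to_frames` is a hypothetical helper, not a whisperx function. The key point is that after each `mel = mel[:, ...]` seek, the next segment's start index must be made relative to the already-advanced array:

```python
import numpy as np

SAMPLE_RATE = 16000
HOP_LENGTH = 160          # whisper default: one mel frame per 10 ms

def to_frames(t_sec):
    # hypothetical helper: seconds -> mel-frame index
    return int(t_sec * SAMPLE_RATE / HOP_LENGTH)

# Stand-in mel spectrogram: 80 mel bins x 6000 frames (60 s of audio).
mel = np.zeros((80, 6000), dtype=np.float32)

# Two VAD segments in absolute seconds, as in the log above.
segments = [(0.0, 27.65), (27.65, 57.6)]

shapes = []
prev = 0                                  # absolute frame index already consumed
for seg_start, seg_end in segments:
    seg_f_start, seg_f_end = to_frames(seg_start), to_frames(seg_end)
    local_f_start = seg_f_start - prev    # make the index relative to `mel`
    mel = mel[:, local_f_start:]          # seek forward
    prev = seg_f_start
    local_mel = mel[:, :seg_f_end - seg_f_start]
    shapes.append(local_mel.shape)

print(shapes)  # each chunk's width matches its segment length in frames
```

If `seg_f_start` were used directly instead of `local_f_start` for the second segment, the slice would skip far past the intended audio, which would explain abnormal transcripts on later chunks.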
Furthermore, it might have something to do with the fact that the VAD segment binarization (look at this line) currently has no minimum duration for pauses: even a silence shorter than 1 second will split the audio. This could lead to something we would consider a single segment (with small pauses) being treated by the VAD model as 2 segments, producing a two-part transcription that would likely have performed better if fed as a single segment.
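One way to mitigate this, as a rough sketch: post-process the binarized segments and merge any pair separated by less than a minimum pause. The function and the `min_gap` parameter below are hypothetical names, not existing whisperx options:

```python
def merge_close_segments(segments, min_gap=1.0):
    """Merge (start, end) VAD segments whose gap to the previous
    segment is shorter than `min_gap` seconds."""
    merged = []
    for start, end in segments:
        if merged and start - merged[-1][1] < min_gap:
            # Pause is too short to be a real break: extend the last segment.
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged

# A 0.4 s pause gets absorbed; a 3 s pause stays a boundary.
result = merge_close_segments([(0.0, 4.2), (4.6, 9.0), (12.0, 15.0)])
print(result)  # [(0.0, 9.0), (12.0, 15.0)]
```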
Yeah, it seems the VAD filter cuts off some voice segments in my case (English language).
@m-bain is it possible to make the chunks a bit longer? These 3-4 second chunks seem too numerous; I'd prefer ~20 s chunks. Is there any config to change?
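I'm not aware of a built-in config for this, but as a sketch, short VAD segments can be greedily concatenated into larger windows before transcription. `merge_chunks` and `chunk_size` are hypothetical names for illustration, not existing whisperx parameters:

```python
def merge_chunks(segments, chunk_size=20.0):
    """Greedily concatenate consecutive (start, end) VAD segments until
    the combined span would exceed `chunk_size` seconds."""
    chunks = []
    cur_start, cur_end = segments[0]
    for start, end in segments[1:]:
        if end - cur_start <= chunk_size:
            cur_end = end                     # still fits in the window
        else:
            chunks.append((cur_start, cur_end))
            cur_start, cur_end = start, end   # start a new window
    chunks.append((cur_start, cur_end))
    return chunks

# Four short segments collapse into two ~14 s and ~7 s windows.
print(merge_chunks([(0.0, 4.0), (5.0, 9.0), (10.0, 14.0), (15.0, 22.0)]))
```

This keeps the VAD boundaries for the window edges while giving the transcriber longer, more coherent context, which is usually what matters for mid-sentence cuts.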
Also, how can I use --vad_filter in Python with my own dictionary while loading the model in Python code, so I don't have to load it every time a voice file is received?
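For the "load once" part, a generic caching pattern works independently of the exact whisperx API (which is version-dependent, so the `loader` callable below is just a stand-in for whatever load call your installed version exposes):

```python
class CachedModel:
    """Load a model once, on first use, and reuse it for every request."""

    def __init__(self, loader):
        # `loader` is any zero-argument callable that returns the model,
        # e.g. a lambda wrapping your whisperx load call.
        self._loader = loader
        self._model = None

    def get(self):
        if self._model is None:       # only the first call pays the load cost
            self._model = self._loader()
        return self._model

# Demo with a counting stand-in loader instead of a real model load:
calls = []
cache = CachedModel(lambda: calls.append(1) or "model")
m1, m2 = cache.get(), cache.get()     # loader runs exactly once
```

In a server loop you would construct `CachedModel` once at startup and call `.get()` inside the request handler.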
I am currently working on a fix for this and might open a pull request soon. I hope you will be able to test it and see whether it works well on some of the failing cases you faced.
whisperx:
7
00:00:27,870 --> 00:00:34,551
достижения и наслаждения просто для спортсменов. Сегодня в эфир детского
8
00:00:34,591 --> 00:00:39,812
радио мы позвали олимпийскую чемпионку по фигурному катанию, чемпионку мира и
9
00:00:39,852 --> 00:00:45,834
трехкратную чемпионку России Анну Щербакову. Всем привет! Всем привет! Очень
whisper:
7
00:00:27,880 --> 00:00:32,734
достижения и наслаждения просто для спортсменов.
8
00:00:33,081 --> 00:00:38,595
Сегодня в эфир детского радио мы позвали олимпийскую чемпионку по фигурному катанию,
9
00:00:38,760 --> 00:00:43,333
чемпионку мира и трехкратную чемпионку России Анну Щербакову.
After I added "--vad_filter --align_model jonatasgrosman/wav2vec2-large-xlsr-53-russian", there are many such cases in the subtitles where the text is cut off in the middle of a sentence. What is causing this, and can it be fixed? Thanks.
2023-02-23 15:36:43.844293: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-23 15:36:45.178881: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-02-23 15:36:45.179017: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-02-23 15:36:45.179043: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
This warning occurred during the run; I don't know whether it has any effect on the results.