lfcnassif opened this issue 1 year ago
Thanks, I was aware of the first reference, but not of the second. I haven't finished yet; I will try to normalize numbers and run wav2vec2 with a language model.
Hi @lfcnassif ,
I did some tests with faster-whisper, using that test script of yours and replacing the contents of the 'Wav2Vec2Process.py' file. I also used the model made by @DHoelz ( dwhoelz/whisper-medium-pt-ct2 ).
I got better transcriptions than with wav2vec, but the performance is worse: 2x slower.
The strange thing is that when setting the OMP_NUM_THREADS parameter to half of the total logical cores, I got better performance, both locally and on the IPED transcription server.
I also managed to compute the 'finalscore' in faster-whisper. Please check whether it is correct.
Below is the result of a small test I did running the IPED server (CPU mode only).
Machine: 2 sockets, 24 logical cores (2 Python processes for transcription)
OMP_NUM_THREADS = number of threads
- 10 audios, 530 seconds, 12 threads (total CPU usage 100%)
- 10 audios, 509 seconds, 6 threads (total CPU usage 60%)
Perhaps the best configuration of "OMP_NUM_THREADS" is:
```python
import os
import psutil

logical_cores = psutil.cpu_count(logical=True)
cpu_sockets = 2  # Find a way to get this value in Python
threads = int(logical_cores / cpu_sockets / 2)
os.environ["OMP_NUM_THREADS"] = str(threads)
```
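As a rough way to fill in the `cpu_sockets` placeholder above on Linux, one could count distinct "physical id" entries in `/proc/cpuinfo`. The helper below is my own illustrative sketch, not code from this thread; the fallback to 1 socket is an assumption:

```python
def count_cpu_sockets(cpuinfo_text):
    # Each distinct "physical id" in /proc/cpuinfo corresponds to one socket.
    ids = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("physical id"):
            ids.add(line.split(":", 1)[1].strip())
    return len(ids) or 1  # fall back to 1 if the field is absent

def omp_threads(logical_cores, sockets):
    # Half the logical cores of one socket, as suggested above.
    return max(1, logical_cores // sockets // 2)

# Simulated /proc/cpuinfo content for a 2-socket machine:
sample = (
    "processor : 0\nphysical id : 0\n"
    "processor : 1\nphysical id : 0\n"
    "processor : 2\nphysical id : 1\n"
    "processor : 3\nphysical id : 1\n"
)
print(count_cpu_sockets(sample))  # 2
print(omp_threads(24, 2))         # 6
```

On a real system one would pass `open("/proc/cpuinfo").read()` instead of the sample text; this is Linux-specific.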
Now a question: would it be possible to make the wav2vec2 remote server more generic so it also accepts faster-whisper (through a configuration parameter)?
I was also able to get faster-whisper to work offline.
Modified script to compute the finalscore:
```python
import sys
import numpy

stdout = sys.stdout
sys.stdout = sys.stderr

terminate = 'terminate_process'
model_loaded = 'wav2vec2_model_loaded'
huggingsound_loaded = 'huggingsound_loaded'
finished = 'transcription_finished'
ping = 'ping'

def main():
    #modelName = 'medium'
    modelName = 'dwhoelz/whisper-medium-pt-ct2'
    #modelName = sys.argv[1]
    deviceNum = sys.argv[2]

    import os
    os.environ["OMP_NUM_THREADS"] = "6"

    from faster_whisper import WhisperModel
    print(huggingsound_loaded, file=stdout, flush=True)

    #import torch
    #cudaCount = torch.cuda.device_count()
    # Run just on CPU for now
    cudaCount = 0
    print(str(cudaCount), file=stdout, flush=True)

    if cudaCount > 0:
        deviceId = 'cuda:' + deviceNum
    else:
        deviceId = 'cpu'

    try:
        model = WhisperModel(modelName, device=deviceId, compute_type="int8")
    except Exception as e:
        if deviceId != 'cpu':
            # loading on GPU failed (OOM?), try on CPU
            deviceId = 'cpu'
            model = WhisperModel(model_size_or_path=modelName, device=deviceId, compute_type="int8")
        else:
            raise e

    print(model_loaded, file=stdout, flush=True)
    print(deviceId, file=stdout, flush=True)

    while True:
        line = input()
        if line == terminate:
            break
        if line == ping:
            print(ping, file=stdout, flush=True)
            continue
        transcription = ''
        probs = []
        try:
            segments, info = model.transcribe(audio=line, language='pt', beam_size=5, word_timestamps=True)
            for segment in segments:
                transcription += segment.text
                words = segment.words
                if words is not None:
                    probs += [word.probability for word in words]
        except Exception as e:
            msg = repr(e).replace('\n', ' ').replace('\r', ' ')
            print(msg, file=stdout, flush=True)
            continue
        text = transcription.replace('\n', ' ').replace('\r', ' ')
        probs = probs if len(probs) != 0 else [0]
        finalScore = numpy.average(probs)
        print(finished, file=stdout, flush=True)
        print(str(finalScore), file=stdout, flush=True)
        print(text, file=stdout, flush=True)
    return

if __name__ == "__main__":
    main()
```
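The `finalScore` in the script above is simply the arithmetic mean of the per-word probabilities that faster-whisper reports, with 0 as the fallback when no words were produced. A dependency-free sketch of that computation (the function name is mine):

```python
def final_score(word_probabilities):
    # Mean of per-word probabilities; 0.0 when no words were produced,
    # mirroring the `probs if len(probs) != 0 else [0]` guard in the script.
    if not word_probabilities:
        return 0.0
    return sum(word_probabilities) / len(word_probabilities)

print(final_score([0.9, 0.8, 0.7]))  # ~0.8
print(final_score([]))               # 0.0
```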
Thank you @gfd2020!
> I got better transcriptions than with wav2vec, but the performance is worse: 2x slower.

Did you measure WER or use another evaluation metric?

> I also managed to compute the 'finalscore' in faster-whisper. Please check whether it is correct.
Thank you very much, that is very important!
> Now a question: would it be possible to make the wav2vec2 remote server more generic so it also accepts faster-whisper (through a configuration parameter)?
Sure, that is the goal; the final integration will use a configuration approach.
> Did you measure WER or use another evaluation metric?

Unfortunately I did not measure WER; it was just manual checking of the texts obtained. The Whisper model also produces punctuation and capitalization, and takes less memory, in my case 1-1.5 GB per Python process.
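For a quick automated comparison instead of manual checking, WER is the word-level Levenshtein distance divided by the reference word count. A minimal sketch (illustrative only, not the evaluation code used later in this thread):

```python
def wer(reference, hypothesis):
    # Word Error Rate: word-level edit distance / number of reference words.
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + cost  # substitution / match
                           ))
        prev = cur
    return prev[-1] / max(1, len(ref))

print(wer("o gato subiu no telhado", "o gato subiu telhado"))  # 0.2
```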
Try whisper.cpp.
Seems whisper.cpp has improved a lot since the last time I tested it. Now it has NVIDIA GPU support:
https://github.com/ggerganov/whisper.cpp#nvidia-gpu-support
It may be worth another try, what do you think @fsicoli?
> It may be worth another try
Tested the speed some minutes ago: for a 434s audio, the medium model took 35s and the large-v3 model took 65s to transcribe using 1 RTX 3090. Seems a bit faster than faster-whisper on that GPU.
> Tested the speed some minutes ago: for a 434s audio, the medium model took 35s and the large-v3 model took 65s to transcribe using 1 RTX 3090. Seems a bit faster than faster-whisper on that GPU.

Is there some snapshot for testing? Or a script we could put in IPED like the one above?
> Is there some snapshot for testing? Or a script we could put in IPED like the one above?

No, I just did a preliminary test of whisper.cpp directly on a single audio from the command line, without IPED.
I changed the parameter from beam_size=5 to beam_size=1; performance improved by 35% and the quality was roughly the same.
> Is there some snapshot for testing? Or a script we could put in IPED like the one above?

> No, I just did a preliminary test of whisper.cpp directly on a single audio from the command line, without IPED.

If it is integrated into IPED, would it be via Java JNA and the DLL?

> If it is integrated into IPED, would it be via Java JNA and the DLL?
You mean this? https://github.com/ggerganov/whisper.cpp/blob/master/bindings/java/README.md
Possibly. Since directly linked native code may cause application crashes (as I experienced with faster-whisper), there are other options too, like the whisper server: https://github.com/ggerganov/whisper.cpp/tree/master/examples/server
Or custom server process code without the HTTP overhead.
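A custom server without the HTTP overhead could be as simple as a line-based TCP protocol mirroring the stdin/stdout one in the script above. This sketch is illustrative only; the handler class and the stubbed-out model call are my own, not an existing IPED component:

```python
import socketserver

class TranscribeHandler(socketserver.StreamRequestHandler):
    # One client connection speaking a line protocol: the client sends
    # one audio path per line, the server answers with a status line.
    # The Whisper call itself is stubbed out in this sketch.
    def handle(self):
        while True:
            raw = self.rfile.readline()
            if not raw:
                break
            line = raw.decode("utf-8").strip()
            if line == "terminate_process":
                break
            if line == "ping":
                self.wfile.write(b"ping\n")
                continue
            # A real server would run model.transcribe(line) here and
            # reply with the final score and text, as in the script above.
            reply = "transcription_finished\t" + line + "\n"
            self.wfile.write(reply.encode("utf-8"))
```

It would be served with something like `socketserver.ThreadingTCPServer(("0.0.0.0", port), TranscribeHandler).serve_forever()`; error handling and authentication are omitted.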
I also fiddled around with several Whisper solutions and ended up with a simple client-server solution.
On one hand, there is an IPED Python task which pushes all audio and video files for further processing to a network share. On the other hand, there is a separate background process which watches those shares, transcribes and translates the media files, and writes back a JSON file with the results. These JSON files are finally parsed by the IPED task and merged into the files' metadata.
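The exchange described above (the task drops media files on a share, a watcher writes back JSON results) could be sketched roughly like this; the function names, the single-pass polling, and the `transcribe` stub are illustrative, not @hilderonny's actual code:

```python
import json
import os

def transcribe(path):
    # Placeholder for the real Whisper call inside the background process.
    return {"text": "transcription of " + os.path.basename(path)}

def watch_once(inbox, outbox):
    # One polling pass of the background process: for each media file on
    # the share that has no result yet, write a JSON file with the
    # transcription that the IPED task can later parse and merge.
    for name in sorted(os.listdir(inbox)):
        src = os.path.join(inbox, name)
        dst = os.path.join(outbox, name + ".json")
        if os.path.isfile(src) and not os.path.exists(dst):
            with open(dst, "w", encoding="utf-8") as f:
                json.dump(transcribe(src), f, ensure_ascii=False)
```

A real watcher would run this in a loop with a short sleep and guard against files still being copied (e.g. only process files after a rename marks them complete).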
This gives you three advantages:
Here are the repositories for the task and the background process:
Maybe you find the solution useful.
Greetings, Ronny
Thanks @hilderonny for sharing your solution!
Which Whisper implementation are you using? Standard whisper, faster-whisper, whisper.cpp, whisper-jax?
I am using faster-whisper because this implementation is also able to separate speakers by splitting the transcription into parts, and it is a lot faster at processing the media files.
I'm evaluating 3 other Whisper implementations: Whisper.cpp, Insanely-Fast-Whisper, and WhisperX. The last 2 are much faster for long audios, since they break them into 30s pieces and execute batch inference on many audio segments at the same time, at the cost of higher GPU VRAM usage. Quoting https://github.com/sepinf-inc/IPED/pull/2165#issuecomment-2058175055:
> Speed numbers over a single 442s audio using 1 RTX 3090, medium model, float16 precision (except whisper.cpp, since it can't be set):
>
> Running over the 151 small real world audios dataset with total duration of 2758s:
>
> PS: Whisper.cpp seems to parallelize better than the others using multiple processes, so its last number could be improved.
> PS2: For inference on CPU, Whisper.cpp is faster than Faster-Whisper by ~35%; not sure if I will time all of them on CPU...
> PS3: Using the large-v3 model with Whisper.cpp produced hallucinations (repeated texts and a few non-existing texts); this was also observed with Faster-Whisper at a lower level.
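The chunk-and-batch approach mentioned above (Insanely-Fast-Whisper and WhisperX split long audio into 30 s windows and run inference over many windows at once) can be illustrated with a small helper. The 16 kHz sample rate and 30 s window follow Whisper's convention; the function itself is just an illustration, not code from any of those projects:

```python
def chunk_audio(samples, sample_rate=16000, chunk_seconds=30):
    # Split a long signal into fixed 30 s windows; the batched
    # implementations then infer over many such windows in parallel,
    # trading GPU VRAM for throughput.
    step = sample_rate * chunk_seconds
    return [samples[i:i + step] for i in range(0, len(samples), step)]

# A 442 s mono signal at 16 kHz becomes 15 windows: 14 full + 1 partial.
fake_audio = [0.0] * (442 * 16000)
chunks = chunk_audio(fake_audio)
print(len(chunks))        # 15
print(len(chunks[0]))     # 480000 samples = 30 s
```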
Updating WER stats with WhisperX + largeV2 model:
Average WER difference to the Faster-Whisper + largeV2 model is just +0.0018. WhisperX was the best on the TEDx data set so far, a spontaneous speech data set, also better than FasterWhisper + Jonatas Grosman's Portuguese fine-tuned largeV2 model.
PS: I'll review the WER of Faster-Whisper + largeV2 on VoxForge; it seems an outlier.
PS2: WhisperX + largeV2 took 3h30 to transcribe all data sets (29h duration) while Faster-Whisper + largeV2 took 5h on RTX 3090.
Updating results with WhisperX + medium model:
It took 2h30 to transcribe the whole 29h data set, while FasterWhisper + medium took 3h30 (both using float16 and beam_size=5) on an RTX 3090.
I also fixed Faster-Whisper + largeV2 on VoxForge, it was missing a zero...
Updating results with WhisperX + LargeV3 model and WhisperX + JonatasGrosman's LargeV2 fine tuned model for portuguese:
- WhisperX + LargeV3 model took 3h30m
- WhisperX + JonatasGrosman's LargeV2 model took 3h45m
I'll try to prototype a Whisper.cpp integration to make its WER evaluation easier on those data sets, since it is faster on CPU and could be an option for some users.
PS: I think default Whisper + LargeV2 model WER numbers are quite strange, I'll review them too.
Revised numbers for the Whisper reference implementation + LargeV2 model; they didn't change much:
I'm running WER evaluation with Whisper.cpp implementation and will post them soon.
Updating stats with Whisper.cpp (medium, largeV2 & largeV3 models), Faster-Whisper + LargeV3, running time of Whisper models and number of empty transcriptions from all 22,246 audios:
Comments:
To finish this evaluation, 2 tasks are needed:
> Speed numbers over a single 442s audio using 1 RTX 3090, medium model, float16 precision (except whisper.cpp since it can't be set):
> - Faster-Whisper took ~36s
> - Whisper.cpp took ~31s
> - Insanely-Fast-Whisper took ~7s
> - WhisperX took ~5s
Transcribing the same 442s audio, medium model, int8 precision (except whisper.cpp, since it can't be set), but on a 24-thread CPU:
Just found this study from March 2024: https://amgadhasan.substack.com/p/sota-asr-tooling-long-form-transcription
It also supports that our current choice of WhisperX is a good one (I'm just not very happy with WhisperX's dependencies size...)
> Just found this study from March 2024: https://amgadhasan.substack.com/p/sota-asr-tooling-long-form-transcription
> It also supports that our current choice of WhisperX is a good one (I'm just not very happy with WhisperX's dependencies size...)

Can't it be put as an option? Then whoever decides to use WhisperX instead of Faster-Whisper would have to download the dependency pack, such as PyTorch and others.

> Can't it be put as an option? Then whoever decides to use WhisperX instead of Faster-Whisper would have to download the dependency pack, such as PyTorch and others.
It could, but I don't plan to. Package size is the least important aspect to us, in my opinion (otherwise we should use Whisper.cpp, which is very small). WhisperX has similar accuracy, is generally faster, and is much faster on long audios; that's more important from my point of view. And keeping 2 different implementations increases the maintenance effort.
Hi @marcus6n. How is the real-world audio-transcription data set curation going? It's just 1h of audio; do you think you can finish it today or on Thursday?
Started the evaluation on the 1h real world non public audio data set yesterday, thanks @marcus6n for double checking the transcriptions and @wladimirleite for sending half of them! Preliminary results below (averages still not updated):
Recently made public: https://openai.com/blog/whisper/ https://github.com/openai/whisper
Interesting, they have some multilingual models that can be used for multiple languages without fine tuning for each language. They claim their models generalize better than models that need fine tuning, like wav2vec. Some numbers on Fleurs dataset (e.g. 4.8% WER on Portuguese subset): https://github.com/openai/whisper#available-models-and-languages