sepinf-inc / IPED

IPED Digital Forensic Tool. It is open-source software that can be used to process and analyze digital evidence, often seized at crime scenes by law enforcement or in corporate investigations by private examiners.

Evaluate Whisper transcription algorithm #1335

Open lfcnassif opened 1 year ago

lfcnassif commented 1 year ago

Recently made public: https://openai.com/blog/whisper/ https://github.com/openai/whisper

Interesting, they have some multilingual models that can be used for multiple languages without fine tuning for each language. They claim their models generalize better than models that need fine tuning, like wav2vec. Some numbers on Fleurs dataset (e.g. 4.8% WER on Portuguese subset): https://github.com/openai/whisper#available-models-and-languages
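For reference, a minimal usage sketch with the reference openai-whisper package (the model size and the audio file name below are just illustrative placeholders):

```python
# Minimal sketch: transcribe a Portuguese audio file with the reference
# openai-whisper package. Model size and file name are illustrative.
import whisper

model = whisper.load_model("medium")
result = model.transcribe("audio.wav", language="pt")
print(result["text"])
```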

lfcnassif commented 11 months ago

Thanks, I was aware of the first reference, but not of the second. I haven't finished yet; I will try to normalize numbers and to run wav2vec2 with a language model.

gfd2020 commented 7 months ago

Hi @lfcnassif ,

I did some tests with faster-whisper, using that test script of yours and replacing the contents of the 'Wav2Vec2Process.py' file. I also used the model made by @DHoelz (dwhoelz/whisper-medium-pt-ct2).

I got better transcriptions than wav2vec, but the performance is worse, 2x slower.

The strange thing is that when setting the OMP_NUM_THREADS parameter to half of the total logical cores, I got better performance, both locally and on the IPED transcription server.

I also managed to compute the 'finalScore' with faster-whisper. Please check whether it is correct.

Below is the result of a small test I did running the IPED Server (CPU mode only).

Machine: 2 sockets - 24 logical cores (2 Python processes for transcription)

OMP_NUM_THREADS = num of threads

10 audios, 530 seconds - 12 threads (total CPU usage 100%)
10 audios, 509 seconds - 6 threads (total CPU usage 60%)

Perhaps the best configuration of "OMP_NUM_THREADS" is:

```python
import os
import psutil

logical_cores = psutil.cpu_count(logical=True)
cpu_sockets = 2  # Find a way to get this value in Python
threads = int(logical_cores / cpu_sockets / 2)
os.environ["OMP_NUM_THREADS"] = str(threads)
```
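One possible way to fill in the cpu_sockets value automatically (just a sketch; the Linux and Windows detection paths below are assumptions, falling back to 1 if detection fails):

```python
# Sketch: detect the number of CPU sockets (assumption: /proc/cpuinfo on Linux,
# wmic on Windows; falls back to 1 if detection fails).
import platform
import subprocess

def get_cpu_sockets():
    try:
        if platform.system() == 'Linux':
            # Count distinct "physical id" entries in /proc/cpuinfo
            with open('/proc/cpuinfo') as f:
                ids = {line.split(':')[1].strip()
                       for line in f if line.startswith('physical id')}
            return max(len(ids), 1)
        if platform.system() == 'Windows':
            # wmic prints one row per socket (plus a header line)
            out = subprocess.check_output(['wmic', 'cpu', 'get', 'DeviceID'], text=True)
            rows = [l for l in out.splitlines() if l.strip() and 'DeviceID' not in l]
            return max(len(rows), 1)
    except Exception:
        pass
    return 1
```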

Now a question: would it be possible to make the wav2vec2 remote server more generic, so it also accepts faster-whisper (through a configuration parameter)?

I was also able to get faster-whisper to work offline.

Modified script to compute the finalScore:

```python
import sys
import numpy

# The Java side reads protocol messages and results from stdout,
# so redirect regular prints to stderr
stdout = sys.stdout
sys.stdout = sys.stderr

# Protocol messages exchanged with the IPED transcription task
# (names kept from the wav2vec2 script for compatibility)
terminate = 'terminate_process'
model_loaded = 'wav2vec2_model_loaded'
huggingsound_loaded = 'huggingsound_loaded'
finished = 'transcription_finished'
ping = 'ping'

def main():

    #modelName = 'medium'
    modelName = 'dwhoelz/whisper-medium-pt-ct2'
    #modelName = sys.argv[1]

    deviceNum = sys.argv[2]

    import os
    os.environ["OMP_NUM_THREADS"] = "6"

    from faster_whisper import WhisperModel

    print(huggingsound_loaded, file=stdout, flush=True)

    #import torch
    #cudaCount = torch.cuda.device_count()

    # Run just on CPU for now
    cudaCount = 0

    print(str(cudaCount), file=stdout, flush=True)

    if cudaCount > 0:
        # faster-whisper expects device='cuda' plus a numeric device_index,
        # not a combined 'cuda:N' string
        deviceId = 'cuda:' + deviceNum
        device, deviceIndex = 'cuda', int(deviceNum)
    else:
        deviceId = 'cpu'
        device, deviceIndex = 'cpu', 0

    try:
        model = WhisperModel(modelName, device=device, device_index=deviceIndex, compute_type="int8")

    except Exception as e:
        if device != 'cpu':
            # loading on GPU failed (OOM?), try on CPU
            deviceId = 'cpu'
            model = WhisperModel(model_size_or_path=modelName, device='cpu', compute_type="int8")
        else:
            raise e

    print(model_loaded, file=stdout, flush=True)
    print(deviceId, file=stdout, flush=True)

    while True:

        line = input()

        if line == terminate:
            break
        if line == ping:
            print(ping, file=stdout, flush=True)
            continue

        transcription = ''
        probs = []
        try:
            # Each input line is the path of an audio file to transcribe
            segments, info = model.transcribe(audio=line, language='pt', beam_size=5, word_timestamps=True)
            for segment in segments:
                transcription += segment.text
                words = segment.words
                if words is not None:
                    probs += [word.probability for word in words]
        except Exception as e:
            msg = repr(e).replace('\n', ' ').replace('\r', ' ')
            print(msg, file=stdout, flush=True)
            continue

        text = transcription.replace('\n', ' ').replace('\r', ' ')

        # finalScore = average word-level probability (0 if no words were returned)
        probs = probs if len(probs) != 0 else [0]
        finalScore = numpy.average(probs)

        print(finished, file=stdout, flush=True)
        print(str(finalScore), file=stdout, flush=True)
        print(text, file=stdout, flush=True)

    return

if __name__ == "__main__":
    main()
```

lfcnassif commented 7 months ago

Thank you @gfd2020!

I got better transcriptions than wav2vec, but the performance is worse, 2x slower.

Did you measure WER or use another evaluation metric?

I also managed to compute the 'finalScore' with faster-whisper. Please check whether it is correct.

Thank you very much, that is very important!

Now a question: would it be possible to make the wav2vec2 remote server more generic, so it also accepts faster-whisper (through a configuration parameter)?

Sure. That is the goal, the final integration will use a configuration approach.

gfd2020 commented 7 months ago

Did you measure WER or use another evaluation metric?

Unfortunately, I did not measure WER; it was just a manual check of the transcribed texts. The Whisper model also produces punctuation and capitalization and uses less memory, in my case 1-1.5 GB per Python process.

lfcnassif commented 7 months ago

Try whisper.cpp.

Seems whisper.cpp has improved a lot since the last time I tested it. Now they have NVIDIA GPU support:

https://github.com/ggerganov/whisper.cpp#nvidia-gpu-support

It may be worth another try, what do you think @fsicoli?

lfcnassif commented 7 months ago

It may be worth another try

Tested the speed a few minutes ago: for a 434s audio, the medium model took 35s and the large-v3 model took 65s to transcribe using 1 RTX 3090. It seems a bit faster than faster-whisper on that GPU.

rafael844 commented 7 months ago

Tested the speed a few minutes ago: for a 434s audio, the medium model took 35s and the large-v3 model took 65s to transcribe using 1 RTX 3090. It seems a bit faster than faster-whisper on that GPU.

Is there a snapshot for testing? Or a script we could put into IPED like the one above?

lfcnassif commented 7 months ago

Is there a snapshot for testing? Or a script we could put into IPED like the one above?

No, I just did a preliminary test of whisper.cpp directly on a single audio from the command line, without IPED.

gfd2020 commented 7 months ago

I changed the beam_size parameter from 5 to 1: performance improved by about 35% and the quality was more or less the same.

gfd2020 commented 7 months ago

Is there a snapshot for testing? Or a script we could put into IPED like the one above?

No, I just did a preliminary test of whisper.cpp directly on a single audio from the command line, without IPED.

If it is integrated into IPED, would it be via Java JNA and the DLL?

lfcnassif commented 7 months ago

If it is integrated into IPED, would it be via Java JNA and the DLL?

You mean this? https://github.com/ggerganov/whisper.cpp/blob/master/bindings/java/README.md

Possibly. But since directly linked native code may cause application crashes (as I experienced with faster-whisper), there are other options too, like the whisper.cpp server: https://github.com/ggerganov/whisper.cpp/tree/master/examples/server

Or custom server process code without the HTTP overhead.
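For illustration, a minimal client sketch against the whisper.cpp example server (assuming the default 127.0.0.1:8080 address and the /inference endpoint shown in the server example's README; the parameters and file name are placeholders):

```python
# Sketch: send an audio file to a running whisper.cpp example server and
# read back the transcription. Address, endpoint and parameters are assumptions.
import requests

def transcribe_via_whisper_cpp_server(audio_path, url='http://127.0.0.1:8080/inference'):
    with open(audio_path, 'rb') as f:
        resp = requests.post(url,
                             files={'file': f},
                             data={'temperature': '0.0', 'response_format': 'json'})
    resp.raise_for_status()
    return resp.json().get('text', '')

# print(transcribe_via_whisper_cpp_server('sample.wav'))  # hypothetical file
```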

hilderonny commented 6 months ago

I also fiddled around with several Whisper solutions and ended up with a simple client-server solution.

On the one hand, there is an IPED Python task which pushes all audio and video files for further processing to a network share. On the other hand, there is a separate background process which watches those shares, transcribes and translates the media files, and writes back a JSON file with the results. These JSON files are finally parsed by the IPED task and merged into the metadata of the files (a minimal sketch of the watcher follows the list below).

This gives you three advantages:

  1. You serialize the processing of the files even when you have many workers, so you can transcribe even on a machine with low computing power and a smaller GPU.
  2. The results are indexed by IPED and can be searched via keywords.
  3. You can start as many background processes on as many different network machines as you want to speed up the processing. This helped me with a case with thousands of Arabic voice messages to process.
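As a rough illustration of this approach (not the actual repository code; the share path, file extensions, JSON fields and faster-whisper settings are all assumptions), the background worker could be a simple polling loop:

```python
# Sketch of a watch-folder transcription worker: poll a shared folder,
# transcribe new media files with faster-whisper and write a JSON result
# next to each file. Paths, fields and settings are illustrative.
import json
import time
from pathlib import Path

from faster_whisper import WhisperModel

WATCH_DIR = Path(r'\\server\share\transcription-queue')  # hypothetical share
MEDIA_EXTS = {'.wav', '.mp3', '.m4a', '.ogg', '.opus', '.mp4'}

model = WhisperModel('medium', device='cuda', compute_type='float16')

while True:
    for media in WATCH_DIR.iterdir():
        result_file = media.parent / (media.name + '.json')
        if media.suffix.lower() not in MEDIA_EXTS or result_file.exists():
            continue
        segments, info = model.transcribe(str(media), beam_size=5)
        text = ' '.join(segment.text.strip() for segment in segments)
        result_file.write_text(
            json.dumps({'file': media.name, 'language': info.language, 'text': text},
                       ensure_ascii=False),
            encoding='utf-8')
    time.sleep(10)  # poll interval
```

The IPED-side task would then look for the matching .json file and merge its contents into the item's metadata.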

Here are the repositories for the task and the background process:

Maybe you find the solution useful.

Greetings, Ronny

lfcnassif commented 6 months ago

Thanks @hilderonny for sharing your solution!

Which Whisper implementation are you using? Standard whisper, faster-whisper, whisper.cpp, whisper-jax?

hilderonny commented 6 months ago

I am using faster-whisper because this implementation is also able to separate speakers by splitting up the transcription into parts and is a lot faster in processing the media files.

lfcnassif commented 3 months ago

I'm evaluating 3 other Whisper implementations: Whisper.cpp, Insanely-Fast-Whisper and WhisperX. The last 2 are much, much faster for long audios, since they break them into 30s pieces and run batch inference on many audio segments at the same time, at the cost of higher GPU VRAM usage (a minimal batching sketch follows the quoted numbers below). Quoting https://github.com/sepinf-inc/IPED/pull/2165#issuecomment-2058175055:

Speed numbers over a single 442s audio using 1 RTX 3090, medium model, float16 precision (except whisper.cpp since it can't be set):

  • Faster-Whisper took ~36s
  • Whisper.cpp took ~31s
  • Insanely-Fast-Whisper took ~7s
  • WhisperX took ~5s

Running over the 151 small real world audios dataset with total duration of 2758s:

PS: Whisper.cpp seems to parallelize better than the others when using multiple processes, so its last number could be improved.
PS2: For inference on CPU, Whisper.cpp is faster than Faster-Whisper by ~35%; not sure if I will time all of them on CPU...
PS3: Using the large-v3 model with Whisper.cpp produced hallucinations (repeated texts and a few non-existent texts); this was also observed with Faster-Whisper, at a lower level.
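For illustration, a minimal WhisperX batched-inference sketch (the model size, language, batch size and file name are assumptions, not the exact settings used in the numbers above):

```python
# Sketch of WhisperX batched inference: the audio is split into ~30s chunks
# and the chunks are transcribed in batches on the GPU, trading VRAM for speed.
import whisperx

device = 'cuda'
model = whisperx.load_model('medium', device, compute_type='float16')

audio = whisperx.load_audio('sample.wav')  # hypothetical file
result = model.transcribe(audio, batch_size=16, language='pt')

for segment in result['segments']:
    print(segment['start'], segment['end'], segment['text'])
```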

lfcnassif commented 3 months ago

Updating WER stats with WhisperX + largeV2 model: image

Average WER difference to the Faster-Whisper + largeV2 model is just +0.0018. WhisperX was the best on the TEDx data set so far, a spontaneous speech data set, also better than Faster-Whisper + Jonatas Grosman's Portuguese fine-tuned largeV2 model.

PS: I'll review the WER of Faster-Whisper + largeV2 on VoxForge; it seems to be an outlier.

PS2: WhisperX + largeV2 took 3h30m to transcribe all data sets (29h duration) while Faster-Whisper + largeV2 took 5h on the RTX 3090.

lfcnassif commented 3 months ago

Updating results with WhisperX + medium model: image

It took 2h30m to transcribe the whole 29h data set, while Faster-Whisper + medium took 3h30m (both using float16 and beam_size=5) on the RTX 3090.

I also fixed the Faster-Whisper + largeV2 result on VoxForge; it was missing a zero...

lfcnassif commented 3 months ago

Updating results with the WhisperX + LargeV3 model and WhisperX + JonatasGrosman's LargeV2 model fine-tuned for Portuguese: image

WhisperX + LargeV3 model took 3h30m.
WhisperX + JonatasGrosman's LargeV2 model took 3h45m.

I'll try to prototype a Whisper.cpp integration to make its WER evaluation easier on those data sets, since it is faster on CPU and could be an option for some users.

PS: I think the default Whisper + LargeV2 model WER numbers are quite strange; I'll review them too.

lfcnassif commented 2 months ago

Revised numbers for the Whisper reference implementation + LargeV2 model; they didn't change that much: image

I'm running the WER evaluation with the Whisper.cpp implementation and will post the results soon.

lfcnassif commented 2 months ago

Updating stats with Whisper.cpp (medium, largeV2 & largeV3 models), Faster-Whisper + LargeV3, running time of Whisper models and number of empty transcriptions from all 22,246 audios: image

Comments:

To finish this evaluation, 2 tasks are needed:

lfcnassif commented 2 months ago

Speed numbers over a single 442s audio using 1 RTX 3090, medium model, float16 precision (except whisper.cpp since it can't be set):

  • Faster-Whisper took ~36s
  • Whisper.cpp took ~31s
  • Insanely-Fast-Whisper took ~7s
  • WhisperX took ~5s

Transcribing the same 442s audio, medium model, int8 precision (except whisper.cpp since it can't be set), but on a 24-thread CPU:

lfcnassif commented 2 months ago

Just found this study from March 2024: https://amgadhasan.substack.com/p/sota-asr-tooling-long-form-transcription

It also supports that our current choice of WhisperX is a good one (I'm just not very happy with the size of WhisperX's dependencies...).

rafael844 commented 2 months ago

Just found this study from March 2024: https://amgadhasan.substack.com/p/sota-asr-tooling-long-form-transcription

It also supports that our current choice of WhisperX is a good one (I'm just not very happy with the size of WhisperX's dependencies...).

Can't it be put as an option? So whoever decides to use WhisperX instead of Faster-Whisper has to download the dependencies pack, such as PyTorch and others.

lfcnassif commented 2 months ago

Can't it be put as an option? So whoever decides to use WhisperX instead of Faster-Whisper has to download the dependencies pack, such as PyTorch and others.

It could, but I don't plan to: package size is the least important aspect to us, in my opinion (otherwise we should use Whisper.cpp, which is very small). WhisperX has similar accuracy, is generally faster, and is much faster with long audios; that is more important from my point of view. And keeping 2 different implementations increases the maintenance effort.

lfcnassif commented 2 months ago

Hi @marcus6n. How is the curation of the real-world audio transcription data set going? It's just 1h of audio; do you think you can finish it today or on Thursday?

lfcnassif commented 2 months ago

Started the evaluation on the 1h real-world non-public audio data set yesterday. Thanks @marcus6n for double-checking the transcriptions and @wladimirleite for sending half of them! Preliminary results below (averages still not updated): image