sepinf-inc / IPED

IPED Digital Forensic Tool. It is open-source software that can be used to process and analyze digital evidence, often seized at crime scenes by law enforcement or in corporate investigations by private examiners.

Evaluate Whisper transcription algorithm #1335

Open lfcnassif opened 1 year ago

lfcnassif commented 1 year ago

Recently made public: https://openai.com/blog/whisper/ https://github.com/openai/whisper

Interesting, they have some multilingual models that can be used for multiple languages without fine tuning for each language. They claim their models generalize better than models that need fine tuning, like wav2vec. Some numbers on Fleurs dataset (e.g. 4.8% WER on Portuguese subset): https://github.com/openai/whisper#available-models-and-languages

lfcnassif commented 1 year ago

A preliminary run of the largest Whisper model on the TEDx pt-BR dataset resulted in 20.6% WER. Numbers for other models are here: https://github.com/sepinf-inc/IPED/wiki/User-Manual#wav2vec2

The largest whisper model is more than 1 order of magnitude slower than wav2vec2 with 1B params on an RTX 3090, so it is not usable in practice. Maybe one of the smaller whisper models could offer reasonable accuracy and speed.

lfcnassif commented 1 year ago

I tried to transcribe ~10h of audios using the largest whisper model on an RTX 3090; the estimated time to finish was 4 days, so I aborted the test, it is not feasible in practice. The current wav2vec2 algorithm with 1B params took about 22min to transcribe ~29h of audios using 3 RTX 3090s (in 2 nodes), so the largest whisper model is more than 2 orders of magnitude slower than what we have today.

I'll try their smallest model (36x faster) to see how the accuracy is on the test datasets.

rafael844 commented 1 year ago

Hi, is there a way we can test whisper with IPED? Is there a snapshot with it that we could use?

lfcnassif commented 1 year ago

I think I didn't push the POC implementation; the 250x time cost compared to wav2vec2 made me very skeptical about using whisper in production. I didn't test their smaller model yet, but maybe the accuracy will drop a lot.

If you really would like to try it, it is easy to change the script below using the whisper example code on their GitHub main page: https://github.com/sepinf-inc/IPED/blob/master/iped-app/resources/scripts/tasks/Wav2Vec2Process.py

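For reference, the core change amounts to swapping the wav2vec2 calls for something like the snippet below, a minimal sketch based on the usage example on Whisper's GitHub main page (the model name and audio path are placeholders):

import whisper

# "small" is a placeholder; tiny/base/small/medium/large are the published sizes
model = whisper.load_model("small")

# transcribe() loads the audio, runs a sliding 30s window and returns text + segments
result = model.transcribe("audio.mp3", language="pt")
print(result["text"])
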
lfcnassif commented 1 year ago

Their smaller model should still be 7x slower than wav2vec2 according to my tests and their published relative model costs.

rafael844 commented 1 year ago

Thanks @lfcnassif, I don't know how to program very well, but I'll see if a colleague can help me. This was a request from my superiors.

lfcnassif commented 1 year ago

Hi @rafael844,

I just found the multilanguage (crazy to me!) whisper models on Hugging Face: https://huggingface.co/openai/whisper-large-v2

So maybe you just need to set the huggingFaceModel parameter in conf/AudioTranscriptConfig.txt to openai/whisper-large-v2 in IPED 4.1 (I didn't test it).

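Untested, but the change would look something like this in conf/AudioTranscriptConfig.txt (assuming the parameter accepts any Hugging Face model id):

huggingFaceModel = openai/whisper-large-v2
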
Jonatas Grosman also fine-tuned that multilanguage model for Portuguese (https://huggingface.co/jonatasgrosman/whisper-large-pt-cv11), although it is not required, so you can also try jonatasgrosman/whisper-large-pt-cv11 if the above doesn't work.

But I warn you, my past tests resulted in a 250x slowdown compared to wav2vec2. That large whisper model's accuracy seems to be better, and it also has punctuation and caps, but I don't think the 250x cost is worth paying at scale.

You may try smaller whisper models, but accuracy should drop: https://huggingface.co/openai

lfcnassif commented 1 year ago

Just tested, it doesn't work out of the box, needs code changes.

rafael844 commented 1 year ago

Thank you @lfcnassif .

I'll take a look. But with my lack of programming skills and with those results, we will keep wav2vec2 and the existing models. It would be nice to have punctuation and caps, but as you said, 250x is not worth it.

Wav2vec2 does a good job. Even with our cheap and weak GPUs, we can spread the job across multiple machines, which is great, and the results are good so far.

lfcnassif commented 1 year ago

You are welcome!

lfcnassif commented 1 year ago

Now they have a service: https://techcrunch.com/2023/03/01/openai-debuts-whisper-api-for-text-to-speech-transcription-and-translation/

lfcnassif commented 1 year ago

The price is about 1/3 of Microsoft's/Google's.

MariasStory commented 1 year ago

Try whisper.cpp.

lfcnassif commented 1 year ago

Try whisper.cpp.

Thanks for pointing it out. Unfortunately they don't support GPU, and transcribing a 4s audio on a 48-thread CPU took 32s using the medium-size model in a first run/test (the large model should be 2x slower). Strangely, the second run took 73s and a third run took 132s...

MariasStory commented 1 year ago

Strange. On my Ubuntu Linux, in a Docker container, the compiled whisper.cpp ./main runs the large model (~2.9 GB) on 4 CPU cores at about 4x the recorded time. The small model runs at about real time.

Create and use an image for running with Docker:

# build whisper.cpp inside a container and move the large model to the host
docker run --name whisper -it -v $(pwd)/:/host python /bin/bash -c "git clone https://github.com/ggerganov/whisper.cpp.git && cd whisper.cpp && make large && mv models/ggml-large.bin /host/"
# save the built container as a reusable image
docker commit whisper whisper:latest
docker rm whisper
# convert the input to 16 kHz mono PCM (the format whisper.cpp expects), then transcribe
ffmpeg -y -i test1.mp3 -ar 16000 -ac 1 -c:a pcm_s16le temp.wav && sudo docker run -it --rm --name whisper -v $(pwd)/:/host --network none whisper /whisper.cpp/main -m /host/ggml-large.bin -f /host/temp.wav -l de -oj -of /host/test1

lfcnassif commented 1 year ago

Another optimized implementation to be tested; they say it is 4x faster than the original OpenAI model on the GPU: https://github.com/guillaumekln/faster-whisper

lfcnassif commented 1 year ago

https://github.com/guillaumekln/faster-whisper

The project claims to transcribe a 13min audio in ~1min using a Tesla V100S (an older GPU than ours); that's just ~3x slower than the 1B-parameter wav2vec2 model we use on the RTX 3090. Given the 4.5x speed-up they reached, which is incompatible with my past tests that showed a 250x slowdown when switching from 1B wav2vec2 to the whisper large model, I'll try to run the performance tests again...

lfcnassif commented 1 year ago

Another promising one: https://github.com/sanchit-gandhi/whisper-jax

By processing audios in batches and using TPUs, it can give up to a 70x speed-up.

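From their README, usage would be roughly as below (untested here; note the class name really is spelled FlaxWhisperPipline, and dtype/batch_size are the knobs behind the claimed speed-up):

import jax.numpy as jnp
from whisper_jax import FlaxWhisperPipline

# half precision + batched inference over 30s chunks is where the speed-up comes from
pipeline = FlaxWhisperPipline("openai/whisper-large-v2", dtype=jnp.bfloat16, batch_size=16)
text = pipeline("audio.mp3")
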
DHoelz commented 11 months ago

Hi @lfcnassif

We've been trying to use wav2vec2 to transcribe our audios, but the results we were getting were a bit disappointing, as the transcription was often barely readable, especially compared to Azure (which we don't have a contract with).

For that reason, we looked for other options and found OpenAi’s Whisper project.

Although slower than wav2vec, the results were A LOT better, comparable to Azure's transcription.

For our tests I tried the Whisper and Faster-Whisper implementations (and will probably try Whisper-JAX later, although we don't have a graphics card with a TPU). The model used was pierreguillou/whisper-medium-portuguese, which according to the author has a WER of only 6.5987.

The tests were done on an HP Z4 with a Xeon W-2175, 64 GB of RAM and a Quadro P4000. The audio sample is only 42 seconds long.

Wav2Vec2: 3,7 s eu sei peu tenho ciência disso e...eu sei peu tenho ciência disso e você sabe dasse toda todo então eu stôu correndo é pra passar dinheiro pra você eu não tenho conversa pra você tedo é passar dinheiro eu tenho que passando dinheiro passando dinheiro passando dinheiro isso aí é porque go de falei ter uma coisa pra pra ser resolvido até acho que até o final do mês resolve não é nada grande não mas aí o cara já vai me adiantar e prca meno de mil e quinhentos e dois mil e depois eu passo pra ele mas eu sto te passando mil quinhentos e dois mil e pretendo passar mais aí logo logo tá stô falando é é logo logo mesmo não vai parar por aí não só me falei qual que é o bancoe

Whisper: 23,02 s Eu sei, eu tenho ciência disso e você sabe da história toda. Eu tô correndo é para passar dinheiro para você, eu não tenho conversa para você, entendeu? é passar dinheiro, eu tenho que ir passando dinheiro, passando dinheiro, passando dinheiro. Isso aí é porque, como eu te falei, tem uma coisa para ser resolvida até o final do mês resolve? não é nada grande não, mas aí o cara já vai me adiantar a pelo menos mil e quinhentos dois mil e depois eu passo para ele. Mas eu tô te passando mil e quinhentos dois mil e pretendo passar mais aí logo logo, tô falando é logo logo mesmo, não vai parar por aí não, só me fale qual é o banquinho.

Faster-Whisper: 8,59 s Eu sei, eu tenho ciência disso. E você sabe da história toda, então eu tô correndo é para passar dinheiro para você, eu não tenho conversa para você, entendeu? é para passar dinheiro, eu tenho que ir passando dinheiro, passando dinheiro, passando dinheiro, isso aí é porque, como eu te falei, tem uma coisa para ser resolvida até, acho que até o final do mês resolve, não é nada grande não, mas aí o cara já vai me adiantar, pelo menos meia quinhentos, dois mil e depois eu vou por ele, mas eu tô te passando meia quinhentos, dois mil e pretendo passar mais aí logo, logo, tô falando é logo, logo mesmo, não vai parar por aí não, só me fale qual é o banquê.

Azure: Eu sei, pô, tenho ciência disso....Eu sei, pô, tenho ciência disso. E você sabe da história toda, então. Eu tô correndo é pra passar dinheiro para você. Eu não tenho conversa pra você, entendeu? É passar dinheiro, eu tenho que passar no dinheiro, passando dinheiro, passando dinheiro, isso aí é porque? Gosto de falei, tem uma coisa para para ser resolvida. Até acho que até o final do mês resolve. Não é nada grande, não, mas aí o cara já vai me adiantar aí pelo −1500 2000 e depois eu passo para ele. Mas eu estou te passando 1502 1000 e pretendo passar mais aí logo logo, tá, tô falando, é. É logo logo mesmo, não vai parar por aí não. Só me fala aí, qual que é o banco aí?

It would be nice to have Whisper as an option in IPED, as it's free, runs locally (no need to send data to the cloud), has punctuation (which makes reading considerably easier), and the results are comparable to Azure's service.

lfcnassif commented 11 months ago

We’ve been trying to use wav2vec2 to transcribe our audios but the results we were getting was a bit disappointing

What model have you used? Have you used jonatasgrosman/wav2vec2-xls-r-1b-portuguese ?

Although slower than wav2vec the results were A LOT better, comparable to Azure’s transcription.

Have you measured WER on your data set? How many audios do you have, and what is the total duration? If you can help compare whisper models properly to the wav2vec2 models, I can send you the public data sets used in this study: https://user-images.githubusercontent.com/7276994/183307766-cec85345-bd28-44a8-91ec-20451ff50f19.png

The model used was pierreguillou/whisper-medium-portuguese, that according to the author has a WER of only 6.5987.

On what data set?

QUADRO P4000.

So you have a GPU without TPUs, right?

The audio sample has only 42 seconds.

Well, I think it is not enough to represent the variability we can find in seized data sets... Anyway, have you computed WER on this 42-second audio, so we can also have an objective measure instead of just feelings (which are also important)?

has punctuation (which makes reading considerably better)

I understand this is an advantage not captured by traditional WER...

DHoelz commented 11 months ago

What model have you used? Have you used jonatasgrosman/wav2vec2-xls-r-1b-portuguese ?

We tried both large and small models (from jonatasgrosman and edresson). They had similar results, but the large one took a lot longer to transcribe.

Have you measured WER on your data set? How many audios do you have, what is the total duration? If you can help to compare whisper models properly to wav2vec2 models, I can send you the public data sets used in this study:

We didn't measure the WER. That value was reported by the model owner: https://huggingface.co/pierreguillou/whisper-medium-portuguese

Our tests were made using 6000 audios from an exam, roughly adding up to 1000 minutes (or 16.9 hours). All audios were transcribed using wav2vec2 (large model), azure, whisper and faster-whisper. Using whisper took around 11 hours, and faster-whisper around 5 hours. The 42-second audio transcription I sent was only an example of how much better whisper is compared to wav2vec, and how close it can be to Azure's service.

Based on every test we made with wav2vec (on several exams), we concluded that it was simply better not to send the transcriptions, as many times they were unreadable.

And as a note, I probably didn't implement whisper and faster-whisper in the best way possible, meaning there is probably room for improvement in speed.

About the dataset, I could try testing it. I don't know how these datasets are made and whether they include the "kind" of audio we normally need to transcribe. Let's say it's a multitude of forms of Portuguese.

On what data set?

According to the author: common_voice_11_0 dataset

So you have a GPU without TPUs, right?

Yes, no TPU here.

Well, I think it is not enough to represent the variability we can find in seized data sets... Anyway, have you computed WER on this 42 seconds audio so we can also have an objective measure instead of just feelings (which is also important).

As I said before, unfortunately just feelings. But the general feeling here is that it's way better :-)

leosol commented 11 months ago

This feature would be very interesting for those who do not have a contract for audio transcription with third parties (which I believe is the majority of Brazilian states). Even though it might be a time-consuming process, often the transcription ends up being included in the body of the report, and something more accurate would be very welcome, especially since in most frequent cases the transcription is done on a small set of chats.

lfcnassif commented 11 months ago

Hi @DHoelz and @leosol, thanks for your contributions to the project.

We tried both large and small models (from jonatasgrosman and edresson). They had similiar results

Well, looking at the numbers of the tests I referenced, I think 25% fewer errors is a reasonable difference. Of course this can change depending on the data set...

The 42 seconds audio transcription I sent was only an example of how better whisper is compared to wav2vec, and how close it can be to Azure’s Service.

It looks better for this audio, but without the gold standard I can't come to any scientific conclusion about which model is better. I also refer to Whisper, Faster-Whisper and Whisper-JAX: which is better? Please also notice there is an open enhancement for wav2vec2 (#1312) to avoid wrong (out-of-vocabulary) words.

Based on every test we made with wav2vec (on several exams), we concluded that it was simply better not to send the transcriptions, as many of the times they were unreadable.

Well, our users are quite satisfied; of course, if we can provide better results in an acceptable response time, that's good, that's the goal of this ticket. How have you tested Wav2vec2: using IPED or externally in a standalone application?

According to the author: common_voice_11_0 dataset

Common Voice cuts are usually easy data sets; CORAA is a much more difficult Portuguese one. It would be interesting to evaluate the author's model on CORAA.

This feature would be very interesting for those who do not have a contract for audio transcription with third parties

We also don't have a commercial contract here; that's why I integrated Vosk and, later, Wav2Vec2.

In summary, this ticket is about evaluating Whisper models using an objective metric on the same data sets where we evaluated the other models. We can use a more difficult real-world data set, running all models again, if you are able to share the audios and their correct transcriptions validated by humans. If we come to a well-founded conclusion that it is better on different data sets without a huge performance penalty (maybe 2x-3x would be acceptable), I'll add the implementation when I have available time...

Of course contributions to implement it into IPED are very welcome, please send a PR, I'll be happy to test and review.

leosol commented 11 months ago

Thanks @lfcnassif for your attention! I think we can work on a PR so that we have this extra option to use with IPED. Our feeling is that this implementation would be very useful. Meanwhile, I guess we might be able to help with the validation of the method using existing datasets and maybe create new ones, focused on our special case... this might require some extra time but for sure would be very helpful. Thanks again!

lfcnassif commented 11 months ago

Meanwhile, I guess we might be able to help with the validation of the method using existing datasets and maybe create new ones, focused on our special case... this might require some extra time but for sure would be very helpful.

Thanks @leosol. There is also an open ticket for this: #1313, still not started...

All audios were all transcribed using wav2vec2 (large model), azure, whisper and faster-whisper. Using whisper took around 11 hours and with faster-whisper around 5 hours.

Hi @DHoelz, do you have the transcription time for each Wav2Vec2 model?

About the Whisper transcriptions, were you able to compute the "confidence score" of transcriptions, like we do for Wav2Vec2? I didn't see it in the docs, but I just took a quick look in the past. I think it is an important feature to keep.

lfcnassif commented 11 months ago

I'm running the faster-whisper medium model, int8 precision, beam_size=5 on a 48-thread CPU over our ~29h public test set; the estimated time to finish is ~1 day. Tomorrow, if I have time, I will pause it and try to run on the RTX 3090 GPU.

DHoelz commented 11 months ago

Hi @lfcnassif, sorry for the delay.

Well, looking at the numbers of the tests I referenced, I think 25% less errors are a reasonable difference. Of course this can change depending on the data set...

There's also the larger whisper version trained for Portuguese, with an even better WER (jonatasgrosman/whisper-large-pt-cv11).

[image: reported WER for jonatasgrosman/whisper-large-pt-cv11]

Same example as before: Faster-Whisper (Large Model): 16.66 s

Eu sei, pô, eu tenho ciência disso. E... você sabe da história toda, então... Eu tô correndo é pra passar dinheiro pra você, eu não tenho conversa pra você, entendeu? É passar dinheiro. Eu tenho que ir passando dinheiro, passando dinheiro, passando dinheiro. Isso aí é porque... igual eu te falei, tem uma coisa pra ser resolvida até... acho que até o final do mês resolve? Não é nada grande não, mas aí o cara já vai me adiantar pelo menos um mil e quinhentos, dois mil e depois eu passo pra ele. Mas eu tô te passando um mil e quinhentos, dois mil e pretendo passar mais aí logo logo, tá? Tô falando é... é logo logo mesmo. Não vai parar por aí não. Só me fala aí qual que é o banco aí.

It looks better for this audio, but without the gold standard, I can't come to any scientific conclusion about which model is better. I also refer to Whisper, Faster-Whisper, Whisper-JAX, which is better? Please also notice there is an open enhancement for wav2vec2 #1312 to avoid wrong words (out of vocabulary).

Tried running the TEDx dataset, but it's too large and I can't occupy my machine for so long right now. I'll try it again later, or maybe see if I can find another unused machine. About whisper, faster-whisper or whisper-jax: whisper seems too slow, and in the tests we made comparing it to faster-whisper there wasn't much of a difference in the transcriptions, but the speed was at least 2x faster. About whisper-jax, I couldn't make it work properly, so I can't give you feedback right now.

How have you tested Wav2vec2, using IPED or externally in a standalone application?

All tests with Wav2vec2 were made using IPED. Tests with whisper and faster-whisper were made using a python script (which could probably be a lot better).

Hi @DHoelz, do you have the transcription time for each Wav2Vec2 model?

They were made on another machine. I'll try to fetch them later.

About the Whisper transcriptions, were you able to compute the "confidence score" of transcriptions, like we do for Wav2Vec2?

At first glance I couldn't find a score either. I'll try to look into it.

I'm running fast-whisper medium model, int8 precision, beam_size=5 on a 48 threads CPU over our ~29h public test set, estimated time to finish is ~1day. Tomorrow, if I have time, I will pause and try to run on RTX3090 GPU.

I converted both medium and large models with float16 precision and used beam_size=5 for my tests.

DHoelz commented 11 months ago

Quick update,

I created a small python script for IPED to run faster-whisper on that 6000-audio (~16.11 h) sample.

The model used was the medium one, with beam_size=5 and float32. It took 4.28 hours to process everything. Running with Wav2vec2 using the large model took a little bit over 1 hour.

About the score, you can get a word-by-word score by passing the option word_timestamps=True: https://github.com/guillaumekln/faster-whisper/issues/52

On that 42 s audio sample, the transcription time rose from 8.59 s to 9.17 s when calculating the score (6.75% more time).

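For reference, a minimal sketch of how those word probabilities could be folded into a single confidence score like the one IPED stores for Wav2Vec2 (the plain averaging is my assumption, not an official metric; faster-whisper's Word objects expose a probability field):

from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cuda", compute_type="float16")

# word_timestamps=True makes each segment carry Word objects with start/end/probability
segments, info = model.transcribe("audio.wav", language="pt", beam_size=5, word_timestamps=True)

text = ""
probs = []
for segment in segments:  # segments is a generator, transcription runs lazily here
    text += segment.text
    probs.extend(word.probability for word in segment.words)

# naive overall confidence: mean of the word probabilities
score = sum(probs) / len(probs) if probs else 0
print(score, text)
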
I tried to populate the WhatsApp chat with the transcriptions but had no success. I created the 2 metadata fields (transcription and score) and even created that transcriptions.db. Should I have done something else? (The task runs just after the transcription task and before the parsing task.)

lfcnassif commented 11 months ago

Hi @DHoelz thanks for the updates.

The processing times I'm getting look really strange compared to yours. With the Faster-Whisper medium model, int8 precision and beam=5, I'm getting roughly 1h of transcription for each hour of audio on an HP Z8 with 2 Intel(R) Xeon(R) Silver 4214 CPUs @ 2.20GHz, giving me a total of 48 logical cores... Maybe Faster-Whisper can benefit from some CPU instructions not present in my machine...

Unfortunately I had to stop the test because I had to use my machine for another task, and I didn't have time to install the RTX 3090 GPU; I will do it next week.

Here is a quick and dirty script I changed yesterday to run the test. You can simply replace the IPED-4.1.3/scripts/tasks/Wav2Vec2Process.py content and everything should work (except scores), including embedding transcriptions into the chats.

import sys
stdout = sys.stdout
sys.stdout = sys.stderr

terminate = 'terminate_process'
model_loaded = 'wav2vec2_model_loaded'
huggingsound_loaded = 'huggingsound_loaded'
finished = 'transcription_finished'
ping = 'ping'

def main():

    modelName = 'medium'
    #modelName = sys.argv[1]

    deviceNum = sys.argv[2]

    import os
    # one transcription process is spawned per physical CPU, so set this to
    # the number of logical cores per CPU (see PS2 below)
    os.environ["OMP_NUM_THREADS"] = "24"

    from faster_whisper import WhisperModel

    # protocol token expected by the Java parent process after the library loads
    print(huggingsound_loaded, file=stdout, flush=True)

    #import torch
    #cudaCount = torch.cuda.device_count()

    # Run just on CPU for now
    cudaCount = 0

    print(str(cudaCount), file=stdout, flush=True)

    if cudaCount > 0:
        deviceId = 'cuda:' + deviceNum
    else:
        # 'cpu' (not 'cpu:') is the device name expected by faster-whisper
        deviceId = 'cpu'

    try:
        model = WhisperModel(modelName, device=deviceId, compute_type="int8")

    except Exception as e:
        if deviceId != 'cpu':
            # loading on GPU failed (OOM?), try on CPU
            deviceId = 'cpu'
            model = WhisperModel(modelName, device=deviceId, compute_type="int8")
        else:
            raise e

    print(model_loaded, file=stdout, flush=True)
    print(deviceId, file=stdout, flush=True)

    while True:

        line = input()

        if line == terminate:
            break
        if line == ping:
            print(ping, file=stdout, flush=True)
            continue

        transcription = ''
        try:
            segments, info = model.transcribe(audio=line, language='pt', beam_size=5)
            for segment in segments:
                transcription += segment.text

        except Exception as e:
            msg = repr(e).replace('\n', ' ').replace('\r', ' ')
            print(msg, file=stdout, flush=True)
            continue

        text = transcription.replace('\n', ' ').replace('\r', ' ')

        # TODO Compute this correctly
        finalScore = 1

        print(finished, file=stdout, flush=True)
        print(str(finalScore), file=stdout, flush=True)
        print(text, file=stdout, flush=True)

    return

if __name__ == "__main__":
    main()

PS1: Scores are still not being computed above and are set to 1.
PS2: IPED opens 1 process running the script above for each physical CPU you have, so you should update OMP_NUM_THREADS above according to the number of logical cores per CPU.
PS3: I changed the code of Wav2Vec2TranscriptTask, so you need to enable it in AudioTranscriptConfig.txt.
PS4: The code above needs faster-whisper installed inside IPED's portable python: pip install faster-whisper

lfcnassif commented 11 months ago

The dirty script above should also work for a cluster setup using Wav2Vec2 today (except scores), if anyone is interested.

DHoelz commented 11 months ago

The processing times I'm getting look really strange compared to yours. With the Faster-Whisper medium model, int8 precision and beam=5, I'm getting roughly 1h of transcription for each hour of audio on an HP Z8 with 2 Intel(R) Xeon(R) Silver 4214 CPUs @ 2.20GHz, giving me a total of 48 logical cores... Maybe Faster-Whisper can benefit from some CPU instructions not present in my machine

I'm using a Quadro P4000 for the transcriptions (GPU) and not the CPU.

Here is a quick and dirty script I changed yesterday to run the test. You can simply replace the IPED-4.1.3/scripts/tasks/Wav2Vec2Process.py content and everything should work, including embedding transcriptions into the chats.

I tried something very similar, but I got an error with the "ping", so the process kept getting killed.

I will try it tomorrow. Thanks 👍

PS2: IPED opens 1 process running the script above for each physical CPU you have, so you should update OMP_NUM_THREADS above according to the number of logical cores per CPU.
PS3: I changed the code of Wav2Vec2TranscriptTask, so you need to enable it in AudioTranscriptConfig.txt.

I will try to make it work with the GPU. If I can't make it work then I'll use the CPUs.

lfcnassif commented 11 months ago

I'm using a Quadro P4000 for the transcriptions (GPU) and not the CPU.

Hum OK, thanks.

lfcnassif commented 11 months ago

I tried something very similar, but I got an error with the "ping", so the process kept getting killed.

Just saw my logs, I also got some dozens of the "Fail to ping transcription process pid=XXX" messages, and the process PID keeps changing. In the UI, some audios out of thousands have an empty transcription. This may mean the python process is eventually crashing because of some bug in the faster-whisper implementation; I don't remember this happening with the standard Whisper implementation when I tested months ago.

DHoelz commented 11 months ago

With my script I only had one audio with an error. Every other audio worked OK.

Tomorrow I'll send you the script.

It's probably very poorly written, but for a first test it might be OK.

lfcnassif commented 11 months ago

Thanks. Maybe faster-whisper writes to stdout and is breaking the communication with the java process; I'll try to debug the code above.

lfcnassif commented 11 months ago

Maybe faster-whisper writes to stdout

Seems it's not the case.

I also noticed some audios with poor quality were transcribed into the wrong language. I just updated the script above to use the 'pt' language.

lfcnassif commented 11 months ago

PS: I'm using the default 'medium' model, not a Portuguese-specific one for now. Whisper should work for many languages without fine-tuning according to the original paper, which is a strong indication it should generalize better to unseen audios.

DHoelz commented 11 months ago

I just added to Hugging Face the two Portuguese models I'm using, already converted to CT2, so they can be downloaded directly from Hugging Face's website.

Medium: dwhoelz/whisper-medium-pt-ct2 https://huggingface.co/dwhoelz/whisper-medium-pt-ct2/tree/main

Large: dwhoelz/whisper-large-pt-cv11-ct2 https://huggingface.co/dwhoelz/whisper-large-pt-cv11-ct2

If it says that it only accepts models named tiny, small, etc., update your faster-whisper, as this seems to have been implemented only a few weeks ago.

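For anyone wanting to convert another Hugging Face Whisper checkpoint themselves, CTranslate2 ships a converter CLI, so the conversion was probably something along these lines (the output dir and quantization below are illustrative):

pip install ctranslate2 transformers[torch]
ct2-transformers-converter --model pierreguillou/whisper-medium-portuguese --output_dir whisper-medium-pt-ct2 --copy_files tokenizer.json --quantization float16
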
lfcnassif commented 11 months ago

I just updated the script above to use 'pt' language.

After this change, the "ping" error decreased a lot, but it still happened 14 times after processing 10k audios. The run will finish in 18h, so I will be able to compute WER for all 7 public Portuguese data sets I have been using to test different models. After I install the RTX 3090 next week, I should be able to test other whisper implementations/models faster.

lfcnassif commented 11 months ago

PS: I'm using the default 'medium' model, not a Portuguese-specific one for now. Whisper should work for many languages without fine-tuning according to the original paper, which is a strong indication it should generalize better to unseen audios.

My concern about using a Portuguese fine-tuned version is tuning it to a specific data set (like Common Voice), affecting the generalization of the model to unseen audios. But after installing the GPU I should be able to evaluate Portuguese-tuned models too.

lfcnassif commented 11 months ago

For example, see Jonatas Grosman's whisper model (fine-tuned on Common Voice 11) evaluated on the Fleurs data set, which was not used for training; in some evaluation scenarios the result was worse than the original whisper model: [image: WER comparison]

lfcnassif commented 11 months ago

Here are the WER results of Faster-Whisper with the default medium model, int8 precision and beam=5: [image: WER results table]

It was the best model so far on Lapsbm, MLS and TEDx, the last being a difficult one. But the accuracy on CORAA, the most difficult test set I have, was worse than the others'. Since CORAA is a big data set, that caused the weighted WER to increase a lot...

To compute WER, I converted all chars to lowercase and removed all punctuation from the Whisper results, since they are not present in the "trusted" transcriptions. Whisper also returns number symbols instead of their long written form, which affects WER computation on the sets I have. I converted Whisper number outputs to their long written form on the Common Voice results, and that decreased WER by 0.005, not that much. I'll try to do the same for the CORAA results to check if it makes a significant difference.

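For reference, a small sketch of this normalization with WER computed via the jiwer package (jiwer and the sample strings are my choices, not necessarily what was used here):

import re
import string

import jiwer

def normalize(text: str) -> str:
    # lowercase and strip punctuation so formatting differences don't count as errors
    text = text.lower().translate(str.maketrans('', '', string.punctuation))
    return re.sub(r'\s+', ' ', text).strip()

reference = 'Mas aí o cara já vai me adiantar.'
hypothesis = 'mas aí o cara já vai me adiantar'

print(jiwer.wer(normalize(reference), normalize(hypothesis)))  # 0.0 after normalization
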
My planned next steps:

If any model above gives good WER results, I'll try to run and evaluate Whisper-JAX

I also want to revisit #1312, it can bring a 2%-3% WER improvement to Wav2Vec2 based on what I have read.

PS: Tests 3 to 8 will be run with float precision

lfcnassif commented 11 months ago

I'm not sure what is causing the "Fail to ping" error, but after adding some printStackTraces, it is always thrown by this flush: https://github.com/sepinf-inc/IPED/blob/ffbcac00d20f83d5e1346a38ef47c9e56b181f19/iped-engine/src/main/java/iped/engine/task/transcript/Wav2Vec2TranscriptTask.java#L209

The write before it works. Using the Wav2Vec2 algorithm, the ping error doesn't happen to me.

lfcnassif commented 11 months ago
  1. Run Whisper default implementation with the TINY model to check PING errors.

Ping errors also happen with the standard Whisper implementation.

lfcnassif commented 11 months ago

Results of Faster-Whisper with the default medium model, float16 precision and beam_size=5: [image: WER results table]

I put back the original Common Voice WER until I normalize the numbers in all datasets' trusted transcriptions to Whisper's expected output format.

I installed one RTX 3090 in my machine earlier. Running time decreased from ~30h on the dual-CPU setup to 3.5h. Faster-Whisper + medium model is still ~10 times slower than the 1-billion-parameter Wav2Vec2 model.

Now running Faster-Whisper + pierreguillou/whisper-medium-portuguese converted to CTranslate2 by @DHoelz...

lfcnassif commented 11 months ago

Results of Faster-Whisper on pierreguillou's medium pt model: [image: WER results table]

Running Faster-Whisper with default large-v2 model...

lfcnassif commented 11 months ago

Taking a closer look at the results on the SID data set, number format is affecting WER a lot; SID has many numbers in its transcriptions. PierreGuillou's model improved over the previous 2 because it gives long number descriptions many times, mixed with number symbols. The previous 2 give just number symbols. Normalizing numbers should make a big difference on SID.

lfcnassif commented 11 months ago

Taking a closer look at the results on the SID data set, number format is affecting WER a lot; SID has many numbers in its transcriptions. PierreGuillou's model improved over the previous 2 because it gives long number descriptions most times, mixed with number symbols. The previous 2 give just number symbols. Normalizing numbers should make a big difference on SID.

The same thing explains why Pierre Guillou's model performed very well on CommonVoice set.

lfcnassif commented 11 months ago

WER results after running all planned models/implementations on the data sets I collected (lower is better): [image: WER results table]

Observations so far:

  1. A yellow background means the model used the data set's training slice for training;
  2. Jonatas Grosman's Whisper model fine-tuned on the Common Voice 11 PT slice was the best model so far. It was the best on 5 of 7 sets. It improved a lot on many sets compared to his own Wav2Vec2 large model fine-tuned for Portuguese, except on the CORAA set. But his Wav2Vec2 model used CORAA for training while the fine-tuned Whisper one didn't. The weighted average improved just ~2% (probably because of CORAA, a big set), but the plain average improved a lot because it was better on most sets, although some are very small; that may point to better generalization;
  3. Results of Whisper and Faster-Whisper using the same large-v2 model were quite different; I don't know how to explain that;
  4. Some models performed badly on SID compared to others because that data set has a lot of numbers, and some models output number digits while others output the written text form. I plan to normalize number outputs to the same pattern and compute WER again (see the sketch below);
  5. Pierre Guillou's Portuguese model gave inconsistent number output, sometimes digits, sometimes text form, which may indicate a non-ideal fine-tuning. It was also quite worse than the default whisper medium model on MLS and TEDx (better on SID, but that was explained above). So I personally would use the default Whisper medium model instead of Pierre Guillou's;

PS: I reviewed some previous WER computations because the wrong punctuation removal done before was affecting results.

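If it helps, the digit-to-text normalization mentioned in item 4 could be done with something like the num2words package (an assumption on my part, not what was actually used):

import re

from num2words import num2words

def spell_out_numbers(text: str, lang: str = 'pt_BR') -> str:
    # replace each digit sequence with its written-out form, e.g. '1500' -> 'mil e quinhentos'
    return re.sub(r'\d+', lambda m: num2words(int(m.group()), lang=lang), text)
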
DHoelz commented 11 months ago

Hi @lfcnassif

Great to see you finished the tests.

There are a few posts on faster-whisper's GitHub page about differences in transcriptions between faster-whisper and whisper. I believe the one most related to your findings would be:

https://github.com/guillaumekln/faster-whisper/issues/10

(And this one also seems to be somewhat useful https://github.com/guillaumekln/faster-whisper/issues/280)

About Pierre Guillou's Portuguese model: it's the most downloaded one, but there are a few others that were also trained on Common Voice, with small differences. I really don't know how much difference that would make. They seem to have very close WERs.