Is there any intention to use some local audio transcription application, such as CMUSphinx or Vosk? https://cmusphinx.github.io/ https://github.com/alphacep/vosk-api
Yes, we plan to add an implementation using DeepSpeech: https://github.com/mozilla/DeepSpeech
But we want to train a Portuguese model first; that will need some work and a lot of training data...
Thank you.
Let's leave this open so it can be tracked.
There is also the wav2letter API, from Facebook: https://github.com/facebookresearch/wav2letter
I hadn't noticed that Vosk already has a Portuguese model and a Java binding. Has anyone tested the accuracy of that model, or of the other languages?
Just pushed initial local transcription code using Vosk. Results are OK to me for an initial implementation. The Vosk model should be put into the iped models folder.
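For anyone curious how the Vosk Java binding is used, here is a minimal sketch of this kind of local transcription (not the actual IPED code; the model path and the 16 kHz mono WAV input are assumptions; if I recall correctly the binding is on Maven Central as com.alphacephei:vosk):

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import javax.sound.sampled.AudioSystem;
import org.vosk.Model;
import org.vosk.Recognizer;

public class VoskTranscribeSketch {
    public static void main(String[] args) throws Exception {
        // Model directory, e.g. the contents of iped's models/vosk/pt-BR folder (assumed path)
        try (Model model = new Model("models/vosk/pt-BR");
             // The sample rate must match the audio being fed in (16 kHz assumed here)
             Recognizer recognizer = new Recognizer(model, 16000);
             InputStream audio = AudioSystem.getAudioInputStream(
                     new BufferedInputStream(new FileInputStream(args[0])))) {

            byte[] buffer = new byte[4096];
            int n;
            while ((n = audio.read(buffer)) >= 0) {
                // Feed raw PCM bytes; returns true at utterance boundaries
                recognizer.acceptWaveForm(buffer, n);
            }
            // Prints a JSON string like {"text": "..."} with the transcription
            System.out.println(recognizer.getFinalResult());
        }
    }
}
```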
Just merged the initial experimental implementation. Possible future improvements:
There is a link on the Vosk page to a project for training Portuguese models: https://github.com/falabrasil/kaldi-br
The Vosk project published a new, big 1.6 GB Portuguese model, and also accuracy numbers for both this new model and the old one; unfortunately the WER is quite high, i.e. accuracy is low (as we had suspected).
That also suggests the accuracy issue is related to the trained pt-BR model, not to the algorithm, since the English numbers look much better (of course, tested on different datasets).
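Just to avoid confusion when reading those numbers: WER is the word error rate, so lower is better. With $S$ substitutions, $D$ deletions and $I$ insertions relative to a reference transcript of $N$ words:

$$\mathrm{WER} = \frac{S + D + I}{N}$$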
But the new (bigger) model's accuracy is better than the one we had before, right? I guess it is too large to be distributed, but users may want to download and use this new model.
Yes! But probably slower too...
Friend, how can I use this 1.6 GB file in IPED? I am trying, but I see that IPED is using the small version (for Android?), and when I download the larger file it comes with different folders, despite containing almost the same files.
Thank you!
I think you just have to replace the Portuguese model in the iped-4.0.0/models/vosk/pt-BR folder with the new one, but I have not tested it yet. I can test it after I return from vacation next week.
Just tested the big 1.6 GB model on a small dataset of 300 audios. After replacing the model folder contents, you also must remove the "rescore" folder (not present in the English model, so I guessed it wasn't needed). I'm not sure about the side effects, but this fixes a java.lang.Error: Invalid memory access while loading the model with the currently used vosk-0.3.32 version, and a java.io.IOException: Failed to create a model with vosk-0.3.38.
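For quickly checking whether a replaced model folder loads at all, before running a whole case, something like this minimal sketch should work (assuming vosk 0.3.38, whose Model constructor throws IOException; the default path is an assumption):

```java
import java.io.IOException;
import org.vosk.Model;

public class ModelLoadCheck {
    public static void main(String[] args) {
        // Folder where the big model contents were placed (assumed IPED layout)
        String path = args.length > 0 ? args[0] : "iped-4.0.0/models/vosk/pt-BR";
        // With the "rescore" folder still present, loading failed as described above:
        // vosk-0.3.32 -> java.lang.Error: Invalid memory access
        // vosk-0.3.38 -> java.io.IOException: Failed to create a model
        try (Model model = new Model(path)) {
            System.out.println("Model loaded OK from " + path);
        } catch (IOException e) {
            System.err.println("Failed to load model: " + e.getMessage());
        }
    }
}
```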
The good news is that the huge model took the same time as the small model to transcribe that dataset: 78s on a 48-thread CPU.
That is wrong, my fault. Actually the big model is, surprisingly, faster than the small model (based on running times for a small dataset of 301 audios).
I'll run it on a larger dataset tomorrow to confirm that observation.
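For reference, timings like the ones above can be gathered with a rough harness along these lines (the directory argument and the transcribe() helper are hypothetical; each worker would need its own Recognizer, since recognizers are not shared safely across threads):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TranscribeBenchmark {

    public static void main(String[] args) throws Exception {
        // Directory containing the test audios (hypothetical; pass your own path)
        List<Path> audios;
        try (Stream<Path> files = Files.list(Paths.get(args[0]))) {
            audios = files.collect(Collectors.toList());
        }

        // One worker per hardware thread; 48 matches the CPU mentioned above
        ExecutorService pool = Executors.newFixedThreadPool(48);
        long start = System.nanoTime();
        for (Path audio : audios) {
            pool.submit(() -> transcribe(audio));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        long seconds = (System.nanoTime() - start) / 1_000_000_000L;
        System.out.println("Transcribed " + audios.size() + " audios in " + seconds + "s");
    }

    // Hypothetical helper: would decode the audio to PCM and feed it to a
    // per-thread Vosk Recognizer (the Model itself can be shared).
    static void transcribe(Path audio) {
        // ... Vosk transcription of 'audio' goes here ...
    }
}
```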
Running time with a dataset of ~1700 audios:
But a manual, informal accuracy comparison made by a colleague of mine seems to show the big model has worse accuracy than the small model on some real-case audios. Maybe the big model is more biased towards the datasets used for training, and generalizes worse to different datasets...
Great paper from 1 year ago about a wav2vec 2.0 model trained for pt-BR, with references to 470 hours of Portuguese datasets! https://arxiv.org/abs/2107.11414
Excerpt from their conclusion:
On average, our model obtained 10.5% and 12.4% of WER against the proposed test sets, with and without a language model, respectively. According to our knowledge, the model achieves state-of-the-art results among the open available E2E models for the target language.
Author's repo: https://github.com/lucasgris/wav2vec4bp
Seems they made the models publicly available :-)
Opened #1214 for those interested.