Is there any intention to use some local audio transcription application, such as CMUSphinx or Vosk? https://cmusphinx.github.io/ https://github.com/alphacep/vosk-api
Yes, we plan to add an implementation using DeepSpeech: https://github.com/mozilla/DeepSpeech
But we want to train a Portuguese model first; that will need some work and a lot of training data...
Thank you.
Let's leave this open so it can be tracked.
There is also the wav2letter API, from Facebook: https://github.com/facebookresearch/wav2letter
I hadn't noticed that Vosk already has a Portuguese model and a Java binding. Has anyone tested the accuracy of that model, or of the other languages?
Just pushed initial local transcription code using Vosk. Results are OK to me for an initial implementation. The Vosk model should be put into the iped models folder.
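For anyone curious how the Vosk Java binding is used, here is a minimal sketch of this kind of local transcription (not the actual IPED code; the model path and the 16 kHz mono WAV input are assumptions; if I recall correctly the binding is on Maven Central as com.alphacephei:vosk):

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import javax.sound.sampled.AudioSystem;
import org.vosk.Model;
import org.vosk.Recognizer;

public class VoskTranscribeSketch {
    public static void main(String[] args) throws Exception {
        // Model directory, e.g. the contents of iped's models/vosk/pt-BR folder (assumed path)
        try (Model model = new Model("models/vosk/pt-BR");
             // The sample rate must match the audio being fed in (16 kHz assumed here)
             Recognizer recognizer = new Recognizer(model, 16000);
             InputStream audio = AudioSystem.getAudioInputStream(
                     new BufferedInputStream(new FileInputStream(args[0])))) {

            byte[] buffer = new byte[4096];
            int n;
            while ((n = audio.read(buffer)) >= 0) {
                // Feed raw PCM bytes; returns true at utterance boundaries
                recognizer.acceptWaveForm(buffer, n);
            }
            // Prints a JSON string like {"text": "..."} with the transcription
            System.out.println(recognizer.getFinalResult());
        }
    }
}
```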
Just merged the initial experimental implementation. Possible future improvements:
There is a link on the Vosk page to a project for training Portuguese models: https://github.com/falabrasil/kaldi-br
The Vosk project published a new, big 1.6 GB Portuguese model, and also accuracy numbers for both this new model and the old one; unfortunately the WER is quite high, i.e. accuracy is low (as we had suspected).
That also suggests the accuracy issue is related to the trained pt-BR model, not to the algorithm, since the English numbers look much better (of course, tested on different datasets).
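Just to avoid confusion when reading those numbers: WER is the word error rate, so lower is better. With $S$ substitutions, $D$ deletions and $I$ insertions relative to a reference transcript of $N$ words:

$$\mathrm{WER} = \frac{S + D + I}{N}$$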
But the new (bigger) model's accuracy is better than the one we had before, right? I guess it is too large to be distributed, but users may want to download and use this new model.
Yes! But probably slower too...
Friend, how can I use this 1.6 GB file in IPED? I am trying, but I see that IPED is using the small version (for Android?), and when I download the larger file it comes with different folders, despite containing almost the same files.
Thank you!
I think you just have to replace the Portuguese model in the iped-4.0.0/models/vosk/pt-BR folder with the new one, but I have not tested it yet. I can test it after I return from vacation next week.
Just tested the big 1.6 GB model on a small dataset of 300 audios. After replacing the model folder contents, you also must remove the "rescore" folder (not present in the English model, so I guessed it wasn't needed). I'm not sure about the side effects, but this fixes a java.lang.Error: Invalid memory access while loading the model with the currently used vosk-0.3.32 version, and a java.io.IOException: Failed to create a model with vosk-0.3.38.
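For quickly checking whether a replaced model folder loads at all, before running a whole case, something like this minimal sketch should work (assuming vosk 0.3.38, whose Model constructor throws IOException; the default path is an assumption):

```java
import java.io.IOException;
import org.vosk.Model;

public class ModelLoadCheck {
    public static void main(String[] args) {
        // Folder where the big model contents were placed (assumed IPED layout)
        String path = args.length > 0 ? args[0] : "iped-4.0.0/models/vosk/pt-BR";
        // With the "rescore" folder still present, loading failed as described above:
        // vosk-0.3.32 -> java.lang.Error: Invalid memory access
        // vosk-0.3.38 -> java.io.IOException: Failed to create a model
        try (Model model = new Model(path)) {
            System.out.println("Model loaded OK from " + path);
        } catch (IOException e) {
            System.err.println("Failed to load model: " + e.getMessage());
        }
    }
}
```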
The good news is that the huge model took the same time as the small model to transcribe that dataset: 78s on a 48-thread CPU.
That is wrong, my fault. Actually the big model is, surprisingly, faster than the small model (based on running times for a small dataset of 301 audios).
I'll run it on a larger dataset tomorrow to confirm that observation.
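For reference, timings like the ones above can be gathered with a rough harness along these lines (the directory argument and the transcribe() helper are hypothetical; each worker would need its own Recognizer, since recognizers are not shared safely across threads):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TranscribeBenchmark {

    public static void main(String[] args) throws Exception {
        // Directory containing the test audios (hypothetical; pass your own path)
        List<Path> audios;
        try (Stream<Path> files = Files.list(Paths.get(args[0]))) {
            audios = files.collect(Collectors.toList());
        }

        // One worker per hardware thread; 48 matches the CPU mentioned above
        ExecutorService pool = Executors.newFixedThreadPool(48);
        long start = System.nanoTime();
        for (Path audio : audios) {
            pool.submit(() -> transcribe(audio));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        long seconds = (System.nanoTime() - start) / 1_000_000_000L;
        System.out.println("Transcribed " + audios.size() + " audios in " + seconds + "s");
    }

    // Hypothetical helper: would decode the audio to PCM and feed it to a
    // per-thread Vosk Recognizer (the Model itself can be shared).
    static void transcribe(Path audio) {
        // ... Vosk transcription of 'audio' goes here ...
    }
}
```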
Running time with a dataset of ~1700 audios:
But a manual, informal accuracy comparison made by a colleague of mine seems to show the big model has worse accuracy than the small model on some real-case audios. Maybe the big model is more biased towards the datasets used for training, and generalizes worse to different datasets...
Great paper from 1 year ago about a wav2vec 2.0 model trained for pt-BR, with references to 470 hours of Portuguese datasets! https://arxiv.org/abs/2107.11414
Excerpt from their conclusion:
On average, our model obtained 10.5% and 12.4% of WER against the proposed test sets, with and without a language model, respectively. According to our knowledge, the model achieves state-of-the-art results among the open available E2E models for the target language.
Author's repo: https://github.com/lucasgris/wav2vec4bp
Seems they made the models publicly available :-)
Opened #1214 for those interested.