jonatasgrosman / huggingsound

HuggingSound: A toolkit for speech-related tasks based on Hugging Face's tools
MIT License

Possible issue when using a Hugging Face Portuguese language model #62

Open lfcnassif opened 2 years ago

lfcnassif commented 2 years ago

First, thank you very much for this great project. It makes ASR very easy!

And your models are awesome! I ran some accuracy tests with the https://huggingface.co/jonatasgrosman/wav2vec2-xls-r-1b-portuguese model (https://github.com/sepinf-inc/IPED/issues/1214#issuecomment-1207470644), and it is comparable to Microsoft's and Google's pt-BR models, actually a bit better!

Now I'm trying to use a language model as described in the Readme.md. I'm using the LM from the language_model folder of the Hugging Face model repository linked above, but it prints these warnings in the console:

09/02/2022 12:10:19 - WARNING - pyctcdecode.alphabet - Found entries of length > 1 in alphabet. This is unusual unless style is BPE, but the alphabet was not recognized as BPE type. Is this correct?
09/02/2022 12:10:19 - WARNING - pyctcdecode.alphabet - Unigrams and labels don't seem to agree.

Accuracy (measured by WER) also dropped a lot. Am I doing something wrong? Which language model is compatible with the Portuguese model above?
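
One way to narrow down the two warnings is to compare the model's vocabulary labels with the characters used in the LM's unigram list. The snippet below is only a minimal, self-contained sketch of that comparison, not pyctcdecode's actual check. The helper names (`multi_char_labels`, `uncovered_characters`) and the toy `labels` and `unigrams` values are hypothetical; in practice they would come from the model's vocabulary and the LM's unigrams file. The uppercase/lowercase mismatch shown is just an illustrative scenario, not a confirmed cause of this issue.

```python
from typing import Iterable, List, Set


def multi_char_labels(labels: Iterable[str]) -> List[str]:
    """Return labels longer than one character (e.g. special tokens)."""
    return [label for label in labels if len(label) > 1]


def uncovered_characters(labels: Iterable[str], unigrams: Iterable[str]) -> Set[str]:
    """Return characters that appear in the unigrams but not among the
    single-character labels of the acoustic model."""
    label_chars = {label for label in labels if len(label) == 1}
    unigram_chars = set("".join(unigrams))
    return unigram_chars - label_chars


# Hypothetical toy data: a lowercase vocabulary with special tokens,
# and an uppercase unigram list (an assumed mismatch, for illustration only).
labels = ["<pad>", "<s>", "</s>", "<unk>", "|", "a", "b", "c", "o", "s"]
unigrams = ["CASA", "BOA"]

# Labels of length > 1, the kind of entries the first warning mentions.
print(multi_char_labels(labels))

# Characters the vocabulary cannot produce with the unigrams as given.
print(sorted(uncovered_characters(labels, unigrams)))

# The same check after lowercasing the unigrams.
print(sorted(uncovered_characters(labels, [u.lower() for u in unigrams])))
```

If the second check returns many characters, the unigram list and the model vocabulary probably use different character sets or casing, which would be worth ruling out before comparing WER with and without the LM.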

Thanks in advance