jonatasgrosman / huggingsound

HuggingSound: A toolkit for speech-related tasks based on Hugging Face's tools
MIT License
429 stars 43 forks

Transcriptions have no spaces - wav2vec2-xls-r-1b-spanish #51

Open santideleon opened 1 year ago

santideleon commented 1 year ago

I am working on speech-to-text for Spanish audio clips of up to ~135 seconds, recorded with lapel microphones or VR goggles. I am using wav2vec2-xls-r-1b-spanish together with the provided language model lm.binary and unigrams.txt. They are the ones downloaded from jonatasgrosman/wav2vec2-large-xlsr-53-spanish, but judging by the file sizes they seem to be identical to the 1b versions. I originally started with the large version, but I opted for 1b for better performance.

My plan is to run the text through pysentimiento's pre-trained Spanish sentiment and emotion analyzers. The problem I have is that the transcribed text has no spaces separating the words.

Is there a quick fix for this or any suggestions?

Example: alesundíamanormalparamímelevantosobrelasochodelamañana desayunasepredesayunoalomismodeayunosquirconceriales yfrutameduchomeevistoacosasenchilavoycaminandosube lacuestahastaelaparadadelautobustyietesperoquevenga autobusesestallevaalaparadadesanlorenzocojoelmetro

code:


from huggingsound import SpeechRecognitionModel, KenshoLMDecoder

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-xls-r-1b-spanish")
lm_path = "language_model/lm.binary"
unigrams_path = "language_model/unigrams.txt"
decoder = KenshoLMDecoder(model.token_set, lm_path=lm_path, unigrams_path=unigrams_path)

def process_single_audio(correct_path, sr=16000):

    # y, sr = librosa.load(str(path + correct_path), sr=sr)
    transcriptions = model.transcribe([str(correct_path)[1:]], decoder=decoder)

    print(transcriptions[0]['transcription'])

    return transcriptions[0]['transcription']
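A minimal sketch (pure Python, no model or audio involved) of what is likely going on: wav2vec2-style CTC vocabularies mark word boundaries with a special `|` token, and a decoding path that drops that token instead of mapping it to a space emits the words fused together, exactly like the transcription above. The token names and the toy sequence below are illustrative assumptions, not taken from the model.

```python
BLANK = "<pad>"   # CTC blank token (assumption: typical wav2vec2 vocab)
WORD_DELIM = "|"  # wav2vec2's word-delimiter token

def ctc_collapse(tokens, keep_spaces=True):
    """Greedy CTC post-processing: merge repeats, drop blanks,
    and optionally map the word delimiter to a space."""
    out = []
    prev = None
    for t in tokens:
        if t != prev and t != BLANK:
            if t == WORD_DELIM:
                if keep_spaces:
                    out.append(" ")
            else:
                out.append(t)
        prev = t
    return "".join(out)

# Toy token sequence a model might emit for "me levanto"
tokens = list("me") + [WORD_DELIM] + list("levanto")
print(ctc_collapse(tokens, keep_spaces=True))   # -> "me levanto"
print(ctc_collapse(tokens, keep_spaces=False))  # -> "melevanto", the symptom in this issue
```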
santideleon commented 1 year ago

This problem seems to be fixed by using the automatic-speech-recognition pipeline, both with and without chunking. I'm not really sure what is happening.

code:

from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="jonatasgrosman/wav2vec2-xls-r-1b-spanish",
    tokenizer="jonatasgrosman/wav2vec2-xls-r-1b-spanish",
    feature_extractor="jonatasgrosman/wav2vec2-xls-r-1b-spanish",
    decoder=decoder,
)

transcriptions = pipe(str(correct_path)[1:])

Additionally, I tested chunking in the pipeline. My first thought was that the length of the audio was the problem, but after testing different chunking parameters and then running without chunking, it worked perfectly either way. The only thing I would note is that chunking significantly increases processing time: I saw runs taking from twice to seven times as long. In terms of transcription accuracy, the slowest setting (10s chunks) seemed to work best, but it is not worth the computation time, since 30s chunks, which only doubled the processing time, were almost as good.

iljab commented 1 year ago

Same issue with the jonatasgrosman/wav2vec2-large-xlsr-53-german model.

arikhalperin commented 1 year ago

You should try to add a language model. See here: https://huggingface.co/blog/wav2vec2-with-ngram
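For reference, a hedged sketch of that language-model route with pyctcdecode, as in the linked blog post: the decoder needs the tokenizer's labels in vocabulary order. The vocab handling below (mapping wav2vec2's `|` delimiter to a space and the `<pad>` token to pyctcdecode's `""` blank) is an assumption about a typical wav2vec2 character vocab, not copied from the blog.

```python
def wav2vec2_labels(vocab):
    """Turn a tokenizer vocab (token -> id) into the ordered label list
    that pyctcdecode's build_ctcdecoder expects."""
    labels = [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])]
    return [" " if t == "|" else "" if t == "<pad>" else t for t in labels]

# With a real model (requires transformers, pyctcdecode, and a KenLM binary):
#   from transformers import Wav2Vec2Processor
#   from pyctcdecode import build_ctcdecoder
#   processor = Wav2Vec2Processor.from_pretrained("jonatasgrosman/wav2vec2-xls-r-1b-spanish")
#   labels = wav2vec2_labels(processor.tokenizer.get_vocab())
#   decoder = build_ctcdecoder(labels, kenlm_model_path="language_model/lm.binary")

print(wav2vec2_labels({"<pad>": 0, "|": 1, "a": 2, "b": 3}))  # -> ['', ' ', 'a', 'b']
```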

detongz commented 11 months ago

@santideleon Hi, I have the same issue with the wbbbbb/wav2vec2-large-chinese-zh-cn model.

Have you solved this problem?