Atticus1806 opened this issue 11 months ago
@sanchit-gandhi Yes, I think the code base needs to be updated. Even with the latest transformers I still hit the problem above. Ideally it should work at higher batch sizes too, since the chunking is done internally.
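For reference, this is roughly the batched call I would expect to work out of the box (a minimal sketch; the checkpoint, batch size, and audio path are placeholders, not the leaderboard's actual settings):

```python
from transformers import pipeline

# Sketch of the batched long-form transcription call that should "just work":
# chunk_length_s enables the pipeline's internal chunking, so inputs longer
# than 30 s are split automatically before being batched through the model.
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-tiny",  # placeholder checkpoint
    chunk_length_s=30,
    batch_size=16,  # placeholder batch size
)

result = pipe("audio.wav")  # placeholder audio path
print(result["text"])
```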
When running evaluations, I also got this error:

```
ValueError: Multiple languages detected when trying to predict the most likely target language for transcription. It is currently not supported to transcribe to different languages in a single batch. Please make sure to either force a single language by passing language='...' or make sure all input audio is of the same language.
```

I had to specify the language manually, which isn't necessary with the original Whisper models. I also don't understand why the language detection was inaccurate enough to raise this error in the first place.
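In case it helps anyone, this is the workaround I used to force the language (a minimal sketch; the checkpoint and audio path are placeholders):

```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-tiny",  # placeholder checkpoint
)

# Passing language via generate_kwargs skips the per-batch language
# detection step that raises the ValueError above.
result = pipe("audio.wav", generate_kwargs={"language": "english"})
print(result["text"])
```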
I am also getting frequent disconnects during evaluations:

```
".../open_asr_eval/lib/python3.11/site-packages/datasets/download/streaming_download_manager.py", line 351, in read_with_retries
```
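Since the failure is inside `read_with_retries`, raising the retry settings that function reads from `datasets.config` made the disconnects less fatal for me (a sketch; the values are arbitrary, and these module-level settings are what recent `datasets` versions use, so check your installed version):

```python
import datasets

# read_with_retries consults these module-level settings, so raising them
# makes streaming reads more tolerant of transient connection drops.
datasets.config.STREAMING_READ_MAX_RETRIES = 20  # retries per failed read
datasets.config.STREAMING_READ_RETRY_INTERVAL = 5  # seconds between retries
```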
I am also getting the same error on 4 of the datasets. This is when running the open_asr_leaderboard/transformers/run_whisper.sh script.
I am trying to reproduce the Whisper results for TEDLIUM with the provided run_whisper.sh script. What I noticed is that when I run e.g. Whisper tiny on TEDLIUM, the model crashes at some point with:
```
RuntimeError: The expanded size of the tensor (3000) must match the existing size (3254) at non-singleton dimension 1. Target sizes: [80, 3000]. Tensor sizes: [80, 3254]
```
I assume this is because the sequence is longer than 30 seconds, but for some reason this can't be handled with batching. While this is probably a problem on the Whisper side, I am wondering how the evaluation was done here. Was it reduced to batch_size=1? With that setting the evaluation works, but the WER I get is off by 0.03. How can this happen, given that the other evaluations I have run so far for the other datasets matched?
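For what it's worth, the sizes in the error are consistent with a roughly 32.5 s segment slipping past the 30 s truncation (a quick check, assuming the standard Whisper feature extractor settings of 16 kHz audio and a 160-sample hop; the segment duration is inferred from the error, not measured):

```python
# Whisper's log-mel features use a 160-sample hop at 16 kHz, i.e. 100 frames/s.
sampling_rate = 16_000
hop_length = 160
frames_per_second = sampling_rate / hop_length  # 100.0

# The encoder expects exactly 30 s of features:
expected_frames = round(30 * frames_per_second)  # 3000 -> the "target size"

# A ~32.54 s segment would instead produce:
actual_frames = round(32.54 * frames_per_second)  # 3254 -> the "tensor size"

print(expected_frames, actual_frames)  # 3000 3254
```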