Closed GoldDRoge closed 1 year ago
cc @sanchit-gandhi
Hey @GoldDRoge! So the issue lies with the processor.feature_extractor call method?
Could you provide a Google Colab link / reproducible code snippet I can run to get this error?
Looks like you're using local audio data. For the shared Colab link / reproducible code snippet, you can use this audio sample:
!pip install datasets
from datasets import load_dataset
librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = librispeech_dummy[0]["audio"]
audio = sample["array"]
sampling_rate = sample["sampling_rate"]
Thanks for the quick response. Here is the link: https://colab.research.google.com/drive/1UdedI76aBEMCqlLcj1uakIdRAoztrmg5?usp=sharing
ok let me try. thanks for your help
@sanchit-gandhi
I tried what you suggested, but it still crashes whenever I run input_data = processor.feature_extractor(audio[0], sampling_rate=16000). I really don't know what the error is. @sanchit-gandhi
Hey @GoldDRoge! Sorry for the late reply! I was able to reproduce the error with your Google Colab. However, installing the latest version of transformers and pyctcdecode remedies the issue for me: https://colab.research.google.com/drive/1Za4340oWO5GMLlKvgEtvFO8vWVS4Fafy?usp=sharing
Could you try pip installing the latest version of transformers and pyctcdecode as highlighted? Let me know if the issue still persists!
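Before and after upgrading, it can help to confirm which versions the runtime actually sees. The following is a minimal sketch using only the standard library; the helper name installed_versions is hypothetical and not part of transformers or pyctcdecode.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # The two packages discussed in this thread.
    for name, ver in installed_versions(["transformers", "pyctcdecode"]).items():
        print(f"{name}: {ver or 'not installed'}")
```

In Colab, the runtime usually has to be restarted after a pip upgrade for an already-imported package to pick up the new version.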
There is a 'warning' that is presented when using your Wav2Vec2ProcessorWithLM that is not present with the 'official' processor from the blog post:
WARNING:pyctcdecode.language_model:Only 0 unigrams passed as vocabulary. Is this small or artificial data?
Could you double check that your KenLM is built correctly? It's quite strange behaviour for the unigrams.txt file to be empty in the KenLM! This means that only sub-word tokens form your LM. https://huggingface.co/nguyenvulebinh/wav2vec2-large-vi-vlsp2020/tree/main/language_model
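A quick way to check this locally is to count the entries in unigrams.txt. The sketch below assumes the file stores one unigram per line and has been downloaded from the language_model folder; the helper name count_unigrams and the local path are hypothetical.

```python
from pathlib import Path


def count_unigrams(path):
    """Count non-blank lines in a unigrams file (assumed: one unigram per line)."""
    with Path(path).open(encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


if __name__ == "__main__":
    # Hypothetical local path; adjust to wherever the language_model folder was saved.
    unigrams_path = Path("language_model") / "unigrams.txt"
    if unigrams_path.exists():
        print(f"{count_unigrams(unigrams_path)} unigrams found")
    else:
        print(f"{unigrams_path} not found")
```

A count of 0 would be consistent with the "Only 0 unigrams passed as vocabulary" warning above.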
Hey @GoldDRoge! Did updating to the latest version of transformers and pyctcdecode help with the issue? We should definitely verify that our KenLM is built correctly and is returning a non-zero list of unigrams! Let me know if you're encountering any problems running the updated code snippet, more than happy to help here! 🤗
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
Google Colab
transformers==4.20.0
https://github.com/kpu/kenlm/archive/master.zip
pyctcdecode==0.4.0
Who can help?
No response
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
Expected behavior
Whenever I run input_data = processor.feature_extractor(audio[0], sampling_rate=16000), Google Colab restarts for an unknown reason. I really don't know whether this is a conflict between CPU and GPU.
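When a notebook kernel restarts without a traceback, one way to see what happened is to run the failing call in a separate Python process, so the crash is confined to the child and its exit status can be inspected. This is only a diagnostic sketch using the standard library; the helper name run_isolated is hypothetical, and the snippet passed to it must do its own imports and load the processor and audio itself.

```python
import subprocess
import sys


def run_isolated(snippet, timeout=300):
    """Run a Python snippet in a child process and report how it exited.

    The child is started with faulthandler enabled, so a hard crash
    (for example a segmentation fault) dumps a Python traceback to stderr.
    """
    result = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", snippet],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode, result.stdout, result.stderr


if __name__ == "__main__":
    # Placeholder snippet; replace it with the code that triggers the restart.
    code, out, err = run_isolated("print('feature extraction would run here')")
    print("return code:", code)
    print(out or err)
```

On Linux, a negative return code means the child was terminated by a signal, and the stderr output shows where in the Python code it happened.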