huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Crash on google colab #20526

Closed GoldDRoge closed 1 year ago

GoldDRoge commented 1 year ago

System Info

Google Colab
transformers==4.20.0
https://github.com/kpu/kenlm/archive/master.zip
pyctcdecode==0.4.0

Who can help?

No response

Reproduction

from transformers.file_utils import cached_path, hf_bucket_url
from importlib.machinery import SourceFileLoader
from transformers import Wav2Vec2ProcessorWithLM
from IPython.lib.display import Audio
import torchaudio
import torch

# Load model & processor
model_name = "nguyenvulebinh/wav2vec2-large-vi-vlsp2020"
model = SourceFileLoader("model", cached_path(hf_bucket_url(model_name,filename="model_handling.py"))).load_module().Wav2Vec2ForCTC.from_pretrained(model_name)
processor = Wav2Vec2ProcessorWithLM.from_pretrained(model_name)

# Load an example audio (16k)
audio, sample_rate = torchaudio.load(cached_path(hf_bucket_url(model_name, filename="t2_0000006682.wav")))

input_data = processor.feature_extractor(audio[0], sampling_rate=16000)
# Infer
output = model(**input_data)

# Output transcript without LM
print(processor.tokenizer.decode(output.logits.argmax(dim=-1)[0].detach().cpu().numpy()))

# Output transcript with LM
print(processor.decode(output.logits.cpu().detach().numpy()[0], beam_width=100).text)

Expected behavior

Whenever I run the line input_data = processor.feature_extractor(audio[0], sampling_rate=16000), Google Colab restarts for an unknown reason. I really don't know whether this is a conflict between CPU and GPU.

sgugger commented 1 year ago

cc @sanchit-gandhi

sanchit-gandhi commented 1 year ago

Hey @GoldDRoge! So the issue lies with the processor.feature_extractor call method?

Could you provide a Google Colab link / reproducible code snippet I can run to get this error?

Looks like you're using local audio data. For the shared Colab link / reproducible code snippet, you can use this audio sample:

!pip install datasets

from datasets import load_dataset

librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")

sample = librispeech_dummy[0]["audio"]
audio = sample["array"]
sampling_rate = sample["sampling_rate"]

GoldDRoge commented 1 year ago

Thanks for the quick response. Here is the link: https://colab.research.google.com/drive/1UdedI76aBEMCqlLcj1uakIdRAoztrmg5?usp=sharing
OK, let me try. Thanks for your help @sanchit-gandhi

GoldDRoge commented 1 year ago

I tried what you suggested, but it still crashes whenever I run input_data = processor.feature_extractor(audio[0], sampling_rate=16000). I really don't know what the error is. @sanchit-gandhi

sanchit-gandhi commented 1 year ago

Hey @GoldDRoge! Sorry for the late reply! I was able to reproduce the error with your Google Colab. However, installing the latest version of transformers and pyctcdecode remedies the issue for me: https://colab.research.google.com/drive/1Za4340oWO5GMLlKvgEtvFO8vWVS4Fafy?usp=sharing

Could you try pip installing the latest version of transformers and pyctcdecode as highlighted? Let me know if the issue still persists!
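Before and after upgrading, it can help to confirm which versions the Colab runtime has actually loaded. The snippet below is a minimal sketch using only the standard library; the helper name installed_versions is hypothetical and not part of transformers or pyctcdecode.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return a dict mapping each package name to its installed version.

    Packages that are not installed map to None.
    (Hypothetical helper for illustration only.)
    """
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Report the two packages discussed in this thread.
for name, ver in installed_versions(["transformers", "pyctcdecode"]).items():
    print(f"{name}: {ver or 'not installed'}")
```

If the printed versions still match the ones listed under System Info, the runtime probably needs a restart after the pip install so that the upgraded packages are picked up.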

There is a 'warning' that is presented when using your Wav2Vec2ProcessorWithLM that is not present with the 'official' processor from the blog post:

WARNING:pyctcdecode.language_model:Only 0 unigrams passed as vocabulary. Is this small or artificial data?

Could you double check that your KenLM is built correctly? It's quite strange behaviour for the unigrams.txt file to be empty in the KenLM! This means that only sub-word tokens form your LM. https://huggingface.co/nguyenvulebinh/wav2vec2-large-vi-vlsp2020/tree/main/language_model
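One quick way to check the unigrams is to count the non-blank lines in unigrams.txt after downloading the language_model folder. This is only a sketch: the helper count_unigrams and the local path are assumptions for illustration, and it assumes the file holds one unigram per line.

```python
from pathlib import Path


def count_unigrams(path):
    """Count non-blank lines in a unigrams file.

    Returns 0 if the file does not exist.
    (Hypothetical helper; assumes one unigram per line.)
    """
    path = Path(path)
    if not path.is_file():
        return 0
    with path.open(encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


# Assumed local copy of the model's language_model/unigrams.txt.
n_unigrams = count_unigrams("language_model/unigrams.txt")
if n_unigrams == 0:
    print("No unigrams found: the file is empty or missing.")
else:
    print(f"Found {n_unigrams} unigrams.")
```

A count of zero would be consistent with the pyctcdecode warning above.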

sanchit-gandhi commented 1 year ago

Hey @GoldDRoge! Did updating to the latest version of transformers and pyctcdecode help with the issue? We should definitely verify that our KenLM is built correctly and is returning a non-zero list of unigrams! Let me know if you're encountering any problems running the updated code snippet, more than happy to help here! 🤗

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.