raise RuntimeError("Failed to load audio from {}".format(filepath))

mehrdad78 commented 2 years ago

System Info

I want to run run_speech_recognition_ctc.py, but I get an error when I run the Single GPU CTC script:

python run_speech_recognition_ctc.py \
    --dataset_name="common_voice" \
    --model_name_or_path="facebook/wav2vec2-large-xlsr-53" \
    --dataset_config_name="tr" \
    --output_dir="./wav2vec2-common_voice-tr-demo" \
    --overwrite_output_dir \
    --num_train_epochs="15" \
    --per_device_train_batch_size="16" \
    --gradient_accumulation_steps="2" \
    --learning_rate="3e-4" \
    --warmup_steps="500" \
    --evaluation_strategy="steps" \
    --text_column_name="sentence" \
    --length_column_name="input_length" \
    --save_steps="400" \
    --eval_steps="100" \
    --layerdrop="0.0" \
    --save_total_limit="3" \
    --freeze_feature_encoder \
    --gradient_checkpointing \
    --chars_to_ignore , ? . ! - \; \: \" “ % ‘ ” � \
    --fp16 \
    --group_by_length \
    --push_to_hub \
    --do_train --do_eval

The error:

raise RuntimeError("Failed to load audio from {}".format(filepath))
RuntimeError: Failed to load audio from /root/.cache/huggingface/datasets/downloads/extracted/05be0c29807a73c9b099873d2f5975dae6d05e9f7d577458a2466ecb9a2b0c6b/cv-corpus-6.1-2020-12-11/tr/clips/common_voice_tr_17346025.mp3

Who can help?

@patrickvonplaten @anton-l

Information

[X] The official example scripts
[ ] My own modified scripts

Tasks

[X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)

Reproduction

I just ran the steps described in the example folder.

Expected behavior

I just want to get the result.

mehrdad78 commented 2 years ago

feedback response with error code

I don't understand what you mean.

LysandreJik commented 2 years ago

Hey @mehrdad78, could you share the full stack trace?

mehrdad78 commented 2 years ago

Yes, sure. Here is my Colab notebook: https://colab.research.google.com/drive/1jNdztD-Kkk8MCkzPLlLXVr0Z2jSgpkM8?usp=sharing and the stack trace:

`_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=100,
evaluation_strategy=steps,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=2,
gradient_checkpointing=True,
greater_is_better=None,
group_by_length=True,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=0.0003,
length_column_name=input_length,
load_best_model_at_end=False,
local_rank=-1,
log_level=-1,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=/content/transformers/examples/pytorch/speech-recognition/wav2vec2-common_voice-ru-demo/runs/Aug01_10-37-50_87323b63b7db,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=500,
logging_strategy=steps,
lr_scheduler_type=linear,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=15.0,
optim=adamw_hf,
output_dir=/content/transformers/examples/pytorch/speech-recognition/wav2vec2-common_voice-ru-demo,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=16,
prediction_loss_only=False,
push_to_hub=True,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=/content/transformers/examples/pytorch/speech-recognition/wav2vec2-common_voice-ru-demo,
save_on_each_node=False,
save_steps=400,
save_strategy=steps,
save_total_limit=3,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
tf32=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_ipex=False,
use_legacy_prediction_loop=False,
warmup_ratio=0.0,
warmup_steps=500,
weight_decay=0.0,
xpu_backend=None,
)
Downloading builder script: 26.4kB [00:00, 24.1MB/s]       
Downloading metadata: 174kB [00:00, 88.1MB/s]        
Downloading and preparing dataset common_voice/ru (download: 3.40 GiB, generated: 4.88 GiB, post-processed: Unknown size, total: 8.29 GiB) to /root/.cache/huggingface/datasets/common_voice/ru/6.1.0/a1dc74461f6c839bfe1e8cf1262fd4cf24297e3fbd4087a711bd090779023a5e...
Downloading data: 100% 3.66G/3.66G [01:57<00:00, 31.0MB/s]
Dataset common_voice downloaded and prepared to /root/.cache/huggingface/datasets/common_voice/ru/6.1.0/a1dc74461f6c839bfe1e8cf1262fd4cf24297e3fbd4087a711bd090779023a5e. Subsequent calls will reuse this data.
08/01/2022 10:43:49 - WARNING - datasets.builder - Reusing dataset common_voice (/root/.cache/huggingface/datasets/common_voice/ru/6.1.0/a1dc74461f6c839bfe1e8cf1262fd4cf24297e3fbd4087a711bd090779023a5e)
remove special characters from datasets: 100% 23444/23444 [00:03<00:00, 7780.78ex/s]
remove special characters from datasets: 100% 8007/8007 [00:01<00:00, 7715.10ex/s]
https://huggingface.co/facebook/wav2vec2-large-xlsr-53/resolve/main/config.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmp2g_x442y
Downloading config.json: 100% 1.73k/1.73k [00:00<00:00, 2.68MB/s]
storing https://huggingface.co/facebook/wav2vec2-large-xlsr-53/resolve/main/config.json in cache at /root/.cache/huggingface/transformers/8508c73cd595eb416a1d517b90762416c0bc6cfbef529578079aeae4d8c14336.7581ed2ee0c677f1e933180df51bd1a668c4a2b6d5fd1297d32069373dac097c
creating metadata file for /root/.cache/huggingface/transformers/8508c73cd595eb416a1d517b90762416c0bc6cfbef529578079aeae4d8c14336.7581ed2ee0c677f1e933180df51bd1a668c4a2b6d5fd1297d32069373dac097c
loading configuration file https://huggingface.co/facebook/wav2vec2-large-xlsr-53/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/8508c73cd595eb416a1d517b90762416c0bc6cfbef529578079aeae4d8c14336.7581ed2ee0c677f1e933180df51bd1a668c4a2b6d5fd1297d32069373dac097c
Model config Wav2Vec2Config {
  "_name_or_path": "facebook/wav2vec2-large-xlsr-53",
  "activation_dropout": 0.0,
  "adapter_kernel_size": 3,
  "adapter_stride": 2,
  "add_adapter": false,
  "apply_spec_augment": true,
  "architectures": [
    "Wav2Vec2ForPreTraining"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 1,
  "classifier_proj_size": 256,
  "codevector_dim": 768,
  "contrastive_logits_temperature": 0.1,
  "conv_bias": true,
  "conv_dim": [
    512,
    512,
    512,
    512,
    512,
    512,
    512
  ],
  "conv_kernel": [
    10,
    3,
    3,
    3,
    3,
    2,
    2
  ],
  "conv_stride": [
    5,
    2,
    2,
    2,
    2,
    2,
    2
  ],
  "ctc_loss_reduction": "sum",
  "ctc_zero_infinity": false,
  "diversity_loss_weight": 0.1,
  "do_stable_layer_norm": true,
  "eos_token_id": 2,
  "feat_extract_activation": "gelu",
  "feat_extract_dropout": 0.0,
  "feat_extract_norm": "layer",
  "feat_proj_dropout": 0.1,
  "feat_quantizer_dropout": 0.0,
  "final_dropout": 0.0,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-05,
  "layerdrop": 0.1,
  "mask_channel_length": 10,
  "mask_channel_min_space": 1,
  "mask_channel_other": 0.0,
  "mask_channel_prob": 0.0,
  "mask_channel_selection": "static",
  "mask_feature_length": 10,
  "mask_feature_min_masks": 0,
  "mask_feature_prob": 0.0,
  "mask_time_length": 10,
  "mask_time_min_masks": 2,
  "mask_time_min_space": 1,
  "mask_time_other": 0.0,
  "mask_time_prob": 0.075,
  "mask_time_selection": "static",
  "model_type": "wav2vec2",
  "num_adapter_layers": 3,
  "num_attention_heads": 16,
  "num_codevector_groups": 2,
  "num_codevectors_per_group": 320,
  "num_conv_pos_embedding_groups": 16,
  "num_conv_pos_embeddings": 128,
  "num_feat_extract_layers": 7,
  "num_hidden_layers": 24,
  "num_negatives": 100,
  "output_hidden_size": 1024,
  "pad_token_id": 0,
  "proj_codevector_dim": 768,
  "tdnn_dilation": [
    1,
    2,
    3,
    1,
    1
  ],
  "tdnn_dim": [
    512,
    512,
    512,
    512,
    1500
  ],
  "tdnn_kernel": [
    5,
    3,
    3,
    1,
    1
  ],
  "transformers_version": "4.22.0.dev0",
  "use_weighted_layer_sum": false,
  "vocab_size": 32,
  "xvector_output_dim": 512
}

100% 1/1 [00:00<00:00,  2.69ba/s]
100% 1/1 [00:00<00:00,  8.21ba/s]
Didn't find file /content/transformers/examples/pytorch/speech-recognition/wav2vec2-common_voice-ru-demo/tokenizer_config.json. We won't load it.
Didn't find file /content/transformers/examples/pytorch/speech-recognition/wav2vec2-common_voice-ru-demo/added_tokens.json. We won't load it.
Didn't find file /content/transformers/examples/pytorch/speech-recognition/wav2vec2-common_voice-ru-demo/special_tokens_map.json. We won't load it.
loading file /content/transformers/examples/pytorch/speech-recognition/wav2vec2-common_voice-ru-demo/vocab.json
loading file None
loading file None
loading file None
Adding <s> to the vocabulary
Adding </s> to the vocabulary
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
https://huggingface.co/facebook/wav2vec2-large-xlsr-53/resolve/main/preprocessor_config.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpwqmsvu6p
Downloading preprocessor_config.json: 100% 212/212 [00:00<00:00, 360kB/s]
storing https://huggingface.co/facebook/wav2vec2-large-xlsr-53/resolve/main/preprocessor_config.json in cache at /root/.cache/huggingface/transformers/281aea0033110ab616ee4c2840ee83ed30496bb549916b8aec6c5668109f9e79.d4484dc1c81456a2461485e7168b04347a7b9a4e3b1ef3aba723323b33e12326
creating metadata file for /root/.cache/huggingface/transformers/281aea0033110ab616ee4c2840ee83ed30496bb549916b8aec6c5668109f9e79.d4484dc1c81456a2461485e7168b04347a7b9a4e3b1ef3aba723323b33e12326
loading feature extractor configuration file https://huggingface.co/facebook/wav2vec2-large-xlsr-53/resolve/main/preprocessor_config.json from cache at /root/.cache/huggingface/transformers/281aea0033110ab616ee4c2840ee83ed30496bb549916b8aec6c5668109f9e79.d4484dc1c81456a2461485e7168b04347a7b9a4e3b1ef3aba723323b33e12326
Feature extractor Wav2Vec2FeatureExtractor {
  "do_normalize": true,
  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
  "feature_size": 1,
  "padding_side": "right",
  "padding_value": 0,
  "return_attention_mask": true,
  "sampling_rate": 16000
}

https://huggingface.co/facebook/wav2vec2-large-xlsr-53/resolve/main/pytorch_model.bin not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpio5rku8q
Downloading pytorch_model.bin: 100% 1.18G/1.18G [00:19<00:00, 65.5MB/s]
storing https://huggingface.co/facebook/wav2vec2-large-xlsr-53/resolve/main/pytorch_model.bin in cache at /root/.cache/huggingface/transformers/5d2a20b45a1689a376ec4a6282b9d9be42f931cdf8daf07c3668ba1070a059d9.622b46163a38532eae8ac5423b0481dfc0b9ea401af488b5141772bdff889079
creating metadata file for /root/.cache/huggingface/transformers/5d2a20b45a1689a376ec4a6282b9d9be42f931cdf8daf07c3668ba1070a059d9.622b46163a38532eae8ac5423b0481dfc0b9ea401af488b5141772bdff889079
loading weights file https://huggingface.co/facebook/wav2vec2-large-xlsr-53/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/5d2a20b45a1689a376ec4a6282b9d9be42f931cdf8daf07c3668ba1070a059d9.622b46163a38532eae8ac5423b0481dfc0b9ea401af488b5141772bdff889079
Some weights of the model checkpoint at facebook/wav2vec2-large-xlsr-53 were not used when initializing Wav2Vec2ForCTC: ['project_hid.bias', 'project_hid.weight', 'quantizer.weight_proj.weight', 'quantizer.weight_proj.bias', 'project_q.weight', 'project_q.bias', 'quantizer.codevectors']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-large-xlsr-53 and are newly initialized: ['lm_head.weight', 'lm_head.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
preprocess datasets:   0% 0/23444 [00:00<?, ?ex/s]
Traceback (most recent call last):
  File "run_speech_recognition_ctc.py", line 769, in <module>
    main()
  File "run_speech_recognition_ctc.py", line 628, in main
    desc="preprocess datasets",
  File "/usr/local/lib/python3.7/dist-packages/datasets/dataset_dict.py", line 790, in map
    for k, dataset in self.items()
  File "/usr/local/lib/python3.7/dist-packages/datasets/dataset_dict.py", line 790, in <dictcomp>
    for k, dataset in self.items()
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2405, in map
    desc=desc,
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 524, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/fingerprint.py", line 480, in wrapper
    out = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2756, in _map_single
    example = apply_function_on_filtered_inputs(example, i, offset=offset)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2655, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2347, in decorated
    result = f(decorated_item, *args, **kwargs)
  File "run_speech_recognition_ctc.py", line 609, in prepare_dataset
    sample = batch[audio_column_name]
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 123, in __getitem__
    value = decode_nested_example(self.features[key], value) if value is not None else None
  File "/usr/local/lib/python3.7/dist-packages/datasets/features/features.py", line 1260, in decode_nested_example
    return schema.decode_example(obj, token_per_repo_id=token_per_repo_id) if obj is not None else None
  File "/usr/local/lib/python3.7/dist-packages/datasets/features/audio.py", line 144, in decode_example
    array, sampling_rate = self._decode_mp3(file if file else path)
  File "/usr/local/lib/python3.7/dist-packages/datasets/features/audio.py", line 293, in _decode_mp3
    array, sampling_rate = torchaudio.load(path_or_file, format="mp3")
  File "/usr/local/lib/python3.7/dist-packages/torchaudio/backend/sox_io_backend.py", line 227, in load
    return _fallback_load(filepath, frame_offset, num_frames, normalize, channels_first, format)
  File "/usr/local/lib/python3.7/dist-packages/torchaudio/backend/sox_io_backend.py", line 29, in _fail_load
    raise RuntimeError("Failed to load audio from {}".format(filepath))
RuntimeError: Failed to load audio from /root/.cache/huggingface/datasets/downloads/extracted/707cd877a91cbe3455d83b9f62c3656e094f633f257743683372c05f4620af3b/cv-corpus-6.1-2020-12-11/ru/clips/common_voice_ru_18849051.mp3`

LysandreJik commented 2 years ago

Have you ever encountered this error, @albertvillanova @mariosasko?

albertvillanova commented 2 years ago

Hi @mehrdad78, thanks for reporting (and thanks @LysandreJik for drawing my attention to this).

I have manually checked the TAR file, its content and specifically the MP3 file raising the error: cv-corpus-6.1-2020-12-11/ru/clips/common_voice_ru_18849051.mp3

I can load it without any problem (under the hood, our Datasets library uses torchaudio for MP3 files):

In [1]: import torchaudio

In [2]: path = "./data/common_voice/ru/cv-corpus-6.1-2020-12-11/ru/clips/common_voice_ru_18849051.mp3"

In [3]: data = torchaudio.load(path, format="mp3")

In [4]: data
Out[4]: 
(tensor([[ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ..., -2.6095e-04,
           3.2425e-05,  8.8751e-05]]),
 48000)

This makes me think that maybe the source of your issue is sox. This is a non-Python dependency that must be installed manually using your operating system package manager, e.g.

sudo apt-get install sox

The installation instructions for Datasets with audio support are in our docs: Installation > Audio
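
A quick way to check whether the sox executable is visible from the Python environment is sketched below. It assumes that finding a `sox` binary on PATH is a reasonable proxy for the system package being installed, which may not hold for every torchaudio build. `find_sox` and `sox_hint` are hypothetical helper names, not part of Datasets or torchaudio.

```python
import shutil


def find_sox(which=shutil.which):
    """Return the path of the `sox` executable, or None if it is not on PATH.

    `which` is injectable only so the lookup can be exercised without sox installed.
    """
    return which("sox")


def sox_hint(path):
    """Turn the lookup result into a short, human-readable message."""
    if path is None:
        return "sox not found on PATH; try: sudo apt-get install sox"
    return f"sox found at {path}"


if __name__ == "__main__":
    print(sox_hint(find_sox()))
```

If the message says that sox was not found, installing it with the package manager as described above is the first thing to try.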

albertvillanova commented 2 years ago

Issue opened in Datasets to raise a more actionable error message:

https://github.com/huggingface/datasets/issues/4776

mehrdad78 commented 2 years ago

Thank you. I will try it and report the result.

albertvillanova commented 2 years ago

I have just read that there is apparently a backend change in the latest torchaudio release.

Therefore, the torchaudio version should be restricted so that it continues to use the sox backend, as expected by datasets.

pip install "torchaudio<0.12.0"

We should address this issue to support the latest torchaudio.
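
To tell whether an environment already satisfies the pin, the installed version string can be compared against 0.12.0. The following is a minimal sketch: `version_tuple` and `is_below_0_12` are hypothetical helper names, and the parsing assumes version strings of the form X.Y or X.Y.Z with an optional suffix such as a local build tag.

```python
import re


def version_tuple(version):
    """Parse the leading numeric 'X.Y[.Z]' part of a version string.

    Anything after the numeric part (for example a '+cu113' build tag) is ignored.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def is_below_0_12(version):
    """Return True if the version satisfies the "torchaudio<0.12.0" pin."""
    return version_tuple(version) < (0, 12, 0)
```

With torchaudio installed, `is_below_0_12(torchaudio.__version__)` returning False would suggest that the environment has picked up the newer release and that the pin above has not taken effect.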

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

yinghuozijin commented 1 year ago

@albertvillanova That solves my issue, thank you.
