[Wav2vec2] RuntimeError: CUDA error: an illegal memory access was encountered

flozi00 commented 2 years ago

Environment info

transformers version: master
Platform: ubuntu
Python version: 3.8
PyTorch version (GPU?): 1.10
Tensorflow version (GPU?):
Using GPU in script?: yes
Using distributed or parallel set-up in script?: yes

Who can help

@patrickvonplaten @anton-l

Information

Model I am using (Bert, XLNet ...):

The problem arises when using:

[x] the official example scripts: (give details below) speech recognition ctc
[ ] my own modified scripts: (give details below)

The tasks I am working on is:

[x] an official GLUE/SQUaD task: (give the name) commonvoice
[ ] my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

python -m torch.distributed.launch --nproc_per_node=1 run_speech_recognition_ctc.py \
    --dataset_name="common_voice" \
    --model_name_or_path="facebook/wav2vec2-xls-r-1b" \
    --dataset_config_name="de" \
    --output_dir="./wav2vec2-xls-r-1b-german" \
    --overwrite_output_dir \
    --num_train_epochs="15" \
    --per_device_train_batch_size="12" \
    --gradient_accumulation_steps="1" \
    --learning_rate="3e-4" \
    --warmup_steps="500" \
    --evaluation_strategy="steps" \
    --text_column_name="sentence" \
    --save_steps="400" \
    --layerdrop="0.0" \
    --save_total_limit="3" \
    --freeze_feature_extractor \
    --gradient_checkpointing \
    --fp16 --fp16_opt_level "03" \
    --group_by_length \
    --do_train --do_eval \
    --sharded_ddp simple \
    --logging_steps=10 \
    --eval_steps=25000 \
    --max_train_samples=5000 --max_eval_samples=5000

Traceback (most recent call last):
  File "run_speech_recognition_ctc.py", line 649, in <module>
    main()
  File "run_speech_recognition_ctc.py", line 600, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1325, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1884, in training_step
    loss = self.compute_loss(model, inputs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1916, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/fairscale/nn/data_parallel/sharded_ddp.py", line 224, in forward
    return self.module(*inputs, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 1618, in forward
    outputs = self.wav2vec2(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 1244, in forward
    attention_mask = self._get_feature_vector_attention_mask(
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 1082, in _get_feature_vector_attention_mask
    attention_mask[(torch.arange(attention_mask.shape[0], device=attention_mask.device), output_lengths - 1)] = 1
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
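The reported line writes to attention_mask at position output_lengths - 1 for every batch row. As a minimal sketch, the check below recomputes those indices on the CPU in plain Python and flags any that fall outside the mask. The kernel sizes and strides are assumed to be the default Wav2Vec2 feature-extractor configuration, and bad_indices is a hypothetical helper, not part of transformers. Nothing in the thread establishes that such an index caused this error, and the message itself warns that the reported line may be wrong.

```python
# Assumed default Wav2Vec2 feature-extractor configuration (conv_kernel / conv_stride).
CONV_KERNEL = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDE = (5, 2, 2, 2, 2, 2, 2)


def feat_extract_output_length(num_samples):
    """Number of frames left after the strided 1D convolutions (no padding)."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNEL, CONV_STRIDE):
        length = (length - kernel) // stride + 1
    return length


def bad_indices(sample_lengths):
    """Hypothetical helper: return (row, index) pairs whose last-frame index
    would fall outside a mask sized for the longest input in the batch."""
    num_frames = feat_extract_output_length(max(sample_lengths))
    bad = []
    for row, num_samples in enumerate(sample_lengths):
        index = feat_extract_output_length(num_samples) - 1
        if not 0 <= index < num_frames:
            bad.append((row, index))
    return bad


if __name__ == "__main__":
    # One second of 16 kHz audio next to a 300-sample clip that is too short.
    print(bad_indices([16000, 300]))
```

Running a check like this over the raw sample lengths of a dataset before training would at least rule very short clips in or out as a factor.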

Expected behavior

patrickvonplaten commented 2 years ago

Hey @flozi00,

This looks like a difficult error. Can we try to debug it step-by-step?

1. Does it work without --sharded_ddp simple in a single-GPU environment (without python -m torch.distributed.launch, just python run_speech_recognition_ctc.py)?
2. If yes, does it work without --sharded_ddp simple in a multi-GPU environment (with python -m torch.distributed.launch)?
3. If yes as well, then the problem is --sharded_ddp. BTW, I've never tested sharded_ddp with Wav2Vec2.

What do you need it for exactly?

Also, why do you use --nproc_per_node=1? This should be set to the number of GPUs, and if there is just one GPU, it's unnecessary to use DDP in general.
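The launch advice above can be sketched as a small helper that picks the command from the GPU count: plain python for one GPU, and torch.distributed.launch with --nproc_per_node equal to the number of GPUs otherwise. The function name build_launch_command is hypothetical, and the GPU count is passed in by the caller rather than detected.

```python
def build_launch_command(num_gpus, script, script_args):
    """Hypothetical helper: build the command line for a given GPU count."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if num_gpus == 1:
        # A single GPU does not need the distributed launcher.
        return ["python", script, *script_args]
    # One process per GPU when running distributed.
    return [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={num_gpus}",
        script, *script_args,
    ]


if __name__ == "__main__":
    args = ["--dataset_name=common_voice", "--do_train"]
    print(" ".join(build_launch_command(1, "run_speech_recognition_ctc.py", args)))
    print(" ".join(build_launch_command(2, "run_speech_recognition_ctc.py", args)))
```

The returned list can be passed directly to subprocess.run.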

flozi00 commented 2 years ago

It's not working with python -m torch.distributed.launch in general on my machine.

patrickvonplaten commented 2 years ago

Ok, this should definitely work. How many GPUs do you have?

I'll give it a try on two GPUs tonight!

patrickvonplaten commented 2 years ago

Ok, I just tried the following command on two TITAN RTX GPUs (24 GB RAM each):

#!/usr/bin/env bash
python -m torch.distributed.launch \
    --nproc_per_node=1 run_speech_recognition_ctc.py \
    --dataset_name="common_voice" \
    --model_name_or_path="facebook/wav2vec2-xls-r-1b" \
    --dataset_config_name="ab" \
    --output_dir="./wav2vec2-xls-r-1b-german" \
    --overwrite_output_dir \
    --num_train_epochs="5" \
    --per_device_train_batch_size="12" \
    --gradient_accumulation_steps="1" \
    --learning_rate="3e-4" \
    --warmup_steps="500" \
    --evaluation_strategy="steps" \
    --text_column_name="sentence" \
    --save_steps="400" \
    --layerdrop="0.0" \
    --save_total_limit="3" \
    --freeze_feature_extractor \
    --gradient_checkpointing \
    --fp16 --fp16_opt_level="03" \
    --group_by_length \
    --do_train --do_eval \
    --logging_steps=10 \
    --eval_steps=25000 \
    --max_train_samples=50 --max_eval_samples=50

and it works fine.

My env is as follows:

- `transformers` version: 4.15.0.dev0 (current master)
- Platform: Linux-5.3.0-64-generic-x86_64-with-glibc2.17
- Python version: 3.8.12
- PyTorch version (GPU?): 1.10.0+cu102 (True)
- Tensorflow version (GPU?): 2.6.0 (False)
- Flax version (CPU?/GPU?/TPU?): 0.3.4 (cpu)
- Jax version: 0.2.19
- JaxLib version: 0.1.70
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

and

- `datasets` version: 1.16.2.dev0 (current master)
- Platform: Linux-5.3.0-64-generic-x86_64-with-glibc2.17
- Python version: 3.8.12
- PyArrow version: 6.0.1

patrickvonplaten commented 2 years ago

Note that I use the ab config as it's a small dataset and easy to test. Besides that I've only removed the --sharded_ddp option. Can you verify whether the above script works for you?

flozi00 commented 2 years ago

With a larger dataset and many steps it even happens with the single-node setup. I think I need to reset my machine; maybe there is something wrong with CUDA.

anton-l commented 2 years ago

@flozi00 you can also try running the script as CUDA_LAUNCH_BLOCKING=1 python -m torch.distributed.launch ..., as the error suggests, to hopefully catch the exact line where it happens (otherwise the stack trace may point to an incorrect line due to asynchronous execution).
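When the training script is started from another Python program instead of a shell, the same variable can be set in the child's environment. This is a minimal sketch; blocking_env is a hypothetical helper name, and the variable has to be in place before the training process initializes CUDA.

```python
import os


def blocking_env(base=None):
    """Hypothetical helper: copy an environment and enable synchronous CUDA
    kernel launches, so errors are reported at the call that caused them."""
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


if __name__ == "__main__":
    env = blocking_env()
    print(env["CUDA_LAUNCH_BLOCKING"])
    # Pass the result as env=... to subprocess.run when starting the training script.
```

Synchronous launches slow training down noticeably, so this is a setting for debugging runs only.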

flozi00 commented 2 years ago

I did it, here is the new stack trace:

Traceback (most recent call last):
  File "run_speech_recognition_ctc.py", line 649, in <module>
    main()
  File "run_speech_recognition_ctc.py", line 600, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1325, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1884, in training_step
    loss = self.compute_loss(model, inputs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1916, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 1618, in forward
    outputs = self.wav2vec2(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 1239, in forward
    extract_features = self.feature_extractor(input_values)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 442, in forward
    hidden_states = conv_layer(hidden_states)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 317, in forward
    hidden_states = self.layer_norm(hidden_states)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/normalization.py", line 189, in forward
    return F.layer_norm(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py", line 2446, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: CUDA error: an illegal memory access was encountered

patrickvonplaten commented 2 years ago

What command did you run to get this stack trace?

flozi00 commented 2 years ago

CUDA_LAUNCH_BLOCKING=1 python run_speech_recognition_ctc.py \
    --dataset_name="common_voice" \
    --model_name_or_path="facebook/wav2vec2-xls-r-1b" \
    --dataset_config_name="de" \
    --output_dir="./wav2vec2-xls-r-1b-german" \
    --overwrite_output_dir \
    --num_train_epochs="15" \
    --per_device_train_batch_size="12" \
    --gradient_accumulation_steps="1" \
    --learning_rate="3e-4" \
    --warmup_steps="500" \
    --evaluation_strategy="steps" \
    --text_column_name="sentence" \
    --save_steps="400" \
    --layerdrop="0.0" \
    --save_total_limit="3" \
    --freeze_feature_extractor \
    --gradient_checkpointing \
    --fp16 --fp16_opt_level "03" \
    --group_by_length \
    --do_train --do_eval \
    --logging_steps=10 \
    --eval_steps=25000 \
    --max_train_samples=5000 --max_eval_samples=5000

patrickvonplaten commented 2 years ago

Hmm, okay, not really sure. BTW, if you run this command on multiple GPUs, it'll automatically run the Trainer in DP, which is known to have some bugs.

Maybe:

CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES="0" python run_speech_recognition_ctc.py \
    --dataset_name="common_voice" \
    --model_name_or_path="facebook/wav2vec2-xls-r-1b" \
    --dataset_config_name="de" \
    --output_dir="./wav2vec2-xls-r-1b-german" \
    --overwrite_output_dir \
    --num_train_epochs="15" \
    --per_device_train_batch_size="12" \
    --gradient_accumulation_steps="1" \
    --learning_rate="3e-4" \
    --warmup_steps="500" \
    --evaluation_strategy="steps" \
    --text_column_name="sentence" \
    --save_steps="400" \
    --layerdrop="0.0" \
    --save_total_limit="3" \
    --freeze_feature_extractor \
    --gradient_checkpointing \
    --fp16 --fp16_opt_level "03" \
    --group_by_length \
    --do_train --do_eval \
    --logging_steps=10 \
    --eval_steps=25000 \
    --max_train_samples=5000 --max_eval_samples=5000

works?

But otherwise I really don't know - the command does work for me.

flozi00 commented 2 years ago

It turned out that a batch size of 4 runs fine, strange. I tried using DeepSpeed ZeRO for larger batches, but that runs out of memory at initialization. I think I need to set up a clean machine with a fresh CUDA install, hopefully fixing it.
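Since a per-device batch of 4 runs, one possible workaround (not proposed or tested in this thread) is to keep the original effective batch size of 12 through gradient accumulation. With the Trainer, the effective batch size is the per-device batch size times the number of devices times --gradient_accumulation_steps. The sketch below only does that arithmetic, and accumulation_steps is a hypothetical helper name.

```python
def accumulation_steps(target_batch, per_device_batch, num_devices=1):
    """Hypothetical helper: accumulation steps needed to reach target_batch."""
    per_step = per_device_batch * num_devices
    if target_batch % per_step != 0:
        raise ValueError("target batch is not a multiple of the per-step batch")
    return target_batch // per_step


if __name__ == "__main__":
    # Original run: batch 12 on one device. Working run: batch 4.
    steps = accumulation_steps(target_batch=12, per_device_batch=4)
    print(f'--per_device_train_batch_size="4" --gradient_accumulation_steps="{steps}"')
```

This keeps the number of samples per optimizer update the same, although it does not explain why the larger batch fails.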
