Closed: flozi00 closed this issue 2 years ago.
Hey @flozi00,
This looks like a difficult error. Can we try to debug it step-by-step?
Could you check whether the error also occurs without --sharded_ddp simple in a single-GPU environment (without python -m torch.distributed.launch, just python run_speech_recognition_ctc.py), and without --sharded_ddp simple in a multi-GPU environment (with python -m torch.distributed.launch but still without --sharded_ddp)? BTW, I've never tested sharded_ddp with Wav2Vec2. What do you need it for exactly?
Also, why do you use --nproc_per_node=1? This should be set to the number of GPUs, and if there is just one GPU it's unnecessary to use DDP at all.
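Concretely, the two debug runs would look roughly like this (a sketch only; <args> stands in for the dataset/model/training flags you already use, and --nproc_per_node=2 is just an example value):
# 1) Single GPU, no launcher, no --sharded_ddp
python run_speech_recognition_ctc.py <args>
# 2) Multi-GPU DDP via the launcher, with --nproc_per_node set to the number of GPUs,
#    still without --sharded_ddp
python -m torch.distributed.launch --nproc_per_node=2 run_speech_recognition_ctc.py <args>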
It's not working with python -m torch.distributed.launch in general on my machine.
Ok, this should definitely work. How many GPUs do you have?
I'll give it a try on two GPUs tonight!
OK, I just tried the following command on two TITAN RTX GPUs (24GB each):
#!/usr/bin/env bash
python -m torch.distributed.launch \
--nproc_per_node=1 run_speech_recognition_ctc.py \
--dataset_name="common_voice" \
--model_name_or_path="facebook/wav2vec2-xls-r-1b" \
--dataset_config_name="ab" \
--output_dir="./wav2vec2-xls-r-1b-german" \
--overwrite_output_dir \
--num_train_epochs="5" \
--per_device_train_batch_size="12" \
--gradient_accumulation_steps="1" \
--learning_rate="3e-4" \
--warmup_steps="500" \
--evaluation_strategy="steps" \
--text_column_name="sentence" \
--save_steps="400" \
--layerdrop="0.0" \
--save_total_limit="3" \
--freeze_feature_extractor \
--gradient_checkpointing \
--fp16 --fp16_opt_level="03" \
--group_by_length \
--do_train --do_eval \
--logging_steps=10 \
--eval_steps=25000 \
--max_train_samples=50 --max_eval_samples=50 \
and it works fine.
My env is as follows:
- `transformers` version: 4.15.0.dev0 (current master)
- Platform: Linux-5.3.0-64-generic-x86_64-with-glibc2.17
- Python version: 3.8.12
- PyTorch version (GPU?): 1.10.0+cu102 (True)
- Tensorflow version (GPU?): 2.6.0 (False)
- Flax version (CPU?/GPU?/TPU?): 0.3.4 (cpu)
- Jax version: 0.2.19
- JaxLib version: 0.1.70
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
and
- `datasets` version: 1.16.2.dev0 (current master)
- Platform: Linux-5.3.0-64-generic-x86_64-with-glibc2.17
- Python version: 3.8.12
- PyArrow version: 6.0.1
Note that I use the ab config as it's a small dataset and easy to test. Besides that I've only removed the --sharded_ddp option. Can you verify whether the above script works for you?
With a larger dataset and many steps it even happens with the single-node setup. I think I need to reset my machine; maybe there is something wrong with CUDA.
@flozi00 you can also try running the script as CUDA_LAUNCH_BLOCKING=1 python -m torch.distributed.launch ... as the error message suggests, to hopefully catch the exact line where it happens (otherwise the stack trace points to an incorrect line due to asynchronous CUDA execution).
I did it, here is the new stack trace:
Traceback (most recent call last):
File "run_speech_recognition_ctc.py", line 649, in <module>
main()
File "run_speech_recognition_ctc.py", line 600, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1325, in train
tr_loss_step = self.training_step(model, inputs)
File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1884, in training_step
loss = self.compute_loss(model, inputs)
File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1916, in compute_loss
outputs = model(**inputs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 1618, in forward
outputs = self.wav2vec2(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 1239, in forward
extract_features = self.feature_extractor(input_values)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 442, in forward
hidden_states = conv_layer(hidden_states)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 317, in forward
hidden_states = self.layer_norm(hidden_states)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/normalization.py", line 189, in forward
return F.layer_norm(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py", line 2446, in layer_norm
return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: CUDA error: an illegal memory access was encountered
What command did you run to get this stack trace?
CUDA_LAUNCH_BLOCKING=1 python run_speech_recognition_ctc.py \
--dataset_name="common_voice" \
--model_name_or_path="facebook/wav2vec2-xls-r-1b" \
--dataset_config_name="de" \
--output_dir="./wav2vec2-xls-r-1b-german" \
--overwrite_output_dir \
--num_train_epochs="15" \
--per_device_train_batch_size="12" \
--gradient_accumulation_steps="1" \
--learning_rate="3e-4" \
--warmup_steps="500" \
--evaluation_strategy="steps" \
--text_column_name="sentence" \
--save_steps="400" \
--layerdrop="0.0" \
--save_total_limit="3" \
--freeze_feature_extractor \
--gradient_checkpointing \
--fp16 --fp16_opt_level "03" \
--group_by_length \
--do_train --do_eval \
--logging_steps=10 \
--eval_steps=25000 \
--max_train_samples=5000 --max_eval_samples=5000
Hmm - okay, not really sure. BTW, if you run this command on multiple GPUs it'll automatically run the Trainer in DP (DataParallel), which is known to have some bugs.
Maybe:
CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES="0" python run_speech_recognition_ctc.py \
--dataset_name="common_voice" \
--model_name_or_path="facebook/wav2vec2-xls-r-1b" \
--dataset_config_name="de" \
--output_dir="./wav2vec2-xls-r-1b-german" \
--overwrite_output_dir \
--num_train_epochs="15" \
--per_device_train_batch_size="12" \
--gradient_accumulation_steps="1" \
--learning_rate="3e-4" \
--warmup_steps="500" \
--evaluation_strategy="steps" \
--text_column_name="sentence" \
--save_steps="400" \
--layerdrop="0.0" \
--save_total_limit="3" \
--freeze_feature_extractor \
--gradient_checkpointing \
--fp16 --fp16_opt_level "03" \
--group_by_length \
--do_train --do_eval \
--logging_steps=10 \
--eval_steps=25000 \
--max_train_samples=5000 --max_eval_samples=5000
works?
But otherwise I really don't know - the command does work for me.
Turned out that a batch size of 4 runs fine, strange. I tried using DeepSpeed ZeRO for larger batches, but that returns out-of-memory at the init stage. I think I need to set up a clean machine with a fresh CUDA install, hopefully that fixes it.
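For context, a DeepSpeed ZeRO run of this script is typically launched roughly like this (a sketch only, not the exact command used here; ds_config_zero2.json is a hypothetical ZeRO stage-2 config file, and <args> again stands in for the usual dataset/model/training flags):
# Hypothetical sketch: launch the same training script through the deepspeed
# launcher and point the Trainer's --deepspeed flag at a ZeRO config file.
deepspeed run_speech_recognition_ctc.py --deepspeed="ds_config_zero2.json" <args>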
Environment info
transformers version: master
Who can help: @patrickvonplaten @anton-l
Information
Model I am using (Bert, XLNet ...):
The problem arises when using:
The tasks I am working on is:
To reproduce
Steps to reproduce the behavior:
Expected behavior