Hey @sanchit-gandhi,
Could you provide the exact training command that you used as well as your environment info so that I can verify on a V100 from my side?
Model script:
# checkpoints to leverage
encoder_id = "facebook/wav2vec2-large-lv60"
decoder_id = "bert-large-uncased"
feature_extractor = AutoFeatureExtractor.from_pretrained(encoder_id)
feature_extractor.save_pretrained("./")
tokenizer = AutoTokenizer.from_pretrained(decoder_id)
tokenizer.save_pretrained("./")
model = SpeechEncoderDecoderModel.from_encoder_decoder_pretrained(encoder_id, decoder_id, encoder_add_adapter=True)
model.config.encoder.feat_proj_dropout = 0.0
model.config.encoder.final_dropout = 0.0
model.config.encoder.mask_time_prob = 0.1
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.max_length = 50
model.config.num_beams = 1
model.config.encoder.layerdrop = 0.0
model.config.use_cache = False
model.config.decoder.use_cache = False
model.config.processor_class = "Wav2Vec2Processor"
# check if generation works
out = model.generate(torch.ones((1, 2000)))
model.save_pretrained("./")
Bash script:
#!/usr/bin/env bash
CUDA_AVAILABLE_DEVICES=0 python run_speech_recognition_seq2seq.py \
--dataset_name="librispeech_asr" \
--model_name_or_path="./" \
--dataset_config_name="clean" \
--train_split_name="train.100" \
--eval_split_name="validation" \
--output_dir="./" \
--preprocessing_num_workers="1" \
--length_column_name="input_length" \
--overwrite_output_dir \
--num_train_epochs="1" \
--per_device_train_batch_size="4" \
--per_device_eval_batch_size="4" \
--gradient_accumulation_steps="2" \
--generation_max_length="40" \
--generation_num_beams="1" \
--learning_rate="3e-4" \
--warmup_steps="500" \
--evaluation_strategy="steps" \
--text_column_name="text" \
--save_steps="500" \
--eval_steps="500" \
--logging_steps="1" \
--save_total_limit="1" \
--freeze_feature_encoder \
--gradient_checkpointing \
--fp16 \
--group_by_length \
--predict_with_generate \
--do_lower_case \
--do_eval --do_train \
--push_to_hub \
--use_auth_token
Environment:
- `transformers` version: 4.17.0.dev0
- Platform: Linux-5.11.0-1028-gcp-x86_64-with-glibc2.33
- Python version: 3.9.5
- PyTorch version (GPU?): 1.10.2+cu113 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): 0.4.0 (gpu)
- Jax version: 0.2.28
- JaxLib version: 0.1.76
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
Great, thanks for the repro!
In this case, it looks like you are using the default Adam optimizer, which can be quite heavy (it uses 3x the model parameters for its optimizer state).
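As a rough illustration of why the optimizer matters here (a back-of-the-envelope sketch; the ~650M total parameter count for wav2vec2-large plus bert-large is only approximate):
# Rough fp32 memory budget, before any activations are counted.
n_params = 650e6                                  # ~317M (wav2vec2-large) + ~335M (bert-large), approximate
bytes_per_param = 4                               # fp32
weights = n_params * bytes_per_param              # ~2.6 GB
gradients = n_params * bytes_per_param            # ~2.6 GB
adam_state = 2 * n_params * bytes_per_param       # ~5.2 GB (exp_avg + exp_avg_sq)
print((weights + gradients + adam_state) / 1e9)   # ~10.4 GB on a 16 GB V100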
As a first step, I would try replacing torch's native Adam with the 8-bit Adam from https://github.com/facebookresearch/bitsandbytes, as shown here: https://github.com/huggingface/transformers/blob/552f8d30917cabd738d1c32a9e047f2da3ae1b28/examples/research_projects/robust-speech-event/run_speech_recognition_ctc_bnb.py#L678
If this still uses too much memory, I'd start looking into using Adafactor instead. This should be as simple as adding an --adafactor flag to the command above.
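For reference, a minimal sketch of what that swap can look like with the Trainer API; the model comes from the script above, while training_args, train_dataset and eval_dataset are assumed to be defined elsewhere and the hyperparameters are illustrative:
import bitsandbytes as bnb
from transformers import Seq2SeqTrainer

# Illustrative only: build an 8-bit Adam over the trainable parameters and
# hand it to the trainer in place of the default AdamW.
optimizer = bnb.optim.Adam8bit(
    params=[p for p in model.parameters() if p.requires_grad],
    lr=3e-4,
    betas=(0.9, 0.999),
    eps=1e-8,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,            # assumed to be defined elsewhere
    train_dataset=train_dataset,   # assumed to be defined elsewhere
    eval_dataset=eval_dataset,     # assumed to be defined elsewhere
    optimizers=(optimizer, None),  # None -> the Trainer creates the LR scheduler
)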
Thanks for the reply, Patrick!
I first tried using the 8-bit implementation of Adam from bitsandbytes that you cited. Even with a batch size of 1, this throws the CUDA out of memory error on the GPU.
Upon inspection of the codebase at https://github.com/facebookresearch/bitsandbytes, it does not appear to offer an Adafactor implementation. Instead, I tried the Hugging Face Adafactor optimizer (https://huggingface.co/docs/transformers/main_classes/optimizer_schedules#transformers.Adafactor). However, despite this change, a batch size of 1 still exceeds the GPU memory limit.
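For reference, a minimal sketch of wiring the Hugging Face Adafactor into the Trainer; the model comes from the script above, training_args and train_dataset are assumed to be defined elsewhere, and the settings are illustrative:
from transformers import Seq2SeqTrainer
from transformers.optimization import Adafactor, AdafactorSchedule

# Illustrative only: Adafactor with its internal, relative-step learning rate,
# which avoids keeping Adam's two full per-parameter state tensors.
optimizer = Adafactor(
    model.parameters(),
    scale_parameter=True,
    relative_step=True,
    warmup_init=True,
    lr=None,
)
lr_scheduler = AdafactorSchedule(optimizer)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,            # assumed to be defined elsewhere
    train_dataset=train_dataset,   # assumed to be defined elsewhere
    optimizers=(optimizer, lr_scheduler),
)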
Small tip here: I cannot copy-paste and re-run the command. It says "NameError: name 'AutoFeatureExtractor' is not defined". It saves a lot of time if every script can directly be re-run without missing imports :-)
I see what the error probably is: it's not CUDA_AVAILABLE_DEVICES, but CUDA_VISIBLE_DEVICES (sorry, I might have given you that non-existent command :D). I just tried it out on a dummy dataset and it works fine with CUDA_VISIBLE_DEVICES=0, even with normal Adam and batch_size=4.
Could you try again? Pretty sure it should work this time on the larger dataset, at least with bnb.
Some more explanation on what happened and how one could have debugged this. Since CUDA_AVAILABLE_DEVICES doesn't exist, adding the bash variable didn't have any effect, which meant that you were using PyTorch's Data Parallelism (DP) by default: https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html. DP is not really maintained anymore by PyTorch, and it's strongly recommended to switch to DDP instead: https://pytorch.org/docs/stable/notes/ddp.html. However, what we want to do here is simply use one GPU, so by adding the (correct) bash env variable CUDA_VISIBLE_DEVICES=0 we can run two trainings (one on each GPU at the same time).
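A quick way to confirm which devices the training process actually sees (a small check, not part of the original script):
import torch

# With CUDA_VISIBLE_DEVICES=0 exported before launching the script, only one
# device is visible; with the misspelled variable, both GPUs show up and the
# Trainer falls back to torch.nn.DataParallel.
print(torch.cuda.device_count())  # expect 1 when the variable is set correctly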
For debugging tips: it's often a good idea to monitor the GPUs when starting to train. This can be done by keeping a window open that runs watch -n 0.1 nvidia-smi to monitor GPU usage. Here it quickly became obvious that both GPUs were being used instead of just one, meaning that there was a problem with the bash command.
@sgugger - do you think it makes sense to throw a warning when a user is using PyTorch's DP with the Trainer? I don't really see a use case where DP is preferred over DDP.
A warning seems a bit violent, since it's not something PyTorch has deprecated, but we can certainly show an info message.
Correcting CUDA_AVAILABLE_DEVICES to CUDA_VISIBLE_DEVICES rectified the issue! On the full LibriSpeech dataset, I am able to use a batch_size=8 and the 8-bit bnb optimizer to run training at ~15GB memory usage on a single GPU.
Thanks, Patrick!
When training a wav2vec2-2-bert-large model on the LibriSpeech ASR corpus on an NVIDIA Tesla V100 GPU with the following training hyperparameters:
per_device_train_batch_size=4
per_device_eval_batch_size=4
gradient_accumulation_steps=2
generation_num_beams=1
the GPU memory is exhausted and an out of memory error is returned:
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 15.78 GiB total capacity; 13.46 GiB already allocated; 5.25 MiB free; 13.93 GiB reserved in total by PyTorch)
Reducing the training batch size to 1 and increasing the number of gradient accumulation steps still returns an out of memory error. What measures can be taken to effectively reduce the memory usage?