openspeech-team / openspeech

Open-Source Toolkit for End-to-End Speech Recognition leveraging PyTorch-Lightning and Hydra.
https://openspeech-team.github.io/openspeech/
MIT License

Could there be a memory leak in the conformer_lstm model? #123

Open OleguerCanal opened 2 years ago

OleguerCanal commented 2 years ago

❓ Questions & Help

I am training with the random sampler and the ddp_sharded strategy, but after some training steps I get a CUDA out-of-memory error.

Details

I am training on a SLURM-managed cluster using 2 nodes with 2 Tesla M60 (8GB) GPUs each. As I understand it, if the model doesn't fit on a single GPU, PyTorch Lightning should automatically shard it across the available ones.

As expected, the larger I make the batch size, the sooner the error occurs. What I find strange is that it takes a few iterations to crash; if the model and the batch didn't fit, wouldn't it crash immediately?
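One way to see which steps spike before the crash is to log peak CUDA memory after every training batch. A minimal sketch, assuming the standard PyTorch Lightning Callback API (hook signatures differ slightly between versions, hence the *args):

import torch
import pytorch_lightning as pl


class PeakMemoryLogger(pl.Callback):
    """Prints the peak CUDA memory observed during each training batch."""

    def on_train_batch_end(self, trainer, pl_module, *args, **kwargs):
        if torch.cuda.is_available():
            peak_mb = torch.cuda.max_memory_allocated() / 1024 ** 2
            print(f"step {trainer.global_step}: peak CUDA mem ~{peak_mb:.0f} MB")
            torch.cuda.reset_peak_memory_stats()  # measure each step on its own

Passing it as Trainer(callbacks=[PeakMemoryLogger()]) would show whether the spikes line up with particular (presumably long) batches.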

Here I attach a picture of the memory usage: [screenshot: GPU memory usage over training steps]

And these are the parameters I'm using:

audio:
  name: fbank
  sample_rate: 16000
  frame_length: 20.0
  frame_shift: 10.0
  del_silence: false
  num_mels: 80
  apply_spec_augment: true
  apply_noise_augment: false
  apply_time_stretch_augment: false
  apply_joining_augment: false
augment:
  apply_spec_augment: false
  apply_noise_augment: false
  apply_joining_augment: false
  apply_time_stretch_augment: false
  freq_mask_para: 27
  freq_mask_num: 2
  time_mask_num: 4
  noise_dataset_dir: None
  noise_level: 0.7
  time_stretch_min_rate: 0.7
  time_stretch_max_rate: 1.4
dataset:
  dataset: librispeech
  dataset_path: /home/ubuntu/data/librispeech
  dataset_download: false
  manifest_file_path: /home/ubuntu/data/librispeech/libri_subword_manifest.txt
criterion:
  criterion_name: cross_entropy
  reduction: mean
lr_scheduler:
  lr: 0.0001
  scheduler_name: warmup_reduce_lr_on_plateau
  lr_patience: 1
  lr_factor: 0.3
  peak_lr: 0.0001
  init_lr: 1.0e-10
  warmup_steps: 4000
model:
  model_name: conformer_lstm
  encoder_dim: 256
  num_encoder_layers: 6
  num_attention_heads: 4
  feed_forward_expansion_factor: 4
  conv_expansion_factor: 2
  input_dropout_p: 0.1
  feed_forward_dropout_p: 0.1
  attention_dropout_p: 0.1
  conv_dropout_p: 0.1
  conv_kernel_size: 31
  half_step_residual: true
  num_decoder_layers: 2
  decoder_dropout_p: 0.1
  max_length: 128
  teacher_forcing_ratio: 1.0
  rnn_type: lstm
  decoder_attn_mechanism: loc
  optimizer: adam
trainer:
  seed: 1
  accelerator: ddp_sharded  # I hardcoded the necessary parts here, basically telling it to use 2 nodes with 2 GPUs each
  accumulate_grad_batches: 1
  num_workers: 4
  batch_size: 16
  check_val_every_n_epoch: 1
  gradient_clip_val: 5.0
  logger: wandb
  max_epochs: 20
  save_checkpoint_n_steps: 10000
  auto_scale_batch_size: binsearch
  sampler: random
  name: gpu
  device: gpu
  use_cuda: true
  auto_select_gpus: true
tokenizer:
  sos_token: <s>
  eos_token: </s>
  pad_token: <pad>
  blank_token: <blank>
  encoding: utf-8

Thanks a lot, guys!

sooftware commented 2 years ago

When the audio input length is long, the memory seems to explode.
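For a rough sense of why long inputs hurt so much: each self-attention layer in the Conformer encoder materialises a (batch, heads, frames, frames) score tensor, so activation memory grows roughly quadratically with utterance length. A back-of-the-envelope sketch using the numbers from the config above (fp32, batch 16, 4 heads, 6 layers, 10 ms frame shift), ignoring the convolutional subsampling and everything other than the raw attention scores:

def attn_scores_mb(seconds, batch=16, heads=4, layers=6, frame_shift_ms=10, bytes_per=4):
    """Memory (MB) for the raw self-attention score tensors alone."""
    frames = int(seconds * 1000 / frame_shift_ms)
    return batch * heads * layers * frames * frames * bytes_per / 1024 ** 2


for sec in (5, 15, 30):
    print(f"{sec:2d} s utterances -> ~{attn_scores_mb(sec):,.0f} MB of attention scores")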

resurgo97 commented 2 years ago

@sooftware so is it designed in such a way that the input length increases within an epoch?

sooftware commented 2 years ago

That's not the case. My guess is that every time memory is allocated on a GPU, the amount held in the cache keeps growing, until eventually the memory explodes. (I think)
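One way to check this is to watch allocated vs. reserved memory, since PyTorch's caching allocator keeps freed blocks around. A small sketch using the standard torch.cuda APIs:

import torch


def report_cuda_memory(tag=""):
    allocated = torch.cuda.memory_allocated() / 1024 ** 2   # live tensors
    reserved = torch.cuda.memory_reserved() / 1024 ** 2     # held by the caching allocator
    print(f"[{tag}] allocated={allocated:.0f} MB  reserved={reserved:.0f} MB")

If reserved keeps climbing while allocated stays flat, the growth is the allocator cache rather than a true leak; torch.cuda.empty_cache() releases the cached blocks, although that usually only postpones an OOM caused by a genuinely too-large batch.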

OleguerCanal commented 2 years ago

Any ideas on how to solve it?

ljk8800 commented 2 years ago

Same problem here, any ideas on how to solve it?

virgile-blg commented 2 years ago

Hi, I'm not sure it's really a memory leak, as the audio batches can have different lengths during training. In my case the GPU memory usage increased and then decreased over the course of training.

Have you tried decreasing the batch size to leave some room for batches with longer sequences?
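A related idea, shown as a hypothetical sketch rather than openspeech's own sampler: budget each batch by total frames instead of a fixed example count, so batches of long utterances shrink automatically. frame_lengths is assumed to be the per-utterance frame count taken from the manifest:

from torch.utils.data import Sampler


class FrameCappedBatchSampler(Sampler):
    """Yields lists of dataset indices whose total length stays under a frame budget."""

    def __init__(self, frame_lengths, max_frames_per_batch=20_000):
        self.batches, batch, total = [], [], 0
        for idx, n_frames in enumerate(frame_lengths):
            if batch and total + n_frames > max_frames_per_batch:
                self.batches.append(batch)
                batch, total = [], 0
            batch.append(idx)
            total += n_frames
        if batch:
            self.batches.append(batch)

    def __iter__(self):
        return iter(self.batches)

    def __len__(self):
        return len(self.batches)

It would be passed to the DataLoader as batch_sampler= together with a padding collate function; sorting frame_lengths beforehand also reduces padding waste.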

OleguerCanal commented 2 years ago

I think @virgile-blg is right, although it is a bit odd that memory keeps increasing over the first epochs and only then becomes more stable.

virgile-blg commented 2 years ago

One option is to disable PyTorch Lightning's auto_scale_batch_size. When it is set to False, there is no OOM error during the first epoch. I suspect the tuner picks a batch size without ever seeing the longest sequences in the training set.
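For reference, a minimal sketch of that workaround with a PyTorch Lightning 1.x Trainer (argument names follow that API; model and datamodule are placeholders for the openspeech objects):

import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus=2,
    num_nodes=2,
    accelerator="ddp_sharded",    # as in the trainer config above; newer versions call this `strategy`
    auto_scale_batch_size=False,  # disable the tuner; the config above uses `binsearch`
    max_epochs=20,
    gradient_clip_val=5.0,
)
trainer.fit(model, datamodule=datamodule)

With binsearch, trainer.tune(model) probes batch sizes on whichever batches it happens to draw, so the size it settles on can still OOM once a batch containing the longest utterances comes along.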