hustvl / YOLOS

[NeurIPS 2021] You Only Look at One Sequence
https://arxiv.org/abs/2106.00666
MIT License

CUDA Out of Memory Errors w Batch Size of 1 on 16GB V100 #27

Open · jordanparker6 opened 2 years ago

jordanparker6 commented 2 years ago

Using the default FeatureExtractor settings for the HuggingFace port of YOLOS, I am consistently running into CUDA OOM errors on a 16GB V100 (even with a training batch size of 1).

I would like to train YOLOS on publaynet and ideally use 4-8 V100s.

Is there a way to lower CUDA memory usage while training YOLOS other than reducing the batch size (whilst preserving accuracy and leveraging the pretrained models)?

I see that other models (e.g. DiT) use image sizes of 224x224. However, is it fair to assume that such a small image size would not be appropriate for object detection, since too much information is lost? In the DiT case the objective was document image classification.
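
One lever besides batch size is the resize resolution used by the feature extractor: YOLOS's sequence length grows with the number of image patches, so smaller inputs directly reduce attention memory. A minimal sketch, assuming transformers' YolosFeatureExtractor and its size/max_size resize arguments (the 512/864 values are illustrative, not validated settings, and accuracy may drop at lower resolutions):

    from transformers import YolosFeatureExtractor

    # Override the detection-style defaults (typically size=800, max_size=1333)
    # with smaller targets so fewer patches, and therefore fewer tokens, are
    # fed to the transformer.
    feature_extractor = YolosFeatureExtractor.from_pretrained(
        "hustvl/yolos-base",
        size=512,      # target for the shorter image edge
        max_size=864,  # cap for the longer image edge
    )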

Yuxin-CV commented 2 years ago

Hi, for the memory issue, please refer to https://github.com/hustvl/YOLOS/issues/5#issuecomment-867533669

jordanparker6 commented 2 years ago

Ahh, that's great! Thank you.

jordanparker6 commented 2 years ago

For those interested, I found that the HF implementation supports gradient checkpointing (the call below enables activation checkpointing, which is distinct from gradient accumulation).

Enable it with:

    self.model = YolosForObjectDetection.from_pretrained(
      self.hparams.pretrained_model_name_or_path,
      config=config,
      ignore_mismatched_sizes=True
    )
    # Trade compute for memory: activations are recomputed during the
    # backward pass instead of being stored, cutting peak GPU memory.
    self.model.gradient_checkpointing_enable()

I was able to increase the batch size from 1 to 8 on a T4 using this together with ddp_sharded in pytorch-lightning. It shaved about 35 minutes off each epoch, reducing the per-epoch time from 165 minutes to 130 minutes.

    model:
      pretrained_model_name_or_path: "hustvl/yolos-base"
      learning_rate: 2e-5
    data:
      data_dir: "/datastores/doclaynet/images"
      train_batch_size: 8
      val_batch_size: 8
      num_workers: 4
    trainer:
      resume_from_checkpoint: null
      accelerator: "gpu"
      num_nodes: 1
      strategy: "ddp_sharded"
      max_epochs: 10
      min_epochs: 3
      max_steps: -1
      val_check_interval: 1.0
      check_val_every_n_epoch: 1
      gradient_clip_val: 1.0
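
As a rough illustration of how a config like the one above maps onto the trainer, here is a minimal sketch assuming a pytorch-lightning 1.x Trainer (where strategy="ddp_sharded" is still available); LitYolos is a hypothetical LightningModule wrapping YolosForObjectDetection with gradient checkpointing enabled as in the snippet earlier, and accumulate_grad_batches is added to show genuine gradient accumulation on top of checkpointing:

    import pytorch_lightning as pl

    # LitYolos is a placeholder LightningModule (not part of this repo) that
    # loads YolosForObjectDetection and calls gradient_checkpointing_enable().
    model = LitYolos(
        pretrained_model_name_or_path="hustvl/yolos-base",
        learning_rate=2e-5,
    )

    trainer = pl.Trainer(
        accelerator="gpu",
        devices=4,                  # e.g. 4 V100s, as in the original question
        num_nodes=1,
        strategy="ddp_sharded",     # shards optimizer state and gradients across GPUs
        max_epochs=10,
        min_epochs=3,
        max_steps=-1,
        val_check_interval=1.0,
        check_val_every_n_epoch=1,
        gradient_clip_val=1.0,
        accumulate_grad_batches=4,  # effective batch = 4 x train_batch_size per GPU
    )
    trainer.fit(model)  # assumes the module defines its own dataloaders
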
Yuxin-CV commented 2 years ago

Awesome! 🥰🥰🥰