microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] DeepSpeed CUDA OOM on SwinUNETR from MONAI #2930

Open · majercakdavid opened this issue 1 year ago

majercakdavid commented 1 year ago

Describe the bug
I'm trying to run training of the SwinUNETR model on a multi-GPU node (4x V100, 16 GB VRAM each) with an effective batch size of 1 per GPU and a sample size of 96x96x96. However, even after many tweaks to the DS config I still get a CUDA OOM error.

To Reproduce
Steps to reproduce the behavior:

  1. Clone 'MONAI SwinUNETR'
  2. Use deepspeed.initialize() with the following configuration (a launch sketch follows this list):
    {
        "train_micro_batch_size_per_gpu": 1,
        "steps_per_print": 1,
        "fp16": {
            "enabled": true
        },
        "optimizer": {
            "type": "Adam",
            "params": {
                "lr": 0.001,
                "betas": [0.8, 0.999],
                "eps": 1e-8,
                "weight_decay": 3e-7
            }
        },
        "scheduler": {
            "type": "WarmupLR",
            "params": {
                "warmup_min_lr": 0,
                "warmup_max_lr": 0.001,
                "warmup_num_steps": 100
            }
        },
        "wall_clock_breakdown": false,
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {
                "device": "cpu"
            },
            "offload_param": {
                "device": "cpu"
            },
            "contiguous_gradients": true,
            "overlap_comm": false,
            "allgather_bucket_size": 5e5,
            "reduce_bucket_size": 5e5
        },
        "zero_allow_untested_optimizer": false,
        "activation_checkpointing": {
            "partition_activations": true,
            "cpu_checkpointing": false,
            "contiguous_memory_optimization": false,
            "number_checkpoints": null,
            "synchronize_checkpoint_boundary": false,
            "profile": false
        }
    }
  3. Get OOM error
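
For context, the wrapping looks roughly like the sketch below; the SwinUNETR constructor arguments, the ds_config.json filename, and the placeholder loss are illustrative assumptions rather than the exact training script:

    import deepspeed
    import torch
    from monai.networks.nets import SwinUNETR

    # Build the model; the constructor arguments here are illustrative values.
    model = SwinUNETR(
        img_size=(96, 96, 96),
        in_channels=1,
        out_channels=14,
        feature_size=48,
    )

    # deepspeed.initialize consumes the JSON config from step 2; the
    # optimizer and LR scheduler are created from that config.
    model_engine, optimizer, _, scheduler = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config="ds_config.json",
    )

    # One illustrative step with a random 96x96x96 fp16 input and a
    # placeholder loss, just to exercise forward and backward.
    x = torch.randn(1, 1, 96, 96, 96, device=model_engine.device, dtype=torch.half)
    loss = model_engine(x).mean()
    model_engine.backward(loss)
    model_engine.step()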

Expected behavior
Training proceeds without an OOM error.

System info (please complete the following information):

Launcher context
AML pipeline with PyTorch distribution:

distribution:
  type: pytorch

Docker context
mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.6-cudnn8-ubuntu20.04

Additional context

tjruwase commented 1 year ago

@majercakdavid, can you please share the log/stack trace?

majercakdavid commented 1 year ago

@tjruwase sure, here is the log for the 0-th process: std_log_process_0.txt

tjruwase commented 1 year ago

Based on your log, it looks like the OOM is caused by activation memory consumption. The screenshot below shows that deepspeed.initialize() offloaded the model states, so GPU memory is almost empty.

[screenshot: GPU memory nearly empty after deepspeed.initialize() offloads the model states]

ZeRO helps with the memory consumption of model states, but not of activations. You will need to use gradient checkpointing to fit these activations. The link you provided shows some examples of gradient checkpointing usage. Have you tried those? Also, can you share your actual command line? Thanks!
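
For example, here is a rough sketch of two ways to enable it; the use_checkpoint flag and the constructor arguments are assumptions about the MONAI version in use:

    import torch
    from torch.utils.checkpoint import checkpoint
    from monai.networks.nets import SwinUNETR

    # Option 1: let SwinUNETR checkpoint its own transformer blocks.
    # use_checkpoint is assumed to be available in the installed MONAI version.
    model = SwinUNETR(
        img_size=(96, 96, 96),
        in_channels=1,
        out_channels=14,
        use_checkpoint=True,
    )

    # Option 2: wrap an arbitrary submodule so its activations are
    # recomputed during backward instead of being kept in GPU memory.
    def checkpointed(module, x):
        # use_reentrant=False is preferred on recent PyTorch; drop the
        # argument on older releases that do not accept it.
        return checkpoint(module, x, use_reentrant=False)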

tjruwase commented 1 year ago

@majercakdavid, do you still need this open?

majercakdavid commented 1 year ago

@tjruwase unfortunately yes. After enabling checkpointing for the forward pass I still get an OOM error in the backward pass. Let me attach the logs: std_log_process_0 (2).txt
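
As far as I understand, the activation_checkpointing section of the DS config only takes effect when checkpoint calls are routed through DeepSpeed's own API rather than torch.utils.checkpoint; a minimal sketch of that wiring, assuming the config lives in ds_config.json:

    import deepspeed

    # Configure DeepSpeed's activation checkpointing from the same JSON
    # config (its "activation_checkpointing" section), then route checkpoint
    # calls through deepspeed.checkpointing so that options such as
    # partition_activations and cpu_checkpointing can apply.
    deepspeed.checkpointing.configure(mpu_=None, deepspeed_config="ds_config.json")

    def checkpointed(block, x):
        return deepspeed.checkpointing.checkpoint(block, x)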

majercakdavid commented 1 year ago

@tjruwase if I use fp16 I can use 96x96x96 inputs, but the loss is NaN. If I use bfloat16 I get valid loss values and can use a 64x64x64 tensor as input, but as soon as I use 96x96x96 I get the following error: std_log_process_0 (3).txt
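
For reference, the bfloat16 run only swaps the fp16 block of the config for DeepSpeed's bf16 block; a minimal sketch of the assumed change, trimmed to the essential keys:

    import deepspeed
    from monai.networks.nets import SwinUNETR

    # Trimmed config: fp16 replaced by bf16, everything else as in the
    # original config (only the essential keys shown; a sketch).
    ds_config = {
        "train_micro_batch_size_per_gpu": 1,
        "bf16": {"enabled": True},
        "optimizer": {"type": "Adam", "params": {"lr": 0.001}},
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {"device": "cpu"},
            "offload_param": {"device": "cpu"},
        },
    }

    model = SwinUNETR(img_size=(64, 64, 64), in_channels=1, out_channels=14,
                      use_checkpoint=True)
    engine, _, _, _ = deepspeed.initialize(model=model,
                                           model_parameters=model.parameters(),
                                           config=ds_config)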

tjruwase commented 1 year ago

It seems you are running out of GPU memory. Can you share logs for 64x64x64 with bfloat16?

majercakdavid commented 1 year ago

@tjruwase sorry for the late response: std_log_process_0 (4).txt