majercakdavid opened this issue 1 year ago
@majercakdavid, can you please share a log or stack trace?
@tjruwase sure, here is the log for the 0-th process: std_log_process_0.txt
Based on your log, it looks like the OOM is caused by activation memory consumption. The screenshot below shows that deepspeed.initialize() offloaded the model states so that GPU memory is almost empty.
ZeRO helps with the memory consumption of model states, but not of activations. You will need to use gradient checkpointing to fit these activations. The link you provided shows some examples of gradient checkpointing usage. Have you tried them? Also, can you share your actual command line? Thanks!
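For illustration, here is a minimal sketch of activation (gradient) checkpointing with `torch.utils.checkpoint`; the wrapped sub-module names are placeholders, and if you are using the MONAI SwinUNETR it may also expose a `use_checkpoint=True` constructor flag that enables this internally (worth verifying against your MONAI version):

```python
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class CheckpointedBlock(nn.Module):
    """Wraps an existing sub-module so its activations are recomputed
    during the backward pass instead of being stored during forward."""

    def __init__(self, block: nn.Module):
        super().__init__()
        self.block = block

    def forward(self, x):
        # use_reentrant=False is the recommended mode in recent PyTorch versions
        return checkpoint(self.block, x, use_reentrant=False)


# Illustrative usage: wrap the heaviest encoder stages of the model.
# The attribute names below are hypothetical placeholders.
# for name in ("layers1", "layers2", "layers3", "layers4"):
#     stage = getattr(model.swinViT, name)
#     setattr(model.swinViT, name, CheckpointedBlock(stage))
```

The trade-off is extra recomputation time in the backward pass in exchange for not keeping intermediate activations in GPU memory.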
@majercakdavid, do you still need this issue open?
@tjruwase unfortunately yes. After adding checkpointing for the forward pass I still get an OOM error during the backward pass. Let me attach the logs: std_log_process_0 (2).txt
@tjruwase if I use fp16, I can use 96x96x96, however I get NaN for the loss. If I use bfloat16, I get loss values and can use a 64x64x64 tensor as input, but as soon as I use 96x96x96 I get the following error: std_log_process_0 (3).txt
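For context, the two precision modes are selected via the fp16/bf16 blocks of the DeepSpeed config; a minimal illustrative sketch (not the exact config used here) looks like this:

```python
import deepspeed

# Illustrative DeepSpeed configs -- only the precision block differs.
# Values are examples, not the exact configuration used in this issue.
ds_config_fp16 = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {
        "enabled": True,
        "loss_scale": 0,            # 0 = dynamic loss scaling
        "initial_scale_power": 16,  # starting scale of 2**16
    },
}

ds_config_bf16 = {
    "train_micro_batch_size_per_gpu": 1,
    # bf16 keeps the fp32 exponent range, which usually avoids the
    # NaN/overflow behaviour seen with fp16, at lower mantissa precision.
    "bf16": {"enabled": True},
}

# Typical hand-off (assuming a DeepSpeed version whose initialize()
# accepts a config dict directly):
# model_engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config_bf16
# )
```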
It seems you are running out of GPU memory. Can you share logs for 64x64x64 with bfloat16?
@tjruwase sorry for the late response: std_log_process_0 (4).txt
Describe the bug I'm trying to run training of a SwinUNETR model on a multi-GPU node (4x V100, 16 GB VRAM each) with an effective batch size of 1 per GPU and a sample size of 96x96x96. However, even after many tweaks to the DeepSpeed config I'm still getting a CUDA OOM error.
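For reference, the memory-focused DeepSpeed config tweaks tried are along these lines (an illustrative sketch, not the exact config used):

```python
# Illustrative sketch of memory-reduction settings in a DeepSpeed config:
# ZeRO stage 3 with CPU offload of parameters and optimizer states, plus
# DeepSpeed's activation checkpointing section. Values are examples only,
# not the configuration actually used for this issue.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    # Note: this section only takes effect for activations that are
    # checkpointed through DeepSpeed's checkpointing API.
    "activation_checkpointing": {
        "partition_activations": True,
        "contiguous_memory_optimization": True,
        "cpu_checkpointing": True,
    },
}
```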
To Reproduce Steps to reproduce the behavior:
Expected behavior Training proceeds without an OOM error.
System info (please complete the following information):
Launcher context AML pipeline with PyTorch distribution:
Docker context mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.6-cudnn8-ubuntu20.04
Additional context