@PageTurnIO, the underlying issue here is that ZeRO and offloading help optimize the memory footprint of model state (parameters and optimizer states). In this case, however, the OOM is caused by the activation memory footprint of the very long context length. So you will need to explore techniques like sequence parallelism (such as Ulysses) and activation checkpointing to fit the long context in memory.
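For activation checkpointing, the relevant DeepSpeed config section is sketched below (values are illustrative; note this only configures DeepSpeed's checkpointing engine, the model's forward pass still has to call the checkpoint API around each block):

```python
# Illustrative DeepSpeed config fragment (example values, not a tested setup).
# This configures DeepSpeed's activation checkpointing engine; the model code
# must still wrap each block with deepspeed.checkpointing.checkpoint(...)
# or torch.utils.checkpoint.checkpoint(...).
ds_activation_checkpointing = {
    "activation_checkpointing": {
        "partition_activations": True,   # shard checkpointed activations across GPUs
        "cpu_checkpointing": True,       # offload checkpointed activations to CPU RAM
        "contiguous_memory_optimization": False,
        "synchronize_checkpoint_boundary": False,
        "profile": False,
    }
}
```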
@tjruwase thanks for your comment, very helpful! Fully agree with you that this is an activation memory problem. Looking over the Ulysses code, it seems compatible with transformers/attention, but not necessarily with state space models. But the principles are all the same.
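Even without Ulysses, plain activation checkpointing over the SSM blocks should still cut activation memory. A minimal sketch, assuming the Mamba implementation keeps its residual blocks under `model.backbone.layers` (that attribute path and the wrapper below are assumptions for illustration, not the actual training code):

```python
import torch
from torch.utils.checkpoint import checkpoint


class CheckpointedBlock(torch.nn.Module):
    """Recompute a block's activations during backward instead of storing them
    for the entire >50k-token sequence."""

    def __init__(self, block):
        super().__init__()
        self.block = block

    def forward(self, hidden_states, *args, **kwargs):
        # use_reentrant=False selects the non-reentrant implementation,
        # which also supports keyword arguments.
        return checkpoint(self.block, hidden_states, *args,
                          use_reentrant=False, **kwargs)


def apply_activation_checkpointing(model):
    # Assumption: mamba_ssm's MambaLMHeadModel exposes its blocks as
    # model.backbone.layers; adjust the attribute path for other implementations.
    model.backbone.layers = torch.nn.ModuleList(
        CheckpointedBlock(block) for block in model.backbone.layers
    )
    return model
```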
Using DeepSpeed with Mamba is indeed an open question. Has anyone succeeded with this?
Describe the bug I am experiencing a CUDA out of memory error while training a Mamba 2.8b model with DeepSpeed using ZeRO 3. The issue occurs during the backward pass, and I have tried adjusting the config file many times, but the problem persists.
The training works fine for context lengths up to 48k tokens, but fails when increasing to > 50k tokens.
CPU memory usage is relatively low (~100 GB out of 1800 GB) during training, indicating that the model parameters and optimizer states are not being properly offloaded to the CPU.
I am trying to figure out whether this is a Mamba-specific error. Does anyone have experience training state space models with ZeRO-3 in DeepSpeed?
I would greatly appreciate any guidance or suggestions to resolve this issue.
To Reproduce
DeepSpeed Config File
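For reference, a ZeRO-3 setup with CPU offload of parameters and optimizer state along these lines can be expressed as a Python dict passed to deepspeed.initialize. The sketch below uses illustrative values, not the exact config from this run:

```python
import deepspeed

# Illustrative ZeRO-3 config with CPU offload; micro-batch and bucket sizes
# are placeholders, not the values from the original run.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_max_live_parameters": 1e9,
        "stage3_prefetch_bucket_size": 5e8,
    },
}


def build_engine(model):
    # `model` is whatever train.py constructs; DeepSpeed shards its parameters
    # across the 8 ranks and offloads them to CPU per the config above.
    engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config,
    )
    return engine, optimizer
```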
Expected behavior The code should use ZeRO-3 to offload to CPU and/or NVMe and allow training on longer context sequences.
ds_report output
System Information
GPU: 8 x NVIDIA A100 80GB
CPU/RAM: 240 vCPUs, 1800 GiB RAM
Storage: 20 TiB SSD
CUDA Version: 12.2
DeepSpeed Version: 0.14.3
Launcher context
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 OMP_NUM_THREADS=8 torchrun --nproc_per_node=8 --master_port=$((RANDOM + 10000)) train.py
Additional context I've also tried NVMe offload, and I get the same result.
Training Setup:
Model: Mamba-2.8b-slimpj
Dataset: Upsampled Slim Pajama dataset with long context lengths (> 50k tokens)
Hardware: 8 x A100 80GB GPUs, 240 vCPUs, 1800 GiB RAM, 20 TiB SSD