orrzohar opened this issue 2 months ago
Hi @orrzohar - can you please share the DeepSpeed version you are using, as well as your ds_report output? That will tell us more about your DeepSpeed install.
@orrzohar For clarification, does video_frames have a different number of frames on different ranks in your repro? Also, what are the sizes of clip_model and the remaining part? If clip_model is small enough and only mlp needs ZeRO3, we can split them into two separate models.
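Roughly what I mean is the sketch below: keep the frozen clip_model as a plain module outside the DeepSpeed engine, and pass only the part that needs ZeRO3 to deepspeed.initialize. The builder functions, module names, forward signature, and config values here are placeholders for illustration, not taken from your code.

import torch
import deepspeed

clip_model = build_clip_model().cuda().eval()     # placeholder builder for the CLIP tower
for p in clip_model.parameters():
    p.requires_grad_(False)                       # frozen; never partitioned by ZeRO3

mlp_llm = build_mlp_and_llm()                     # placeholder builder for connector + LLM

ds_config = {                                     # illustrative ZeRO3 config, not your JSON
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
    "zero_optimization": {"stage": 3},
}

engine, _, _, _ = deepspeed.initialize(
    model=mlp_llm, model_parameters=mlp_llm.parameters(), config=ds_config
)

def training_step(frames, labels):
    with torch.no_grad():
        feats = clip_model(frames)                # any number of calls per rank is fine here
    loss = engine(feats.detach(), labels)         # only this part runs under ZeRO3
    engine.backward(loss)
    engine.step()
    return loss

Since clip_model is never registered with the engine, its forward passes issue no ZeRO3 collectives, so a different number of encoding passes per rank cannot desynchronize the ranks.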
Hi @loadams, @tohtana,
deepspeed==0.13.5 installed via pyproject.toml:
dependencies = [
"tokenizers==0.19.1", "sentencepiece==0.1.99", "shortuuid",
"accelerate==0.33.0", "peft", "bitsandbytes",
"pydantic", "markdown2[all]", "numpy", "scikit-learn==1.2.2",
"gradio", "gradio_client==0.8.1", "easydict",
"requests", "httpx==0.24.0", "uvicorn", "fastapi",
"einops==0.6.1", "einops-exts==0.0.4", "timm==0.6.13",
"fairscale", "decord", "opencv-python", "chardet",
"datasets==2.16.1", "openai==1.8.0", "webdataset==0.2.86",
"transformers==4.44.0", "ezcolorlog", "pytorchvideo",
"s2wrapper@git+https://github.com/bfshi/scaling_on_scales"
]
[project.optional-dependencies]
train = ["deepspeed==0.13.5", "ninja", "wandb", "ipdb"]
ds_report:
[2024-09-10 08:07:30,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
DeepSpeed general environment info:
torch install path ............... ['miniconda3/envs/<env_name>/lib/python3.10/site-packages/torch']
torch version .................... 2.1.0+cu121
deepspeed install path ........... ['miniconda3/envs/<env_name>/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.13.5, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.3, cuda 12.1
shared memory (/dev/shm) size .... 373.87 GB
The CLIP model is typically ~300M parameters, the MLP is ~100M-300M, and the LLM is ~1.5-7B parameters. Ideally, I would like to somehow detach the vision encoder from the computational graph so I can use ZeRO3 on the LLM/connector only.
I will also note: this is based on the LLaVA codebase and therefore uses a HuggingFace Trainer wrapper.
Describe the bug
I am training a video-LLM, where I encode long videos with a varying number of forward passes to avoid OOM issues. I would like to use ZeRO3, but using a part of the model a different number of times causes the computational graph to differ across nodes/GPUs and ZeRO3 to hang.
I don't need to compute gradients for the video encoder, and would like to remove it from the computation graph entirely. I have tried (1) freezing the encoder, (2) applying the encoder under @torch.no_grad(), and (3) calling .detach() on the output tensors, but to no avail. How can I effectively 'remove' a part of the model from checkpointing, if at all possible? I can't pre-encode entire videos as this is too memory heavy for my setup.
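For concreteness, the attempts above look roughly like this (module and variable names are placeholders, not my actual code):

for p in vision_tower.parameters():
    p.requires_grad_(False)                              # (1) freeze the encoder

feats = []
with torch.no_grad():                                    # (2) run the encoder without grad
    for clip in video_frames.split(chunk_size, dim=0):   # chunk count varies per video
        feats.append(vision_tower(clip))
feats = torch.cat(feats, dim=0).detach()                 # (3) detach the encoder outputs

outputs = llm(mlp(feats), input_ids, labels=labels)

As far as I can tell, none of these help because ZeRO3 still partitions the encoder's weights and has to all-gather them for every chunk, and since the number of chunks differs across ranks the collectives stop lining up and training hangs.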
To Reproduce
Steps to reproduce the behavior: take any model, apply some part of it a different number of times on different ranks (see the sketch below), and train with a ZeRO3 config.
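A minimal sketch of the pattern (a toy model and a representative config, not my actual training script; launch with the deepspeed launcher on 2+ GPUs):

import torch
import deepspeed

class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = torch.nn.Linear(512, 512)     # stands in for the vision encoder
        self.head = torch.nn.Linear(512, 1)

    def forward(self, x, n_passes):
        for _ in range(n_passes):                    # encoder reused n_passes times
            x = self.encoder(x)
        return self.head(x).mean()

ds_config = {                                        # representative ZeRO3 config
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
    "zero_optimization": {"stage": 3},
}

model = Toy()
engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

rank = torch.distributed.get_rank()
x = torch.randn(4, 512, device=engine.device, dtype=torch.bfloat16)
loss = engine(x, n_passes=2 + rank)                  # different graph per rank -> hang
engine.backward(loss)
engine.step()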
Expected behavior
I would like to find a way to still encode long videos while using ZeRO3, since ZeRO3 becomes very important when I train larger LLMs.
ds_report output
See the ds_report output above; the failure manifests as NCCL hanging.
System info (please complete the following information):
Launcher context: using the deepspeed launcher with a hostfile.