acisseJZhong opened this issue 2 weeks ago
@ebsmothers, do you think it would make sense to ping someone from FSDP?
Could you try modifying the `init_process_group` call to use the `gloo` backend for CPU? Perhaps it should initialize both `nccl` for GPU and `gloo` for CPU?
https://github.com/pytorch/torchtune/blob/main/recipes/full_finetune_distributed.py#L903
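If we did go that route, the change would look roughly like the following (a minimal sketch only, assuming a recent PyTorch where the `device:backend` mapping string for `init_process_group` is supported):

```python
import torch.distributed as dist

# Register a backend for both device types so collectives work on
# CUDA tensors (via nccl) and CPU tensors (via gloo).
dist.init_process_group(backend="cuda:nccl,cpu:gloo")

# With backend="nccl" only, any collective on a CPU tensor raises
# "RuntimeError: No backend type associated with device type cpu".
```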
I don’t think we want to modify init_process_group here. To me that error indicates that we are trying to call some comms primitive on a tensor that’s already on CPU, which we shouldn’t be doing. Initializing process group on CPU would only be helpful if we actually want distributed training on CPU, which we don’t. Let’s debug a bit more and then we can loop in distributed folks if needed.
I believe when CPU offload is used in FSDP, gradients are transferred to CPU during the backward pass (to free up gradient memory, similar to optim in backward) so that the optimizer step can run on CPU. That's probably why you see a `cpu` device there: the gradients are on CPU now. They are `DTensor`s, so when you run gradient clipping, which calls `.sum()` or something similar, it will try to do an all-reduce, hence the error.
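To illustrate the failure mode outside of torchtune, here's a minimal repro sketch (a hypothetical script, run with something like `torchrun --nproc_per_node 2 repro.py` on a machine with 2 GPUs):

```python
import torch
import torch.distributed as dist

# GPU-only backend, as the recipe uses today.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank())  # single-node assumption

# A CPU tensor, standing in for a gradient that FSDP has offloaded to CPU.
cpu_tensor = torch.ones(4)

# Any collective on it has no CPU backend to dispatch to:
dist.all_reduce(cpu_tensor)
# -> RuntimeError: No backend type associated with device type cpu
```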
It's probably faster to check with the distributed folks whether FSDP with CPU offload supports gradient clipping in general. Even if it is technically possible (e.g. doing the clipping on CPU), I think it would be too slow and would possibly require changes to FSDP internals.
Looks like the torchtitan repo ran into the same issue and someone created a quick workaround in a special branch: https://github.com/pytorch/torchtitan/pull/622/files
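For reference, one possible shape for such a workaround (a sketch only, not necessarily what the linked branch does; `clip_grad_norm_cpu_offload` is a made-up helper) is to compute the local norm over the CPU shards and move just the scalar to GPU for the cross-rank reduction:

```python
import torch
import torch.distributed as dist


def clip_grad_norm_cpu_offload(parameters, max_norm: float) -> torch.Tensor:
    """Sketch: clip gradients that FSDP has offloaded to CPU by doing the
    cross-rank reduction on GPU, where the nccl backend is available."""
    grads = [p.grad for p in parameters if p.grad is not None]

    # Squared L2 norm of the local (CPU) shards; DTensor grads expose
    # their local shard via .to_local().
    local_sq = torch.zeros((), dtype=torch.float32)
    for g in grads:
        local = g.to_local() if hasattr(g, "to_local") else g
        local_sq += local.float().pow(2).sum()

    # Do the all-reduce on GPU so nccl can handle it.
    total_sq = local_sq.to("cuda")
    dist.all_reduce(total_sq)
    total_norm = total_sq.sqrt()

    # Scale the CPU gradients in place.
    clip_coef = (max_norm / (total_norm + 1e-6)).clamp(max=1.0).cpu()
    for g in grads:
        local = g.to_local() if hasattr(g, "to_local") else g
        local.mul_(clip_coef)
    return total_norm
```

Whether the extra CPU work and the CPU-to-GPU copy are acceptable is exactly the performance question raised above, so this is only meant to show that the error itself is avoidable.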
I am running the full finetune distributed recipe. When setting `clip_grad_norm: 1.0` and `fsdp_cpu_offload: True`, it raises the error `RuntimeError: No backend type associated with device type cpu`.
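For reference, I am launching it roughly like this (the config name and number of processes are just placeholders):

```
tune run --nproc_per_node 4 full_finetune_distributed \
  --config llama3_1/8B_full \
  clip_grad_norm=1.0 \
  fsdp_cpu_offload=True
```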
Full error stack trace:
Wondering how we should fix this error?