[Open] ebsmothers opened this issue 2 weeks ago
cc @weifengpy is there extra initialization logic we need to do for proper CPU offloading support? I know in the LoRA recipe you explicitly call `to_empty` on CPU for LoRA params, but I thought that was only necessary if we aren't loading those params from the state dict.
> is there extra initialization logic we need to do for proper CPU offloading support?
I do not expect extra init logic. If we set `fully_shard(..., offload_policy=CPUOffloadPolicy())`, FSDP2 will move parameters to CPU during state dict loading. Users do not need to call `to_empty` or `.cpu()` or anything: https://github.com/pytorch/pytorch/blob/ed86ac2f25e7ae5d4e432e1c85837e83f4bd17a6/torch/distributed/_composable/fsdp/_fsdp_param.py#L720
BTW: I tested your command and it works on my machine: `tune run --nproc_per_node 2 full_finetune_distributed --config llama2/7B_full fsdp_cpu_offload=True`. Verified that `model.parameters()` are on CPU at `optimizer.step`.
If you want to debug locally, a good starting point is figuring out the FQN of the problematic param. For example, print `[(name, param.device) for name, param in model.named_parameters()]` at forward, backward, and `optimizer.step`.
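A quick sketch of such a probe (plain PyTorch, no FSDP required; `dump_param_devices` is a hypothetical name). In the recipe you would call it at forward, backward, and `optimizer.step` to spot the offending FQN:

```python
import torch.nn as nn

def dump_param_devices(model: nn.Module, tag: str = ""):
    """Print and return (fqn, device) for every parameter in the model."""
    devices = [(name, str(p.device)) for name, p in model.named_parameters()]
    print(f"[{tag}] {devices}")
    return devices

# Example usage on a toy CPU model.
model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2))
devices = dump_param_devices(model, tag="init")
```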
@weifengpy thanks for taking a look. After forward, all params are on CUDA, not CPU. The error then occurs during backward (specifically during FSDP's `post_backward`). Btw, when you tested were you on PyTorch nightlies?
I have the same issue. The official release on PyPI works, but main does not. The recipe and the config are almost verbatim copies of torchtune's own, with very light modifications for instrumentation, so I doubt the issue comes from those modifications (and it used to work on the released version).
```shell
tune run --nnodes=1 --rdzv-backend=c10d --rdzv-endpoint=cn-n001.server.mila.quebec:29400 --nproc-per-node=8 -- \
    $RECIPES/recipes/full_finetune_distributed.py --config $CONFIG/configs/llama3_70B_full.yaml epochs=1 \
    output_dir=$OUTPUT/output tokenizer.path=$DATA/llama3_70B/original/tokenizer.model \
    checkpointer.checkpoint_dir=$DATA/llama3_70B checkpointer.output_dir=$DATA/llama3_70B/ \
    metric_logger.log_dir=$DATA/metrics 'repo_id="meta-llama/Meta-Llama-3.1-70B"' safetensors=true batch_size=2 gradient_accumulation_steps=1
```
(run at 2024-08-30 14:33:57)
```
[rank7]: Traceback (most recent call last):
[rank7]:   File "/home/mila/d/delaunap/milabench/benchmarks/llm/recipes/full_finetune_distributed.py", line 781, in <module>
[rank7]:     sys.exit(recipe_main())
[rank7]:   File "/home/mila/d/delaunap/milabench/benchmarks/llm/tune/torchtune/config/_parse.py", line 99, in wrapper
[rank7]:     sys.exit(recipe_main(conf))
[rank7]:   File "/home/mila/d/delaunap/milabench/benchmarks/llm/recipes/full_finetune_distributed.py", line 774, in recipe_main
[rank7]:     recipe.train()
[rank7]:   File "/home/mila/d/delaunap/milabench/benchmarks/llm/recipes/full_finetune_distributed.py", line 650, in train
[rank7]:     loss.backward()
[rank7]:   File "/tmp/workspace/cuda/results/venv/torch/lib/python3.10/site-packages/torch/_tensor.py", line 521, in backward
[rank7]:     torch.autograd.backward(
[rank7]:   File "/tmp/workspace/cuda/results/venv/torch/lib/python3.10/site-packages/torch/autograd/__init__.py", line 289, in backward
[rank7]:     _engine_run_backward(
[rank7]:   File "/tmp/workspace/cuda/results/venv/torch/lib/python3.10/site-packages/torch/autograd/graph.py", line 768, in _engine_run_backward
[rank7]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank7]:   File "/tmp/workspace/cuda/results/venv/torch/lib/python3.10/site-packages/torch/autograd/function.py", line 306, in apply
[rank7]:     return user_fn(self, *args)
[rank7]:   File "/tmp/workspace/cuda/results/venv/torch/lib/python3.10/site-packages/torch/distributed/_composable/fsdp/_fsdp_param_group.py", line 543, in backward
[rank7]:     ctx.param_group.post_backward()
[rank7]:   File "/tmp/workspace/cuda/results/venv/torch/lib/python3.10/site-packages/torch/distributed/_composable/fsdp/_fsdp_param_group.py", line 338, in post_backward
[rank7]:     self._post_reduce_event, self._partial_reduce_output = foreach_reduce(
[rank7]:   File "/tmp/workspace/cuda/results/venv/torch/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank7]:     return func(*args, **kwargs)
[rank7]:   File "/tmp/workspace/cuda/results/venv/torch/lib/python3.10/site-packages/torch/distributed/_composable/fsdp/_fsdp_collectives.py", line 245, in foreach_reduce
[rank7]:     fsdp_param.sharded_param.grad = new_sharded_dtensor_grad
[rank7]: RuntimeError: attempting to assign a gradient with device type 'cpu' to a tensor with device type 'cuda'. Please ensure that the gradient and the tensor are on the same device
  0%|          | 0/3251 [00:55<?, ?it/s]
W0830 14:37:46.674000 140676060272448 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2740966 closing signal SIGTERM
W0830 14:37:46.674000 140676060272448 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2740969 closing signal SIGTERM
E0830 14:37:56.528000 140676060272448 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 2740964) of binary: /tmp/workspace/cuda/results/venv/torch/bin/python
```
> Btw, when you tested were you on PyTorch nightlies?
I was testing on a one-week-old PyTorch build.
@Delaunay thanks, let me try to figure out what's happening here. @weifengpy can you share a pip list from your run? I will try to run on your env to see if the error goes away
> @weifengpy can you share a pip list from your run? I will try to run on your env to see if the error goes away
I just updated my PyTorch to latest. Will try again today and share my env if it works.
I can repro on latest PyTorch. Will debug it today.
Landing a torchtune-side fix by moving the state dict to CPU (which also saves memory): https://github.com/pytorch/torchtune/pull/1495
As a follow-up, I'll work with Andrew on a PyTorch-side fix so we can keep the state dict on GPU: https://github.com/pytorch/pytorch/pull/135179
Reopening, since the issue was automatically closed by the torchtune PR.
As in the title. I spent a bit of time debugging it but haven't figured out the cause yet; e.g. running the `tune run` command above gives the error in the traceback above.
We aren't passing the CPU offload policy to the top-level sharding of the model here, which also needs to be fixed. But even after that change the error persists, so I'm probably missing something obvious.
Once we fix this, we should also add a CPU offload test to catch regressions like this in the future.