pytorch / torchtitan

A native PyTorch Library for large model training
BSD 3-Clause "New" or "Revised" License

Converting to checkpoint.pt is not working #307

Closed viai957 closed 1 month ago

viai957 commented 2 months ago

I followed all the instructions in checkpoint.md. The command below ran without errors, but the checkpoint.pt file was not created. I searched the whole directory and could not find it anywhere.

python -m torch.distributed.checkpoint.format_utils dcp_to_torch torchtitan/outputs/checkpoint/step-500 checkpoint.pt

Converting checkpoint from torchtitan/outputs/checkpoint/step-500 to checkpoint.pt using method: 'dcp_to_torch'

viai957 commented 2 months ago

Can anybody help me with this?

chrisociepa commented 2 months ago

It's a bug in torch.distributed.checkpoint.format_utils, and it's already fixed in the main branch ( https://github.com/pytorch/pytorch/blob/main/torch/distributed/checkpoint/format_utils.py#L265 ). The problem was caused by a missing .value in elif args.mode == FormatMode.DCP_TO_TORCH.value:. Without it, the mode string from the command line is compared against the enum member itself, so the dcp_to_torch branch never runs.
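
To illustrate why the missing .value matters, here is a minimal sketch. It is a simplified stand-in for the dispatch logic, not PyTorch's actual source: the enum values and the dispatch_buggy / dispatch_fixed helpers are assumptions made for illustration. It shows that a plain string never compares equal to a plain Enum member, so the branch is skipped.

```python
from enum import Enum


class FormatMode(Enum):
    # Assumed values, mirroring the mode names accepted on the command line.
    TORCH_TO_DCP = "torch_to_dcp"
    DCP_TO_TORCH = "dcp_to_torch"


def dispatch_buggy(mode: str) -> str:
    """Hypothetical dispatch with the bug: the second branch omits .value."""
    if mode == FormatMode.TORCH_TO_DCP.value:
        return "torch_to_dcp"
    elif mode == FormatMode.DCP_TO_TORCH:  # str vs. Enum member: never equal
        return "dcp_to_torch"
    return "no conversion"


def dispatch_fixed(mode: str) -> str:
    """Hypothetical dispatch with the fix: both branches compare against .value."""
    if mode == FormatMode.TORCH_TO_DCP.value:
        return "torch_to_dcp"
    elif mode == FormatMode.DCP_TO_TORCH.value:
        return "dcp_to_torch"
    return "no conversion"


if __name__ == "__main__":
    print(dispatch_buggy("dcp_to_torch"))
    print(dispatch_fixed("dcp_to_torch"))
```

In this sketch the buggy version falls through for "dcp_to_torch", which matches the symptom above: the tool prints its banner but nothing is written.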

I use my own script for the conversion that is a little more customized. You can find it here: https://github.com/chrisociepa/allamo/blob/fsdp2/scripts/convert_dcp.py
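
For anyone on a PyTorch build that predates the fix, one possible workaround is to call the conversion function directly instead of going through the command-line entry point. The sketch below assumes the installed PyTorch exposes dcp_to_torch_save in torch.distributed.checkpoint.format_utils; the convert_dcp_to_torch wrapper and its checks are hypothetical and are not taken from the linked script.

```python
import os


def convert_dcp_to_torch(dcp_dir: str, out_path: str) -> str:
    """Convert a DCP checkpoint directory into a single torch.save file.

    Hypothetical helper: it bypasses the CLI dispatch and calls the
    library function directly.
    """
    if not os.path.isdir(dcp_dir):
        raise FileNotFoundError(f"DCP checkpoint directory not found: {dcp_dir}")

    # Imported lazily so the path check above runs even if torch is missing.
    # Assumption: this function exists in the installed PyTorch version.
    from torch.distributed.checkpoint.format_utils import dcp_to_torch_save

    dcp_to_torch_save(dcp_dir, out_path)

    # Fail loudly instead of silently producing nothing.
    if not os.path.isfile(out_path):
        raise RuntimeError(f"Conversion finished but {out_path} was not created")
    return os.path.abspath(out_path)


if __name__ == "__main__":
    # Paths taken from the command in the original report.
    print(convert_dcp_to_torch("torchtitan/outputs/checkpoint/step-500", "checkpoint.pt"))
```

The final existence check is a design choice for this sketch: it turns the silent failure reported above into an explicit error.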

viai957 commented 1 month ago

Oh, I see. Thank you, @chrisociepa.

XinDongol commented 1 month ago

Had the same issue here and @chrisociepa's script is useful to me.

wz337 commented 1 month ago

As @chrisociepa mentioned, the fix ( https://github.com/pytorch/pytorch/pull/1234070 ) has landed in main, so I'm closing the issue.