pytorch / torchtitan

A native PyTorch Library for large model training
BSD 3-Clause "New" or "Revised" License

Add 3D support #344

Closed: wconstab closed this 3 weeks ago

wconstab commented 1 month ago

Stack from ghstack (oldest at bottom):

Enables PP+DP+TP and adds a CI test case that runs on the 8-GPU CI runner.
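For orientation, here is an illustrative sketch (not torchtitan's actual code) of how a global rank maps to coordinates on a 3D (pp, dp, tp) device mesh. It assumes the common layout used by `torch.distributed.device_mesh.init_device_mesh`, with `pp` as the outermost mesh dimension and `tp` as the innermost:

```python
def mesh_coords(rank: int, pp: int, dp: int, tp: int) -> tuple[int, int, int]:
    """Return (pp_index, dp_index, tp_index) for a global rank on a
    pp x dp x tp mesh, assuming row-major layout with tp innermost."""
    assert 0 <= rank < pp * dp * tp, "rank out of range for this mesh"
    tp_idx = rank % tp
    dp_idx = (rank // tp) % dp
    pp_idx = rank // (tp * dp)
    return pp_idx, dp_idx, tp_idx

# On an 8-GPU runner, a 2x2x2 mesh gives every rank a unique coordinate:
coords = [mesh_coords(r, pp=2, dp=2, tp=2) for r in range(8)]
```

Each rank then participates in one pipeline stage, one data-parallel replica group, and one tensor-parallel group according to its coordinate.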

wanchaol commented 4 weeks ago

looks like 8GPU CI failed?

[rank0]:[rank0]:     all_local_plans, global_metadata = planner.create_global_plan(all_local_plans)
[rank0]:[rank0]:                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]:   File "/opt/conda/envs/py_3.11/lib/python3.11/site-packages/torch/distributed/checkpoint/default_planner.py", line 121, in create_global_plan
[rank0]:[rank0]:     raise ValueError("Failed to validate global plan")

This looks like a DCP failure?
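To make the failing check concrete, here is a toy analogue of global plan validation (this is NOT DCP's real implementation, just a sketch of the idea): the coordinator merges every rank's local save plan and rejects the global plan if two ranks claim overlapping shards of the same tensor.

```python
def validate_global_plan(local_plans: list[dict[str, tuple[int, int]]]) -> bool:
    """Each local plan maps tensor name -> (offset, length) that rank
    intends to write. Reject the global plan on any overlapping claim."""
    claimed: dict[str, list[tuple[int, int]]] = {}
    for plan in local_plans:
        for name, (off, length) in plan.items():
            for o, l in claimed.get(name, []):
                # half-open intervals [off, off+length) and [o, o+l) overlap
                if off < o + l and o < off + length:
                    return False
            claimed.setdefault(name, []).append((off, length))
    return True
```

In real DCP, a `ValueError("Failed to validate global plan")` like the one above typically means the per-rank shard metadata did not fit together consistently.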

fegin commented 4 weeks ago

CI is still not happy even with https://github.com/pytorch/torchtitan/pull/360. Not sure what's going on; I could not reproduce the issue on a local machine.

wconstab commented 4 weeks ago

Sorry for the noise here. The DCP failure is the one I fixed in the FSDP mesh PR. I thought it would be included in last night's pytorch nightly 0604 since it landed yesterday, but it landed too late in the day.

The 0605 nightly should fix this. I have also updated this PR to include one more test, which reloads the saved PP checkpoint at step 10 and runs to step 20. This validates the additional optimizer flattening logic that is in @fegin's #360 PR lower in this stack.
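The property that test checks can be sketched with a toy scalar "model" standing in for the real training loop (a hedged illustration, not the actual test code): checkpointing at step 10 and resuming must reproduce the same final state as training straight through to step 20.

```python
def train_step(state: float, step: int) -> float:
    # deterministic toy update standing in for a real optimizer step
    return state + 0.5 * step

def train(state: float, start: int, end: int) -> float:
    for step in range(start, end):
        state = train_step(state, step)
    return state

straight = train(0.0, 0, 20)     # train straight through to step 20
ckpt = train(0.0, 0, 10)         # "save" a checkpoint at step 10
resumed = train(ckpt, 10, 20)    # "load" it and continue to step 20
assert resumed == straight       # resume must be bit-identical here
```

With real PP checkpoints, this round-trip additionally exercises the optimizer state flattening/unflattening path, which is why it validates the logic in #360.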

If CI passes tomorrow then I propose to land both PRs.