pytorch / torchtitan

A native PyTorch Library for large model training
BSD 3-Clause "New" or "Revised" License

Add Pipeline Parallel (and 2D PP+FSDP) support #318

Closed wconstab closed 1 month ago

wconstab commented 1 month ago

Stack from ghstack (oldest at bottom):

Runs PP+DP and PP+TP without issue. Runs PP+TP+DP with decreasing loss, but fails on DCP save.

Currently supports only the simple schedules, GPipe and 1F1B.
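For readers unfamiliar with the two schedules, here is a rough sketch of the per-stage order of forward (`F`) and backward (`B`) microbatch steps. It is purely illustrative and is not the code in this PR; the function names and the `(kind, microbatch)` tuple format are made up for the example, and communication and timing are ignored.

```python
def gpipe_order(n_micro: int) -> list[tuple[str, int]]:
    """GPipe: run every forward microbatch, then every backward microbatch."""
    forwards = [("F", i) for i in range(n_micro)]
    backwards = [("B", i) for i in range(n_micro)]
    return forwards + backwards


def one_f_one_b_order(stage: int, n_stages: int, n_micro: int) -> list[tuple[str, int]]:
    """1F1B: a short forward warmup, then alternate one forward and one backward."""
    # Earlier stages need more warmup forwards before their first backward arrives.
    warmup = min(n_stages - stage - 1, n_micro)
    order = [("F", i) for i in range(warmup)]
    next_fwd, next_bwd = warmup, 0
    # Steady state: one forward followed by one backward.
    while next_fwd < n_micro:
        order.append(("F", next_fwd))
        next_fwd += 1
        order.append(("B", next_bwd))
        next_bwd += 1
    # Cooldown: drain the remaining backward steps.
    while next_bwd < n_micro:
        order.append(("B", next_bwd))
        next_bwd += 1
    return order


if __name__ == "__main__":
    print(gpipe_order(3))
    print(one_f_one_b_order(stage=0, n_stages=2, n_micro=3))
```

The practical difference is peak activation memory: in the GPipe order a stage holds activations for all microbatches before any backward runs, while in 1F1B the number in flight is bounded by the warmup depth plus one.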

Adds a cmdline/toml arg for specifying split points, in a unified way across the tracer and manual frontends.

E.g. the user can specify "layers.2,layers.4" as split points.
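As a minimal sketch of how such a string could map onto stages, assuming each split point names the first module of a new stage (an assumption made for this example, not something stated above), the helpers below are hypothetical and are not the functions added in this PR:

```python
def parse_split_points(spec: str) -> list[str]:
    """Turn a comma-separated string such as "layers.2,layers.4" into a list of names."""
    return [part.strip() for part in spec.split(",") if part.strip()]


def partition_layers(layer_names: list[str], split_points: list[str]) -> list[list[str]]:
    """Group layer names into stages, starting a new stage at each split point.

    Assumption: a split point is the FIRST module of the stage it opens.
    """
    unknown = [p for p in split_points if p not in layer_names]
    if unknown:
        raise ValueError(f"unknown split points: {unknown}")
    stages: list[list[str]] = [[]]
    for name in layer_names:
        # Open a new stage at a split point, unless the current stage is still empty.
        if name in split_points and stages[-1]:
            stages.append([])
        stages[-1].append(name)
    return stages


if __name__ == "__main__":
    names = [f"layers.{i}" for i in range(6)]
    for idx, stage in enumerate(partition_layers(names, parse_split_points("layers.2,layers.4"))):
        print(idx, stage)
```

Under that assumption, two split points on a six-layer model yield three stages of two layers each.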

Currently uses manual frontend by default, but allows specifying tracer frontend. Tracer frontend requires working around additional compatibility limitations, indicated by raising assertions, and is not ready for wider use yet.

wanchaol commented 1 month ago

@wconstab looks like CI is failing now. Is it because the APIs for PP are not in nightly yet? If so, we should probably wait until they land in nightly and then reland this.