pytorch / torchtitan

A native PyTorch Library for large model training
BSD 3-Clause "New" or "Revised" License

Validate 1D FSDP2 parity with FSDP1 #106

Closed by gnadathur 5 months ago

gnadathur commented 6 months ago

Validate that 1D FSDP2 matches FSDP1 in both throughput (QPS) and numerics, on an 8-GPU devGPU host and on 64 GPUs on AWS.
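One way such a parity check could be scripted is sketched below. This is only an illustration, not the procedure used in this issue: the helper names `first_loss_mismatch` and `throughput_gap`, the tolerances, and the assumption that each run logs one loss per step and a single QPS figure are all hypothetical.

```python
import math
from typing import Optional, Sequence


def first_loss_mismatch(
    fsdp1_losses: Sequence[float],
    fsdp2_losses: Sequence[float],
    rel_tol: float = 1e-5,  # hypothetical tolerance, tune per dtype
    abs_tol: float = 1e-8,  # hypothetical tolerance, tune per dtype
) -> Optional[int]:
    """Return the first step whose losses differ beyond tolerance, or None."""
    if len(fsdp1_losses) != len(fsdp2_losses):
        raise ValueError("both runs must log the same number of steps")
    for step, (ref, new) in enumerate(zip(fsdp1_losses, fsdp2_losses)):
        if not math.isclose(ref, new, rel_tol=rel_tol, abs_tol=abs_tol):
            return step
    return None


def throughput_gap(fsdp1_qps: float, fsdp2_qps: float) -> float:
    """Relative QPS change of FSDP2 versus FSDP1 (positive means FSDP2 is faster)."""
    return (fsdp2_qps - fsdp1_qps) / fsdp1_qps
```

For the comparison to be meaningful, both runs would need the same seed, data order, and model configuration. A return value of `None` from `first_loss_mismatch` means every logged step agreed within the chosen tolerances.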

awgu commented 6 months ago

On 8 H100 GPUs:

awgu commented 5 months ago

Full numerics validation on 8 GPUs is included in https://github.com/pytorch/torchtrain/pull/165.