OrenLeung opened this issue 1 week ago
Hi @OrenLeung, I can also repro this. We haven't worked on enabling float8 + compile + DDP yet, as we found that FSDP is significantly more common in jobs large enough to benefit from float8 training. Would you be open to using FSDP with NO_SHARD instead of DDP? Context: https://discuss.pytorch.org/t/difference-between-ddp-vs-fsdp-no-shard/209729
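For reference, a minimal sketch of what the suggested FSDP NO_SHARD wrapping could look like (assumes a torchrun launch; the toy model is a placeholder, not the 1.5B GPT from the report):

```python
# Sketch only: FSDP with ShardingStrategy.NO_SHARD behaves like DDP (no parameter
# sharding) but goes through the FSDP code path, which is the combination the
# float8 + compile work has targeted so far. Assumes launch via torchrun.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Placeholder model standing in for the real GPT 1.5B network.
model = nn.Sequential(nn.Linear(2048, 2048), nn.GELU(), nn.Linear(2048, 2048)).cuda()

# float8 conversion (if used) is typically applied before the FSDP wrap.
model = FSDP(model, sharding_strategy=ShardingStrategy.NO_SHARD, use_orig_params=True)
model = torch.compile(model)
```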
Hi Torch Team,
I am currently experimenting with native torch float8 distributed training using the delayed scaling recipe on a GPT 1.5B model with DDP, at batch=12, seq=1024, on an HGX 8xH100 (700W H100 SXM 80G SKU).

Currently, I am running into a DDP + torch.compile + float8 bug. Without torch.compile enabled, I don't run into this error. I have tried using #1306 as well as main@latest. Attached below are a self-contained reprod and the error trace.

Commands
Error Trace
Reprod Script
Torch Versions
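For context, below is a rough sketch of the float8 delayed scaling + torch.compile + DDP combination described above. It is not the attached reprod script; the toy model, dimensions, and wrapping order are illustrative assumptions, using the names current torchao versions expose (convert_to_float8_training, Float8LinearConfig, CastConfig, sync_float8_amax_and_scale_history).

```python
# Illustrative sketch only (not the attached reprod): float8 delayed scaling
# via torchao, wrapped in DDP, with torch.compile enabled. Assumes torchrun.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torchao.float8 import (
    CastConfig,
    Float8LinearConfig,
    ScalingType,
    convert_to_float8_training,
    sync_float8_amax_and_scale_history,
)

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Delayed scaling for the input, weight, and grad_output casts (the recipe in use).
config = Float8LinearConfig(
    cast_config_input=CastConfig(scaling_type=ScalingType.DELAYED),
    cast_config_weight=CastConfig(scaling_type=ScalingType.DELAYED),
    cast_config_grad_output=CastConfig(scaling_type=ScalingType.DELAYED),
)

# Toy stand-in for the GPT 1.5B model, kept in bf16 as is typical for float8 training.
model = nn.Sequential(nn.Linear(2048, 2048), nn.GELU(), nn.Linear(2048, 2048))
model = model.to(torch.bfloat16).cuda()
convert_to_float8_training(model, config=config)  # swaps nn.Linear -> Float8Linear

# Wrapping order here is an assumption; the real reprod may differ.
model = DDP(model, device_ids=[torch.cuda.current_device()])
model = torch.compile(model)  # the error only shows up with this line enabled

optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(12 * 1024, 2048, device="cuda", dtype=torch.bfloat16)
model(x).sum().backward()
sync_float8_amax_and_scale_history(model)  # required every step with delayed scaling
optim.step()
```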