Hi @OrenLeung , thanks for the repro! This looks like a bug in how we handle delayed scaling + autocast, let me take a look.
Thanks again for the report! https://github.com/pytorch/ao/pull/1306 should fix this. With that PR on my H100 machine:
@vkuzo Thanks for the quick fix!
I am guessing you ran your benchmark on the 500W H100 version?
I can confirm the fix using #1306! I am seeing the following:
> I am guessing you ran your benchmark on the 500W H100 version?
Yes, that's correct.
closing since the fix landed
Hi Torch Team,
I am currently experimenting with native torch float8 training and comparing it against Transformer Engine with the delayed scaling recipe, on GPT 1.5B at batch=12, seq=1024, on a 700W H100 SXM 80G SKU.
I see that fp8 Transformer Engine provides a slight perf improvement over autocast bf16, but unfortunately torchao.float8 is almost 2x slower. I attempted to improve performance by enabling fp8 and bf16 autocast at the same time, but unfortunately I ran into the following error:
ValueError: All layers must have the same last seen input_dtype, got {torch.float32, torch.bfloat16}
Enabling fp8 together with bf16 autocast is something that TE does, but I'm not sure whether it is needed for torchao. Can you provide some guidance on how to improve performance with torchao.float8?
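For reference, this is roughly the shape of my torchao.float8 setup (a minimal sketch following the delayed-scaling example in the torchao float8 README; the toy model, loss, and training loop here are stand-ins for the actual GPT 1.5B script in the repro section below):

```python
# Minimal sketch; assumptions: API names follow the torchao.float8 README for
# delayed scaling, and the tiny model below is a placeholder for the GPT 1.5B repro.
import torch
import torch.nn as nn
from torchao.float8 import (
    CastConfig,
    Float8LinearConfig,
    ScalingType,
    convert_to_float8_training,
    sync_float8_amax_and_scale_history,
)

# Placeholder model; the real run uses GPT 1.5B at batch=12, seq=1024.
model = nn.Sequential(
    nn.Linear(2048, 8192),
    nn.GELU(),
    nn.Linear(8192, 2048),
).cuda()

# Delayed scaling for the input, weight, and grad_output casts.
config = Float8LinearConfig(
    cast_config_input=CastConfig(scaling_type=ScalingType.DELAYED),
    cast_config_weight=CastConfig(scaling_type=ScalingType.DELAYED),
    cast_config_grad_output=CastConfig(scaling_type=ScalingType.DELAYED),
)
convert_to_float8_training(model, config=config)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(12, 1024, 2048, device="cuda")

for _ in range(3):
    # Running the float8-converted model under bf16 autocast (the combination
    # TE supports) is what raised the ValueError above in my GPT repro.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(x).float().pow(2).mean()  # placeholder loss
    loss.backward()
    # Delayed scaling needs the amax/scale history synced before the optimizer step.
    sync_float8_amax_and_scale_history(model)
    optimizer.step()
    optimizer.zero_grad()
```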
Thanks!
Repro Script
Dependencies