Closed: wz337 closed this pull request 3 weeks ago.
I am curious if we have any experiments to see the performance difference with `fused=True`.
Gonna talk to Tianyu to learn how to run the perf experiments on the new 128 GPUs. This PR just adds the option to the config to allow it; the default behavior is still `foreach=True`. We can totally wait for the results before landing this.
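For context, a minimal sketch of what the opt-in looks like at the `torch.optim` level (illustrative only; the `fused` variable below stands in for the config knob, and torchtitan's actual wiring lives in its optimizer builder):

```python
import torch

model = torch.nn.Linear(4096, 4096, device="cuda")

fused = False  # hypothetical config knob; opt-in, so the default stays foreach
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    fused=fused,
    # fused and foreach are mutually exclusive in torch.optim, so keep the
    # multi-tensor foreach path whenever fused is off
    foreach=None if fused else True,
)
```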
Can we add some 8-GPU numbers at least? The 128-GPU run can be done separately.
@wanchaol @awgu Added the performance diff in the summary. I think we are comfortable offering this option in torchtitan?
this PR (`foreach=True`) shortened `opt.step` from 2000 ms to 200 ms. That's +10% e2e QPS on 16 H100 nodes (16 x 8 GPUs). I might need to refresh the 1D and 2D benchmarks based on this @drisspg
@weifengpy `foreach=True` used to be the default, so perhaps your package was built before https://github.com/pytorch/torchtitan/pull/386 landed. Without https://github.com/pytorch/torchtitan/pull/386, the optimizer would fall back to `foreach=False` when `fused=False`. 2000 ms for the optimizer step sounds like `foreach=False`.
Ah, got you. I checked the trace, and the 2000 ms indeed comes from `foreach=False`.
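As an aside, a hedged sketch of one way to confirm this from a trace (my own, not from the thread): profile a single `opt.step()` and inspect the op names, since fused AdamW shows up as a `_fused_adamw_` op, the foreach path as `_foreach_*` ops, and the single-tensor path as many small per-parameter ops:

```python
import torch
from torch.profiler import profile, ProfilerActivity

params = [torch.randn(256, 256, device="cuda", requires_grad=True) for _ in range(8)]
for p in params:
    p.grad = torch.randn_like(p)
opt = torch.optim.AdamW(params, lr=1e-3, fused=False, foreach=True)

opt.step()  # warmup so lazy state init doesn't pollute the trace
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    opt.step()

op_names = {evt.key for evt in prof.key_averages()}
print("saw fused op:   ", any("_fused_adam" in name for name in op_names))
print("saw foreach ops:", any("_foreach_" in name for name in op_names))
```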
With these three PRs landed, we can now support the option `fused=True` in torchtitan for the Adam and AdamW optimizers:

- https://github.com/pytorch/pytorch/pull/125369
- https://github.com/pytorch/pytorch/pull/126423
- https://github.com/pytorch/pytorch/pull/126750
Ran performance evaluation on 8 A100 DevGPUs: 1000 steps on 1D DP with the default `llama_8b.toml`.
Observation: for `fused=True` and `fused=False`, we observed similar loss curves and memory usage. wps is roughly +100 and MFU is +1.5-2% when `fused=True`. Below are the logs for the last 100 steps for both.
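For anyone wanting to reproduce the optimizer-step gap in isolation, here is a rough micro-benchmark sketch (my own, not the torchtitan harness; parameter sizes and counts are arbitrary):

```python
import torch

def bench_adamw(fused: bool, steps: int = 100) -> float:
    """Return average opt.step() time in ms for fused vs. foreach AdamW."""
    params = [torch.randn(1024, 1024, device="cuda", requires_grad=True)
              for _ in range(64)]
    for p in params:
        p.grad = torch.randn_like(p)
    opt = torch.optim.AdamW(params, lr=1e-3, fused=fused,
                            foreach=None if fused else True)
    for _ in range(10):  # warmup
        opt.step()
    torch.cuda.synchronize()
    start, end = (torch.cuda.Event(enable_timing=True) for _ in range(2))
    start.record()
    for _ in range(steps):
        opt.step()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / steps

print(f"foreach: {bench_adamw(fused=False):.3f} ms/step")
print(f"fused:   {bench_adamw(fused=True):.3f} ms/step")
```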