Closed: vkuzo closed this pull request 3 months ago.
Stack from ghstack (oldest at bottom):
- #273
- #271
Summary:
FSDP already ensures that each rank receives the same weights, so the weight amaxes are identical on every rank and syncing them across ranks is redundant.
I checked performance before/after on the multi-GPU benchmark and didn't see a significant impact on the toy model, but less comm volume is better regardless.
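To illustrate the idea, here is a minimal sketch (not the actual float8_experimental code; the helper name and tensor layout are hypothetical) of an amax sync that skips the collective for weight tensors:

```python
import torch
import torch.distributed as dist

def sync_amaxes(amaxes: dict[str, torch.Tensor]) -> None:
    """Hypothetical helper: all-reduce only the amaxes that can differ
    across ranks. Assumes the default process group is initialized."""
    for name, amax in amaxes.items():
        if "weight" in name:
            # FSDP hands every rank identical weights, so the weight amax
            # is already the same everywhere; no collective needed.
            continue
        # Activation and gradient amaxes depend on each rank's local batch,
        # so they must be max-reduced to agree across ranks.
        dist.all_reduce(amax, op=dist.ReduceOp.MAX)
```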
Test Plan:
./test_everything.sh passes
Reviewers:
Subscribers:
Tasks:
Tags:
Recreated in https://github.com/pytorch-labs/float8_experimental/pull/277 to get around ghstack weirdness.