pytorch / ao

PyTorch native quantization and sparsity for training and inference
BSD 3-Clause "New" or "Revised" License

float8 training: move module attribute setting to sync function #1341

Closed vkuzo closed 8 hours ago

vkuzo commented 13 hours ago

Summary:

This PR moves the setting of the is_amax_initialized flag on Float8Linear out of the module and into the sync_float8_amax_and_scale_history function.

There are two reasons for this:

  1. The current logic does not work with torchtitan + delayed scaling + activation checkpointing (AC); it fails with https://gist.github.com/vkuzo/70819a2cffb9346bf44ecd9079b8bf51 .
  2. In general, stateful logic such as changing module attributes adds complexity. Even if we fix (1) in compile land, something else could break.

The sync_float8_amax_and_scale_history function is a good home for this logic: it is already called outside of the main model forward/backward, it is already required to be called at every iteration, and it does not need to know about AC. That makes it a natural place to stash logic that isn't easily compilable, such as this init code.
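To illustrate the pattern, here is a minimal pure-Python sketch. It is not the torchao implementation: ToyFloat8Linear, toy_sync_amax_and_scale_history, and the scale formula are hypothetical stand-ins. Only the idea comes from this PR: forward never flips the initialization flag, and the per-iteration sync function does.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyFloat8Linear:
    """Hypothetical stand-in for a delayed-scaling linear layer."""

    is_amax_initialized: bool = False
    amax_history: List[float] = field(default_factory=list)
    scale: float = 1.0

    def forward(self, x: List[float]) -> List[float]:
        # Record the observed absolute maximum. The init flag is NOT
        # touched here, so forward carries no flag-setting logic.
        self.amax_history.append(max(abs(v) for v in x))
        return [v * self.scale for v in x]


def toy_sync_amax_and_scale_history(modules: List[ToyFloat8Linear]) -> None:
    """Hypothetical sync step, called once per iteration outside forward/backward."""
    for m in modules:
        if not m.amax_history:
            continue  # nothing observed yet, leave the module untouched
        # Toy scale update (an assumption, not the real float8 formula).
        m.scale = 1.0 / max(max(m.amax_history), 1e-12)
        # The stateful attribute change lives here, not in forward.
        m.is_amax_initialized = True


layer = ToyFloat8Linear()
out = layer.forward([0.5, -2.0, 1.0])
flag_after_forward = layer.is_amax_initialized
toy_sync_amax_and_scale_history([layer])
flag_after_sync = layer.is_amax_initialized
```

Because the flag only changes inside the sync call, the forward path stays the same whether or not it is recomputed under AC or traced by a compiler.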

After this PR, the enable_amax_init and enable_pre_and_post_forward config options become no-ops. In a future PR we should add a deprecation warning for them, and eventually remove them.
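One possible shape for that future deprecation warning is sketched below. ToyFloat8Config is a hypothetical class, and the None defaults and warning text are assumptions made for illustration; only the two option names come from this PR.

```python
import warnings
from dataclasses import dataclass
from typing import Optional

# Option names taken from the PR text; everything else here is hypothetical.
_NO_OP_OPTIONS = ("enable_amax_init", "enable_pre_and_post_forward")


@dataclass
class ToyFloat8Config:
    # None means "not set by the user"; any explicit value triggers a warning.
    enable_amax_init: Optional[bool] = None
    enable_pre_and_post_forward: Optional[bool] = None

    def __post_init__(self) -> None:
        for name in _NO_OP_OPTIONS:
            if getattr(self, name) is not None:
                warnings.warn(
                    f"{name} is a no-op and will be removed in a future release",
                    DeprecationWarning,
                    stacklevel=3,  # point at the caller that built the config
                )


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    ToyFloat8Config()                        # no explicit options: no warning
    ToyFloat8Config(enable_amax_init=False)  # explicit no-op option: one warning

deprecation_messages = [str(w.message) for w in caught]
```

Using a None sentinel lets the config distinguish "left at default" from "explicitly set", so only users who still pass the options see the warning.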

Test Plan:

// this repo
./test/float8/test_everything.sh

// torchtitan
// requires https://github.com/pytorch/torchtitan/pull/698
with-proxy CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh --float8.enable_float8_linear --training.compile --float8.scaling_type_input delayed --float8.scaling_type_weight delayed --float8.scaling_type_grad_output delayed


pytorch-bot[bot] commented 13 hours ago

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1341

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 1ade9c854da5242c05b6e24ce08892c8d5303f4e with merge base 2843388de0ba5ae5af8891ad000178e1e57e731e:

NEW FAILURE - The following job has failed:

* [Run Regression Tests / test-nightly (CUDA Nightly, linux.g5.12xlarge.nvidia.gpu, --pre torch --index-url https://downloa... / linux-job](https://hud.pytorch.org/pr/pytorch/ao/1341#33493705866) ([gh](https://github.com/pytorch/ao/actions/runs/12015505771/job/33493705866)) `RuntimeError: Command docker exec -t 760beda19f43769fee08feb3dc5aba4a686564f8f9d57497d9f2985a9c720bcf /exec failed with exit code 2`

This comment was automatically generated by Dr. CI and updates every 15 minutes.