dbyoung18 closed this pull request 3 hours ago.
Note: Links to docs will display an error until the docs builds have been completed.
As of commit 03ac00f5f1d5a86de1e2dd36f7431ac6556291e7 with merge base 478d15b6b7d83aaadfafd07bda18d66399e1c2e1: 💚 Looks good so far! There are no failures yet. 💚
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@dbyoung18 Can you run `ruff format` and push the formatted code? CUDA nightly is failing because of bitsandbytes calling `triton.ops` (~~I think later versions of triton don't have `triton.ops` anymore~~ https://github.com/bitsandbytes-foundation/bitsandbytes/pull/1413). It's not related, but I'm not sure if we can merge until that is fixed 😢. I think other PRs will be affected too.

Otherwise, everything else looks good already!
Done with `ruff format`. Hope the bnb issue can be resolved soon. Thanks again for your review and quick feedback :)
@dbyoung18 Can you merge from main? #1343 should fix the bnb issue.
Also, can you update the doc here? https://github.com/pytorch/ao/tree/main/torchao/prototype/low_bit_optim#optimizer-cpu-offload
After that we are good to merge 😃
Done for both. We plan to gradually support torchao & PyTorch core on Intel GPU. This PR covers CPU offload only; I will look into the remaining low-bit optimizers as a next step. Since we are also in the middle of upstreaming the FlashAttention backend to PyTorch core (targeting v2.6 or v2.7), I would like to add benchmark data to the README once that is ready. So for now, I have only modified the README to extend the CPU offload section to cover XPU. Thanks for the review, and I look forward to making further contributions soon. 😃
Sounds good! The low-bit optimizers rely entirely on the tensor subclass + torch.compile() stack, so as long as there is a triton build that supports the XPU backend, they should work out of the box!
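For illustration, here is a minimal sketch of what CPU offload on XPU could look like after this PR. It assumes the `CPUOffloadOptimizer` API documented in `torchao/prototype/low_bit_optim` (linked above) and a PyTorch build with XPU support; the exact keyword arguments may differ from this PR's final API.

```python
# Hedged sketch: assumes torchao's CPUOffloadOptimizer and a PyTorch build
# with XPU support (torch.xpu). Not the verbatim API from this PR.
import torch
from torchao.prototype.low_bit_optim import CPUOffloadOptimizer

model = torch.nn.Linear(1024, 1024).to("xpu")  # parameters live on the Intel GPU
optim = CPUOffloadOptimizer(
    model.parameters(),
    torch.optim.AdamW,        # optimizer states are kept and updated on CPU
    offload_gradients=False,  # True would also offload gradients (incompatible with grad accumulation)
)

x = torch.randn(8, 1024, device="xpu")
model(x).sum().backward()
optim.step()                  # D2H copy of grads, CPU update, H2D copy of params
optim.zero_grad()
```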
Background
CPU offload in torchao currently supports only the CUDA backend. We would like to add support for Intel GPU via the device option "xpu".
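To give a flavor of the kind of change involved (an illustrative sketch only; the helper name below is hypothetical, not from this PR), CUDA-specific stream handling can be routed through the matching `torch.xpu` APIs, which mirror `torch.cuda`:

```python
# Illustrative sketch: torch.cuda and torch.xpu expose matching
# Stream/stream/synchronize APIs, so offload code can dispatch on device type.
# _device_module is a hypothetical helper, not from this PR.
import torch

def _device_module(device: torch.device):
    return torch.xpu if device.type == "xpu" else torch.cuda

device = torch.device("xpu" if torch.xpu.is_available() else "cuda")
mod = _device_module(device)

stream = mod.Stream()
with mod.stream(stream):
    # queue async H2D/D2H copies of offloaded optimizer state here
    pass
mod.synchronize()
```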
Details