pytorch / ao

PyTorch native quantization and sparsity for training and inference
BSD 3-Clause "New" or "Revised" License

Enable CPU Offload for Intel GPU #1324

Closed dbyoung18 closed 3 hours ago

dbyoung18 commented 4 days ago

Background

The current CPU offload in torchao only supports the CUDA backend. We would like to add support for Intel GPU via the device option "xpu".
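As a rough illustration of the goal, here is a minimal usage sketch (hypothetical; whether the target device is passed explicitly or inferred from the parameters in this PR is an assumption):

```python
import torch
from torchao.prototype.low_bit_optim import CPUOffloadOptimizer

# Hypothetical sketch: place the model on Intel GPU ("xpu") instead of
# CUDA; parameters then stream between XPU and pinned CPU memory.
model = torch.nn.Linear(1024, 1024).to("xpu")

# CPUOffloadOptimizer wraps a regular optimizer class; optimizer states
# live on the CPU, and updated parameters are copied back to the
# accelerator asynchronously.
optim = CPUOffloadOptimizer(
    model.parameters(),
    torch.optim.AdamW,
    offload_gradients=True,  # optionally offload gradients too
)
```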

Details

pytorch-bot[bot] commented 4 days ago

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1324

Note: Links to docs will display an error until the docs builds have been completed.

:white_check_mark: No Failures

As of commit 03ac00f5f1d5a86de1e2dd36f7431ac6556291e7 with merge base 478d15b6b7d83aaadfafd07bda18d66399e1c2e1: :green_heart: Looks good so far! There are no failures yet. :green_heart:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

gau-nernst commented 1 day ago

@dbyoung18 Can you run ruff format and push the formatted code? CUDA nightly is failing because of bitsandbytes calling triton.ops (I think later versions of triton don't have triton.ops anymore: https://github.com/bitsandbytes-foundation/bitsandbytes/pull/1413). It's not related to this PR, but I'm not sure we can merge until that's fixed 😢. I think other PRs will be affected too.

Otherwise, everything else looks good already!

dbyoung18 commented 1 day ago

> @dbyoung18 Can you run ruff format and push the formatted code? CUDA nightly is failing because of bitsandbytes calling triton.ops (~I think later versions of triton don't have triton.ops anymore~ bitsandbytes-foundation/bitsandbytes#1413). It's not related to this PR, but I'm not sure we can merge until that's fixed 😢. I think other PRs will be affected too.
>
> Otherwise, everything else looks good already!

ruff format is done. Hope the bnb issue gets resolved soon. Thanks again for your review and quick feedback :)

gau-nernst commented 5 hours ago

@dbyoung18 Can you merge from main? #1343 should fix the bnb issue.

Also, can you update the doc here? https://github.com/pytorch/ao/tree/main/torchao/prototype/low_bit_optim#optimizer-cpu-offload

After that we are good to merge :smiley:

dbyoung18 commented 4 hours ago

> @dbyoung18 Can you merge from main? #1343 should fix the bnb issue.
>
> Also, can you update the doc here? https://github.com/pytorch/ao/tree/main/torchao/prototype/low_bit_optim#optimizer-cpu-offload
>
> After that we are good to merge 😃

Done for both. We plan to gradually support torchao & PyTorch core on Intel GPU. This PR covers CPU offload only; I will look into the remaining parts of the low-bit optimizers as a next step. Since we are also in the process of upstreaming the FlashAttention backend to PyTorch core (targeting v2.6 or v2.7), I would like to add benchmark data to the README once that is ready. So for now, I only modified the README so that the CPU-Offload section covers the XPU scope. Thanks for the review, and I am looking forward to making further contributions soon. 😃

gau-nernst commented 4 hours ago

Sounds good! The low-bit optimizers rely entirely on the tensor subclass + torch.compile() stack, so as long as there is a triton build that supports the XPU backend, they should work out of the box!
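For illustration, a quick smoke test along those lines (a hypothetical sketch, assuming a PyTorch build with XPU support; `scaled_add` is just an arbitrary toy function):

```python
import torch

# Hypothetical smoke test: if torch.compile can lower a simple
# elementwise kernel to triton on XPU, the tensor-subclass-based
# low-bit optimizers should be able to use the same stack.
if torch.xpu.is_available():
    @torch.compile
    def scaled_add(x, y):
        return x + 0.5 * y

    a = torch.randn(1024, device="xpu")
    b = torch.randn(1024, device="xpu")
    out = scaled_add(a, b)
    print(out.device)  # expected: xpu:0
```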