dbyoung18 closed this pull request 3 hours ago.
Note: Links to docs will display an error until the docs builds have been completed.
As of commit 03ac00f5f1d5a86de1e2dd36f7431ac6556291e7 with merge base 478d15b6b7d83aaadfafd07bda18d66399e1c2e1: 💚 Looks good so far! There are no failures yet. 💚
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@dbyoung18 Can you run `ruff format` and push the formatted code? CUDA nightly is failing because of bitsandbytes calling `triton.ops` (~~I think later versions of triton don't have `triton.ops` anymore~~ https://github.com/bitsandbytes-foundation/bitsandbytes/pull/1413). It's not related, but I'm not sure if we can merge until that is fixed 😢. I think other PRs will be affected too.

Otherwise, everything else looks good already!
Done with `ruff format`. Hope the bnb issue can be resolved soon. Thanks again for your review and quick feedback :)
@dbyoung18 Can you merge from main? #1343 should fix the bnb issue.
Also, can you update the doc here? https://github.com/pytorch/ao/tree/main/torchao/prototype/low_bit_optim#optimizer-cpu-offload
After that we are good to merge 😃
Done for both. We plan to gradually support torchao & PyTorch core on Intel GPU. This PR covers CPU offload only; I will look into the remaining low-bit optimizers as a next step. Since we are also in the middle of upstreaming the FlashAttention backend to PyTorch core (targeting v2.6 or v2.7), I would like to add benchmark data to the README once that is ready. So for now, I have only modified the README to extend the CPU offload section to cover XPU. Thanks for the review, and I look forward to making further contributions soon. 😃
Sounds good! The low-bit optimizers rely entirely on the tensor subclass + torch.compile() stack, so as long as there is a triton build that supports the XPU backend, they should work out of the box!
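For illustration, here is a minimal sketch of what CPU offload on XPU could look like after this PR. It assumes the `CPUOffloadOptimizer` API documented in `torchao/prototype/low_bit_optim` (linked above) and a PyTorch build with XPU support; the exact keyword arguments may differ from this PR's final API.

```python
# Hedged sketch: assumes torchao's CPUOffloadOptimizer and a PyTorch build
# with XPU support (torch.xpu). Not the verbatim API from this PR.
import torch
from torchao.prototype.low_bit_optim import CPUOffloadOptimizer

model = torch.nn.Linear(1024, 1024).to("xpu")  # parameters live on the Intel GPU
optim = CPUOffloadOptimizer(
    model.parameters(),
    torch.optim.AdamW,        # optimizer states are kept and updated on CPU
    offload_gradients=False,  # True would also offload gradients (incompatible with grad accumulation)
)

x = torch.randn(8, 1024, device="xpu")
model(x).sum().backward()
optim.step()                  # D2H copy of grads, CPU update, H2D copy of params
optim.zero_grad()
```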
Background
CPU offload in torchao currently supports only the CUDA backend. We would like to add support for Intel GPU via the device option "xpu".
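To give a flavor of the kind of change involved (an illustrative sketch only; the helper name below is hypothetical, not from this PR), CUDA-specific stream handling can be routed through the matching `torch.xpu` APIs, which mirror `torch.cuda`:

```python
# Illustrative sketch: torch.cuda and torch.xpu expose matching
# Stream/stream/synchronize APIs, so offload code can dispatch on device type.
# _device_module is a hypothetical helper, not from this PR.
import torch

def _device_module(device: torch.device):
    return torch.xpu if device.type == "xpu" else torch.cuda

device = torch.device("xpu" if torch.xpu.is_available() else "cuda")
mod = _device_module(device)

stream = mod.Stream()
with mod.stream(stream):
    # queue async H2D/D2H copies of offloaded optimizer state here
    pass
mod.synchronize()
```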
Details