pytorch / ao

PyTorch native quantization and sparsity for training and inference

[ROCm] Unable to Run FPX Weights #967

Open Beinsezii opened 1 month ago

Beinsezii commented 1 month ago

Compiling ao from source using pip install git+https://github.com/pytorch/ao.git results in a very fun throw

NotImplementedError: Could not run 'torchao::quant_llm_linear' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'torchao::quant_llm_linear' is only available for these backends: [Meta, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMeta, Tracer, AutocastCPU, AutocastXPU, AutocastMPS, AutocastCUDA, FuncTorchBatched, BatchedNestedTensor, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PreDispatch, PythonDispatcher].

when running FPX weights using the script below:

import torch
from diffusers.pipelines.stable_diffusion_xl.pipeline_stable_diffusion_xl import StableDiffusionXLPipeline
from torchao.quantization import fpx_weight_only, quantize_

@torch.no_grad()
def main():
    # Load SDXL in float16 (the base dtype the floatx readme recommends) and move it to the GPU
    pipe = StableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")
    # Quantize the UNet weights to FP6 (3 exponent bits, 2 mantissa bits)
    quantize_(pipe.unet, fpx_weight_only(3, 2))
    pipe(
        prompt="high resolution dslr photograph of a kitten in a field of flowers",
        negative_prompt="blurry, noisy, cropped",
        num_inference_steps=20,
        guidance_scale=5,
        generator=torch.Generator("cuda").manual_seed(0),  # pipelines take a generator rather than a seed kwarg
    ).images[0].save("fp6.png")

if __name__ == "__main__":
    main()

Setup is 1x 7900 XTX on torch 2.5+rocm6.2. All other quantizations work just fine, with the exception of float8_dynamic_activation_float8_weight, because torch's _scaled_mm() is not currently implemented for gfx11.

Using bfloat16 as the base dtype instead does actually run, but it's wicked slow from all the conversions. The floatx readme says to use float16, so I assume that's the correct way.
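For reference, the bfloat16 variant is just the same script with the base dtype swapped, roughly:

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16
).to("cuda")
quantize_(pipe.unet, fpx_weight_only(3, 2))  # same quantization call; only the base dtype differs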

Python traceback: traceback.txt

gau-nernst commented 1 month ago

FPx quantization is backed by a custom CUDA kernel, so it is not available on ROCm.

https://github.com/pytorch/ao/tree/main/torchao/csrc/cuda/fp6_llm

It's strange that it runs with bfloat16 though, so perhaps it is slow precisely because it doesn't use the CUDA kernel. I don't know ROCm well enough, but maybe it's not so hard to port it to ROCm.
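If you want to double-check what actually got registered on your install, one thing you could try (a debugging sketch using a private PyTorch API, not something from torchao's docs) is dumping the dispatcher table for the op:

import torch
import torchao  # importing torchao registers its custom ops if the C++/CUDA extension was built

# On a working CUDA build this should list a CUDA kernel; if only Meta/fallback
# entries show up, the extension was built without the fpx kernel.
print(torch._C._dispatch_dump("torchao::quant_llm_linear"))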

Beinsezii commented 1 month ago

It actually compiles something when I install from source; I see 5 threads light up. I thought torch used the hipify script on C extensions to try to auto-convert the code? Usually, if something isn't supported by ROCm, I'd expect it to be caught when the wheel builds. Additionally, the error is different between the source-compiled build and the pip wheel. I can fetch the pip version's error later, but it's a lot more boring, essentially just saying that the function doesn't exist.

gau-nernst commented 1 month ago

Interesting. I don't know much about how PyTorch handles building for ROCm.

Can you run this script? https://github.com/pytorch/ao/blob/main/benchmarks/benchmark_fp6.py

It will help verify whether you can run the FPx kernel correctly.
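If the full benchmark is a hassle, a smaller smoke test along the same lines (just a sketch on my part) is to quantize a single nn.Linear and run one forward pass, which I believe hits torchao::quant_llm_linear directly:

import torch
from torchao.quantization import quantize_, fpx_weight_only

# FP6 (3 exponent bits, 2 mantissa bits) weight-only quantization on one layer
linear = torch.nn.Linear(4096, 4096, bias=False, dtype=torch.float16, device="cuda")
quantize_(linear, fpx_weight_only(3, 2))

x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
with torch.no_grad():
    print(linear(x).shape)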

Beinsezii commented 1 month ago

Same exact traceback as my original post.

The one example I know of that works on both ROCm and CUDA is exllama. It uses torch cpp_extensions in ext.py, and the file list is a pretty good chunk of cpp/cu sources. Combing through the code, there's almost no HIP/ROCm-specific code, since the hipify script swaps out all references to libraries like cuBLAS for the ROCm equivalents.
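Roughly what that looks like (my understanding of the exllama-style approach, not torchao's actual build, and with hypothetical file names): torch.utils.cpp_extension hipifies .cu sources automatically when the installed torch is a ROCm build, so a single CUDA source tree serves both backends.

from torch.utils.cpp_extension import load

# Hypothetical extension name and sources for illustration; on a ROCm torch build,
# cpp_extension runs hipify over the CUDA sources (e.g. cuBLAS calls -> hipBLAS) before compiling.
ext = load(
    name="my_ext",
    sources=["ext.cpp", "my_kernel.cu"],
    verbose=True,
)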