lllyasviel / stable-diffusion-webui-forge


FLUX NF4 cannot generate image on AMD GPU #1269

Closed M4TH1EU closed 2 months ago

M4TH1EU commented 2 months ago

When trying to generate an image, I get this error:

File "/home/mathieu/Documents/Local/AI/ai-suite-rocm/stablediffusion-forge-rocm/webui/modules_forge/main_thread.py", line 30, in work
    self.result = self.func(*self.args, **self.kwargs)
  File "/home/mathieu/Documents/Local/AI/ai-suite-rocm/stablediffusion-forge-rocm/webui/modules/txt2img.py", line 110, in txt2img_function
    processed = processing.process_images(p)
  File "/home/mathieu/Documents/Local/AI/ai-suite-rocm/stablediffusion-forge-rocm/webui/modules/processing.py", line 813, in process_images
    res = process_images_inner(p)
  File "/home/mathieu/Documents/Local/AI/ai-suite-rocm/stablediffusion-forge-rocm/webui/modules/processing.py", line 956, in process_images_inner
    samples_ddim = p.sample(conditioning=p.c, unconditional_conditioning=p.uc, seeds=p.seeds, subseeds=p.subseeds, subseed_strength=p.subseed_strength, prompts=p.prompts)
  File "/home/mathieu/Documents/Local/AI/ai-suite-rocm/stablediffusion-forge-rocm/webui/modules/processing.py", line 1327, in sample
    samples = self.sampler.sample(self, x, conditioning, unconditional_conditioning, image_conditioning=self.txt2img_image_conditioning(x))
  File "/home/mathieu/Documents/Local/AI/ai-suite-rocm/stablediffusion-forge-rocm/webui/modules/sd_samplers_kdiffusion.py", line 234, in sample
    samples = self.launch_sampling(steps, lambda: self.func(self.model_wrap_cfg, x, extra_args=self.sampler_extra_args, disable=False, callback=self.callback_state, **extra_params_kwargs))
  File "/home/mathieu/Documents/Local/AI/ai-suite-rocm/stablediffusion-forge-rocm/webui/modules/sd_samplers_common.py", line 272, in launch_sampling
    return func()
  File "/home/mathieu/Documents/Local/AI/ai-suite-rocm/stablediffusion-forge-rocm/webui/modules/sd_samplers_kdiffusion.py", line 234, in <lambda>
    samples = self.launch_sampling(steps, lambda: self.func(self.model_wrap_cfg, x, extra_args=self.sampler_extra_args, disable=False, callback=self.callback_state, **extra_params_kwargs))
  File "/home/mathieu/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/mathieu/Documents/Local/AI/ai-suite-rocm/stablediffusion-forge-rocm/webui/k_diffusion/sampling.py", line 594, in sample_dpmpp_2m
    denoised = model(x, sigmas[i] * s_in, **extra_args)
  File "/home/mathieu/Documents/Local/AI/ai-suite-rocm/stablediffusion-forge-rocm/webui/modules/sd_samplers_cfg_denoiser.py", line 186, in forward
    denoised, cond_pred, uncond_pred = sampling_function(self, denoiser_params=denoiser_params, cond_scale=cond_scale, cond_composition=cond_composition)
  File "/home/mathieu/Documents/Local/AI/ai-suite-rocm/stablediffusion-forge-rocm/webui/backend/sampling/sampling_function.py", line 339, in sampling_function
    denoised, cond_pred, uncond_pred = sampling_function_inner(model, x, timestep, uncond, cond, cond_scale, model_options, seed, return_full=True)
  File "/home/mathieu/Documents/Local/AI/ai-suite-rocm/stablediffusion-forge-rocm/webui/backend/sampling/sampling_function.py", line 284, in sampling_function_inner
    cond_pred, uncond_pred = calc_cond_uncond_batch(model, cond, uncond_, x, timestep, model_options)
  File "/home/mathieu/Documents/Local/AI/ai-suite-rocm/stablediffusion-forge-rocm/webui/backend/sampling/sampling_function.py", line 254, in calc_cond_uncond_batch
    output = model.apply_model(input_x, timestep_, **c).chunk(batch_chunks)
  File "/home/mathieu/Documents/Local/AI/ai-suite-rocm/stablediffusion-forge-rocm/webui/backend/modules/k_model.py", line 45, in apply_model
    model_output = self.diffusion_model(xc, t, context=context, control=control, transformer_options=transformer_options, **extra_conds).float()
  File "/home/mathieu/Documents/Local/AI/ai-suite-rocm/stablediffusion-forge-rocm/webui/backend/nn/flux.py", line 402, in forward
    out = self.inner_forward(img, img_ids, context, txt_ids, timestep, y, guidance)
  File "/home/mathieu/Documents/Local/AI/ai-suite-rocm/stablediffusion-forge-rocm/webui/backend/nn/flux.py", line 359, in inner_forward
    img = self.img_in(img)
  File "/home/mathieu/Documents/Local/AI/ai-suite-rocm/stablediffusion-forge-rocm/webui/backend/operations.py", line 112, in forward
    return torch.nn.functional.linear(x, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (4032x64 and 1x98304)
mat1 and mat2 shapes cannot be multiplied (4032x64 and 1x98304)

I'm on Fedora 40 with an RX 6800 XT, PyTorch 2.2.0 with ROCm 6.1. SD and SDXL work perfectly, but I cannot get the Flux NF4 model to work.

M4TH1EU commented 2 months ago

Someone else also seems to have this issue on Intel GPUs: https://github.com/lllyasviel/stable-diffusion-webui-forge/discussions/981#discussioncomment-10352096

derfasthirnlosenick commented 2 months ago

My hunch: NF4 requires bitsandbytes, which for now is NVIDIA-specific, whereas FP8 is relatively simple and runs out of the box. The good news is that bitsandbytes is working on a multi-platform refactor, i.e. AMD support might be coming soon-ish (https://github.com/ROCm/bitsandbytes/tree/rocm_enabled).

edit: got the AMD alpha BnB working, but NF4 still fails because it's expecting CUDA. Might look into it later this week. Not sure if FP8 got a speedup or if that's just perceived/wishful thinking.

M4TH1EU commented 2 months ago

According to the ROCm 6.2.0 release notes, it seems bitsandbytes is now officially supported: https://rocm.docs.amd.com/en/latest/about/release-notes.html#memory-savings-for-bitsandbytes-model-quantization

Did you just build BnB from their repo and install it into your SD venv, @derfasthirnlosenick?

derfasthirnlosenick commented 2 months ago

Yeah I did, a few minutes before the release :D Will try the official release when I get the time. Also waiting for the official ROCm 6.2 PyTorch nightly wheel.

M4TH1EU commented 2 months ago

Even with BnB installed I get the same error with the FP8 model. Would you mind sharing some more details about how you did it?

derfasthirnlosenick commented 2 months ago

I didn't get NF4 running, only FP8 (that also works without BnB).

M4TH1EU commented 2 months ago

> I didn't get NF4 running, only FP8 (that also works without BnB).

I cannot make the FP8 model run on my setup. I've downloaded this model from the discussion post.

Running with: [screenshot]

Results in RuntimeError: mat1 and mat2 shapes cannot be multiplied (1x59136 and 768x768)

derfasthirnlosenick commented 2 months ago

Managed to break my venv by trying again :-D Got it back working without bitsandbytes; will try again later and report back.

M4TH1EU commented 2 months ago

So after tinkering I managed to get it to work, I think.

I built BnB from source using my homemade script here. I've also included a pre-built wheel of BnB, built with ROCm 6.1.2, in case it helps someone.

Then I installed bitsandbytes into my Stable Diffusion venv and the NF4 model seems to run using bnb_nf4; the install step is sketched below. [screenshot]
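
For anyone following along, the install step is roughly this (a minimal sketch; the venv path and wheel filename are placeholders for whatever your setup produces):

# activate the webui's virtual environment, then install the wheel built from source
source venv/bin/activate
pip install /path/to/bitsandbytes-*.whl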

I am getting abysmal speeds though, 42 s/it on an RX 6800 XT, so something might not be working as expected. edit: somehow it has now dropped to 11 s/it, still not great...

derfasthirnlosenick commented 2 months ago

Don't have proper access to my desktop right now so I can't test, but in case you're feeling adventurous, there's a ROCm version of xformers as well: https://github.com/ROCm/xformers/. That one apparently doesn't work well even on gfx1100 (7000 series), so I'm pretty sure it won't work on gfx1030. But hey, maybe it does :D Also, ROCm 6.2 PyTorch nightly wheels should be out (soon) as per https://github.com/pytorch/pytorch/pull/133238, which should play nicely with the BnB (see the notes at the end of the release notes: https://rocm.docs.amd.com/en/latest/about/release-notes.html).

derfasthirnlosenick commented 2 months ago

OK, I can confirm NF4 working on my 6800 XT with the BnB alpha I compiled from source myself. Also running ROCm 6.2 with the newest PyTorch nightly (hint: only torch; torchvision is still on 6.1 for some reason). I'm getting about 12-13 s/it (though I'm throttling the GPU a bit because of thermals).

edit: for completeness' sake, FP8 is giving me about 13 s/it, so NF4 is marginally faster.

edit2: and for clarification: I compiled the bitsandbytes repo, not the ROCm fork. If you own a 6xxx-series card, make sure to adjust the GPU arch flag to gfx1030 (or whatever is appropriate for yours).

M4TH1EU commented 2 months ago

I guess we can close this then. The fix is compiling bitsandbytes manually until they merge the multi-backend-refactor branch into main.

Instructions below:

# Tested on Ubuntu 22.04
# Recommended to run inside a Docker image to avoid messing with the local system

pip3 install --upgrade pip wheel setuptools
# pick the nightly index matching your ROCm version (rocm6.1, rocm6.2, ...)
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.1

# ROCm bitsandbytes build requirements
apt-get install -y hipblas hipblaslt hiprand hipsparse hipcub rocthrust-dev

## Clone the repo and install Python requirements
git clone --depth 1 -b multi-backend-refactor https://github.com/bitsandbytes-foundation/bitsandbytes.git
cd bitsandbytes
pip3 install -r requirements-dev.txt

## Build
cmake -DCOMPUTE_BACKEND=hip -DBNB_ROCM_ARCH=gfx1030 -S .  # adapt gfx1030 to your GPU arch
make
python3.10 setup.py bdist_wheel --universal

## Install the resulting wheel into your Stable Diffusion venv
pip3 install dist/bitsandbytes-*.whl

Easy-to-use Docker image ready here. Pre-built wheel of BnB for ROCm (as of 20/08/2024, Python 3.10) here.
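
To check the install is picked up before launching the webui, something like this should work (a quick sketch; bitsandbytes ships a self-test entry point, though its output varies between versions):

# print the installed version, then run the library's built-in environment diagnostic
python3 -c "import bitsandbytes; print(bitsandbytes.__version__)"
python3 -m bitsandbytes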

ronidee commented 2 months ago

Can anyone tell me which model exactly is meant by the FP8 version? Is there an official version? When I search for it I only find this one: https://huggingface.co/Kijai/flux-fp8.

Thanks in advance!

derfasthirnlosenick commented 2 months ago

> Can anyone tell me which model exactly is meant by the FP8 version? Is there an official version? When I search for it I only find this one: https://huggingface.co/Kijai/flux-fp8. Thanks in advance!

Not official, but by lllyasviel, so...

https://huggingface.co/lllyasviel/flux1_dev/blob/main/flux1-dev-fp8.safetensors

ronidee commented 2 months ago

As a noob to PyTorch, ROCm, and bitsandbytes I gotta ask: does/will it work with ROCm 6.2 as well? I'm running Ubuntu 24 with 6.2, with an RX 7800 XT (gfx1101, overridden to 11.0.0/gfx1100).

I installed the pre-built wheel @M4TH1EU provided (thank you!) into the venv of my working Stable Diffusion WebUI Forge install. I renamed [...]libbitsandbytes_rocm61.so to [...]libbitsandbytes_rocm62.so (roughly the rename sketched below), because it looks for the rocm62 suffix since I'm on 6.2. However, when trying to generate an image, right after the tqdm progress bar appears, I receive HIP error: invalid device function. Is this because my RX 7800 XT isn't officially supported or because I'm using the wrong bitsandbytes version?
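
(For reference, what I mean by the rename; the site-packages path is illustrative for a Python 3.10 venv:)

cd venv/lib/python3.10/site-packages/bitsandbytes
mv libbitsandbytes_rocm61.so libbitsandbytes_rocm62.so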

I know this issue is closed, but since you have a similar card and very similar problem, I'd try my luck here. Thank you very much in advance!

PS: part of the longer error trace when using AMD_LOG_LEVEL=3:

hipLaunchKernel: Returned hipErrorInvalidDeviceFunction : 
:3:hip_error.cpp            :44  : 11197833479 us: [pid:38817 tid:0x79f0017a8740]  hipPeekAtLastError (  ) 
:3:hip_error.cpp            :46  : 11197833481 us: [pid:38817 tid:0x79f0017a8740] hipPeekAtLastError: Returned hipErrorInvalidDeviceFunction : 
Error invalid device function at line 96 in file /tmp/bitsandbytes/csrc/ops.hip

derfasthirnlosenick commented 2 months ago

It works with 6.2, but you need to compile it for your device. @M4TH1EU's wheel is for Python 3.10 and gfx1030 (6xxx series), whereas you need to compile for gfx1100.

Takes a few minutes; just adjust the above code to your setup, e.g. as in the sketch below. (Had the same issue here because I messed up and compiled for the wrong arch; it went away after recompiling with the right target.)
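
A sketch of the adjusted build step, assuming a gfx1100 target; everything else in the instructions above stays the same:

cmake -DCOMPUTE_BACKEND=hip -DBNB_ROCM_ARCH=gfx1100 -S .  # gfx1100 for RX 7000-series cards
make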

ronidee commented 2 months ago

Thank you for your blazingly fast responses :p I've already adjusted the Dockerfile and will report back so that others with the same problem can potentially benefit from it.

Update: it actually worked flawlessly! I tried to build it before without the Docker image and just got error after error. Using the provided scripts (build and extract) worked. Flux NF4 runs now, thank you! :) In case anyone is wondering what they need to do (a sketch of the adjusted lines follows the list):

  1. Clone this: https://github.com/M4TH1EU/ai-suite-rocm-local/tree/main.
  2. Go to the bitsandbytes-rocm-build folder and open the file Dockerfile.
  3. Replace line 1, FROM rocm/dev-ubuntu-22.04:6.1.2, e.g. with FROM rocm/dev-ubuntu-22.04:6.2, depending on your ROCm version (use apt show rocm to check it; also see the available tags).
  4. In lines 3 and 4, replace gfx1030 and rocm6.1 with your versions, e.g. gfx1100 and rocm6.2. Check your gfx version with rocminfo | grep gfx and your torch build with pip show torch.
  5. To avoid confusion, replace :6.1.2 in build.sh and extract_build.sh with your version, e.g. 6.2 as in my case.
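
For reference, a hypothetical sketch of what the edited Dockerfile head could look like for my card; the real file in bitsandbytes-rocm-build may word lines 3 and 4 differently, so treat the variable names here as placeholders:

FROM rocm/dev-ubuntu-22.04:6.2     # line 1: match your ROCm version
ARG GPU_ARCH=gfx1100               # line 3: your GPU arch (rocminfo | grep gfx)
ARG ROCM_VERSION=rocm6.2           # line 4: match your torch build (pip show torch)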

Thank you very much @M4TH1EU for doing all the work :+1:

PS: funny name ;-)