virchau13 / automatic1111-webui-nix

AUTOMATIC1111/stable-diffusion-webui for CUDA and ROCm on NixOS
MIT License

Segfault with AMD / ROCm on NixOS stable 23.05 in stable-diffusion-webui, probably due to conflicting Mesa build #9

Open icodeforyou-dot-net opened 10 months ago

icodeforyou-dot-net commented 10 months ago

After cloning https://github.com/AUTOMATIC1111/stable-diffusion-webui I tried to run it with ROCm using `nix develop .#rocm`. However, the webui crashes with a segmentation fault as soon as it gets up and running.

In the shell I get the error `DRI driver not from this Mesa build ('23.0.3' vs '23.1.9')`, which indicates that two incompatible Mesa builds are present. My main system is running the stable branch of NixOS, currently 23.05, so that is where Mesa build 23.0.3 would come from. Any ideas why 23.1.9 is there? Presumably because the flake pulls in a newer nixpkgs?

Any ideas how to get around this? Help is appreciated. :smile:
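One way to confirm where each Mesa version comes from before touching anything (a sketch assuming flakes are enabled; both are standard Nix CLI commands, run from the checkout containing the flake):

```shell
# Mesa version from the system's nixpkgs channel (stable 23.05 -> 23.0.3):
nix eval --raw nixpkgs#mesa.version; echo
# Show which nixpkgs revision the flake is pinned to; its Mesa will be the
# newer 23.1.9 that conflicts with the system's DRI drivers:
nix flake metadata .
```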

Full output here:

```
################################################################
Launching launch.py...
################################################################
ldconfig: Can't open cache file /nix/store/gqghjch4p1s69sv4mcjksb2kb65rwqjy-glibc-2.38-23/etc/ld.so.cache
: No such file or directory
Cannot locate TCMalloc (improves CPU memory usage)
Python 3.10.13 (main, Aug 24 2023, 12:59:26) [GCC 12.3.0]
Version: v1.6.0
Commit hash: 5ef669de080814067961f28357256e8fe27544f4
Launching Web UI with arguments: 
no module 'xformers'. Processing without...
no module 'xformers'. Processing without...
No module 'xformers'. Proceeding without it.
Calculating sha256 for /home/ap/Coding/python/stable-diffusion-webui/models/Stable-diffusion/v1-5-pruned-emaonly.safetensors: Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Startup time: 6.7s (prepare environment: 2.1s, import torch: 2.1s, import gradio: 0.6s, setup paths: 0.5s, other imports: 0.4s, load scripts: 0.4s, create ui: 0.4s, gradio launch: 0.1s).
Opening in existing browser session.
DRI driver not from this Mesa build ('23.0.3' vs '23.1.9')
failed to bind extensions
DRI driver not from this Mesa build ('23.0.3' vs '23.1.9')
failed to bind extensions
DRI driver not from this Mesa build ('23.0.3' vs '23.1.9')
failed to bind extensions
DRI driver not from this Mesa build ('23.0.3' vs '23.1.9')
failed to bind extensions
6ce0161689b3853acaa03779ec93eafe75a02f4ced659bee03f50797806fa2fa
Loading weights [6ce0161689] from /home/ap/Coding/python/stable-diffusion-webui/models/Stable-diffusion/v1-5-pruned-emaonly.safetensors
Creating model from config: /home/ap/Coding/python/stable-diffusion-webui/configs/v1-inference.yaml
./webui.sh: line 255: 144988 Segmentation fault      (core dumped) "${python_cmd}" -u "${LAUNCH_SCRIPT}" "$@"
```
nonetrix commented 7 months ago

Same here. I'm also struggling to get it to work in Distrobox, which seems to be broken because NixOS is not FHS-compliant, despite the fact that Distrobox is in the NixOS repos.

rastarr commented 5 months ago

Sadly, same here. Is there any fix for this, please?

icodeforyou-dot-net commented 5 months ago

@rastarr my fix is called OCI containers.

I got this to work fairly reliably: https://github.com/ai-dock/stable-diffusion-webui

You just have to disable `sd_dreambooth_extension` and `bitsandbytes` in the provisioning script and pass the appropriate environment variable via docker-compose to tell ROCm which GPU arch you are running; that worked for me.
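The comment doesn't name the variable; the usual ROCm knob for this is `HSA_OVERRIDE_GFX_VERSION`, though whether the ai-dock image expects exactly that one is an assumption. A sketch:

```shell
# Hypothetical example: force the ROCm runtime to treat the GPU as gfx1030
# (RDNA2). Substitute the gfx version matching your own card; older cards
# like the RX 580 (gfx803) also need an older ROCm, as discussed below.
HSA_OVERRIDE_GFX_VERSION=10.3.0 docker compose up -d
```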

rastarr commented 5 months ago

I did some digging and found this -

Using ROCm version 5.5.0 fixed the segfault for me (RX 580):

1. Re-initialize your venv
2. Enter this: `TORCH_COMMAND='pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.5.0'`
3. Run `python3 launch.py --precision full --no-half --opt-sub-quad-attention --lowvram --disable-nan-check --skip-torch-cuda-test`

In this case `webui.sh` does not need to be touched.
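The steps above can be sketched as a shell sequence run from the stable-diffusion-webui checkout (`TORCH_COMMAND` is read from the environment by webui's launcher; the flags are verbatim from the comment, and the venv activation step is my addition, implicit in the original):

```shell
# 1. Re-initialize the venv so the old, segfaulting torch build is gone:
rm -rf venv && python3 -m venv venv
. venv/bin/activate
# 2. Point the launcher's torch install at the ROCm 5.5 wheels:
export TORCH_COMMAND='pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.5.0'
# 3. Launch directly via launch.py, bypassing webui.sh:
python3 launch.py --precision full --no-half --opt-sub-quad-attention --lowvram --disable-nan-check --skip-torch-cuda-test
```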

How do we go about changing the flake to use rocm5.5.0? Any help would be greatly appreciated.

icodeforyou-dot-net commented 5 months ago

To be fair, I'd rather not use an old version of ROCm in the flake itself. Generally something newer works fine for me (5.7 is the most recent I am running right now, but I will test 6.0 soon), so that should work fine here as well. However, if you want to make the change locally, I'd suggest you find out which checkout of nixpkgs gives you ROCm 5.5 and use that. A big caveat: depending on how you do it, you might have to recompile everything, which won't be fun.
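If you'd rather not edit flake.nix at all, the pin can also be applied per invocation (a sketch; `<rev>` stands for whichever nixpkgs commit still carries ROCm 5.5, which you would have to look up yourself):

```shell
# --override-input swaps the flake's pinned nixpkgs for the given revision.
# As noted above, expect long local rebuilds if the binary cache has nothing
# prebuilt for that combination.
nix develop .#rocm --override-input nixpkgs github:NixOS/nixpkgs/<rev>
```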

rastarr commented 5 months ago

Well, I actually changed the runtime in the flake to `rocmPackages_5.rocm-runtime` to lock it to 5.7, then entered the shell and ran `export TORCH_COMMAND='pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.5.0'`.

Now I'm running my 8 GB RX 580 with `./webui.sh --precision full --no-half --opt-sub-quad-attention --lowvram --disable-nan-check --skip-torch-cuda-test --share` and everything is working. So that's great!