vladmandic / automatic

SD.Next: Advanced Implementation of Stable Diffusion and other Diffusion-based generative image models
https://github.com/vladmandic/automatic
GNU Affero General Public License v3.0

[Issue]: Optimize for AMD with ROCm #217

Closed Soulreaver90 closed 1 year ago

Soulreaver90 commented 1 year ago

Issue Description

Decided to try this out but can't get far. I was able to run webui.sh, however it tried to install torch 2.0+cu118 even though I have an AMD card and should have gotten the ROCm build instead. Even after all that, it got hung on "Setting environment tuning". I closed it, installed the ROCm torch packages, and reran the launcher. It hangs at "Setting environment tuning" for minutes, and then it still shows torch+cu118 and says CUDA is not available.

    18:36:31-613009 INFO Python 3.10.6 on Linux
    18:36:31-622754 INFO No changes detected: quick launch active
    18:36:31-623268 INFO Setting environment tuning
    18:40:36-283387 INFO Torch 2.0.0+cu118
    18:40:36-284238 WARNING Torch reports CUDA not available
    18:40:36-284768 INFO Server arguments: ['--no-half-vae', '--skip-requirements', '--skip-extensions', '--no-half']
    Available models: /home/blah/automatic/models/Stable-diffusion 0
    Download the default model? (y/N)
    Loading theme: black-orange
    Running on local URL: http://127.0.0.1:7861

Version Platform Description

Ubuntu 22.04

Soulreaver90 commented 1 year ago

I ended up replacing the torch command in the setup.py file with "torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.4.2" instead of the default CUDA install. That works, but it still appends --no-half to my arguments. Why, and how can I remove it?

EDIT: Nevermind, it did not work. Tried to generate an image, got 6.6s/it. Not sure if it's because of --no-half or it's reverting to using my CPU. I get 6-8it/s on Auto1111. EDIT2: I realized that even though I installed torch rocm5.4.2, torch+cu118 was still installed and reported when checking torch.version. I completely removed torch entirely and reinstalled from scratch. 19:43:18-138823 INFO Torch 2.0.0+rocm5.4.2
19:43:20-887382 INFO Torch backend: AMD ROCm HIP 5.4.22803-474e8620
19:43:20-888597 INFO Torch detected GPU: AMD Radeon RX 6700 XT VRAM 12272
Arch (10, 3) Cores 20
19:43:20-889260 INFO Server arguments: []

However, tried to generate an image and got a lengthy error:

    Progress 4.44it/s ━━━━━━━━━━━━ 100% 0:00:00 0:00:02
    gradio call: NotImplementedError
    Traceback (most recent call last):
      /home/blah/automatic/modules/call_queue.py:61 in f
        res = list(func(*args, **kwargs))
      /home/blah/automatic/modules/call_queue.py:39 in f
        res = func(*args, **kwargs)
      ... 17 frames hidden ...
      /home/blah/automatic/venv/lib/python3.10/site-packages/xformers/ops/fmha/dispatch.py:98 in _dispatch_fw
        return _run_priority_list("memory_efficient_attention_forward", priority_list_ops, inp)
      /home/blah/automatic/venv/lib/python3.10/site-packages/xformers/ops/fmha/dispatch.py:73 in _run_priority_list
        raise NotImplementedError(msg)
    NotImplementedError: No operator found for memory_efficient_attention_forward with inputs:
        query : shape=(1, 4096, 1, 512) (torch.float16)
        key : shape=(1, 4096, 1, 512) (torch.float16)
        value : shape=(1, 4096, 1, 512) (torch.float16)
        attn_bias : <class 'NoneType'>
        p : 0.0
    cutlassF is not supported because:
        xFormers wasn't build with CUDA support
    flshattF is not supported because:
        xFormers wasn't build with CUDA support
        max(query.shape[-1] != value.shape[-1]) > 128
    tritonflashattF is not supported because:
        xFormers wasn't build with CUDA support
        max(query.shape[-1] != value.shape[-1]) > 128
        requires A100 GPU
    smallkF is not supported because:
        xFormers wasn't build with CUDA support
        dtype=torch.float16 (supported: {torch.float32})
        max(query.shape[-1] != value.shape[-1]) > 32
        unsupported embed per head: 512

Soulreaver90 commented 1 year ago

EDIT3: Because my previous post was getting long. Webui.sh did not just install the wrong torch version; it also installed xformers, which was causing the previous issue. I uninstalled xformers and could FINALLY generate an image. Clearly this was heavily optimized with Nvidia in mind, but we need some AMD love :( lol. Anyway, I hope my pain helps with optimizing the AMD workflow. Let me know if you need me to test things out.

Update: Any time I launch webui.sh, it installs xformers and gives me an error about it. When I try to generate an image, it gives me the memory error shown in the previous post. Not a blocker; I can manually launch launch.py with my args via a separate .sh file.
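For context, many forks guard the xformers import so a missing or broken build degrades gracefully instead of erroring later. A minimal sketch (hypothetical, not SD.Next code; note the thread shows xformers can import successfully yet still fail at dispatch time, so an import guard alone is not sufficient):

```python
# Minimal sketch of an import guard for xformers. If the wheel is missing or
# broken (e.g. built without ROCm support), fall back to another attention
# implementation instead of failing at generation time.
try:
    import xformers.ops  # noqa: F401
    HAVE_XFORMERS = True
except Exception:
    HAVE_XFORMERS = False

# Callers can then choose the attention backend accordingly:
attention_backend = 'xformers' if HAVE_XFORMERS else 'doggettx'
```

A stronger check would actually probe `memory_efficient_attention` with a tiny tensor, since the error above was raised at dispatch time, not import time.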

vladmandic commented 1 year ago

Since I don't have an AMD system available, I've asked several times for the community to provide best practices and steps - and I'm more than willing to integrate them into the core workflow. But I cannot do that alone.

Soulreaver90 commented 1 year ago

Since I don't have an AMD system available, I've asked several times for the community to provide best practices and steps - and I'm more than willing to integrate them into the core workflow. But I cannot do that alone.

Unfortunately I’m not a developer who can assist with debugging, but I can take a look. Right now the two issues are as follows:

- Installs cu118 even though AMD is clearly identified (I did check that code in setup.py)
- Installs xformers and defaults to using them regardless of which optimizer is chosen in settings

I have to recheck, but I had changed all instances of the torch command, yet it still installs and defaults the torch install to cu118 when I run the sh script, including the install of xformers. Even if I completely uninstall and remove both torch and xformers, it will reinstall and default to using xformers when I run that script. If I manually launch launch.py, I get the no-xformers message and it proceeds with no issues. Otherwise, everything else works flawlessly.

vladmandic commented 1 year ago

i've just modified installer so you can override torch and xformers using environment variables. by default, installer will try to install:

  • torch torchaudio torchvision --index-url https://download.pytorch.org/whl/cu118
  • xformers==0.0.17

but now, you can uninstall them and install whatever packages you want using:

  • export TORCH_COMMAND="torch torchaudio torchvision --index-url https://download.pytorch.org/whl/cu118"
  • export XFORMERS_PACKAGE="xformers==0.0.17"

and if you set it to none or no or anything like that, it will not try to install them at all - so whatever you installed (or uninstalled) will remain as such. for example:

  • uninstall xformers if you already have them: pip uninstall xformers
  • export TORCH_COMMAND="torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.4.2"
  • export XFORMERS_PACKAGE=none

let me know if this works?
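A minimal sketch of how an installer can honor these override variables, assuming the names and defaults described above (`resolve_packages` is a hypothetical helper, not the actual setup.py function):

```python
import os

# Hedged sketch: read TORCH_COMMAND / XFORMERS_PACKAGE from the environment,
# fall back to the CUDA defaults, and treat "none"/"no" as "do not touch
# whatever is currently installed or uninstalled".
def resolve_packages(env=os.environ):
    torch_command = env.get(
        'TORCH_COMMAND',
        'torch torchaudio torchvision --index-url https://download.pytorch.org/whl/cu118')
    xformers_package = env.get('XFORMERS_PACKAGE', 'xformers==0.0.17')
    skip = ('none', 'no')
    if torch_command.strip().lower() in skip:
        torch_command = None  # leave the existing install/uninstall as-is
    if xformers_package.strip().lower() in skip:
        xformers_package = None
    return torch_command, xformers_package
```

With `XFORMERS_PACKAGE=none` exported, the second value comes back `None` and the installer would simply skip the package.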

mudapanda2 commented 1 year ago

OP, it didn't hang on the environment tuning step. There's no visual indicator of what's going on, like a progress bar, and there's an issue causing pip downloads to take ages. You can verify that it's still functioning by using Procmon and watching python go nuts with WriteFile. Vlad should add a visual aid for the impatient.

This isn't the same wheel but this is what you would see if the window had a visual indicator.

https://user-images.githubusercontent.com/94585670/233135916-02f7ce5a-dc46-4bdf-a9c8-e8de0cfa69e3.mp4

The timer there doing the funky chicken: it's a Windows 11 issue. Still figuring it out, but python pulls down the data and spits it into a tmp file as well as a pip-unpack folder; meanwhile, Windows 11 writes an entry for every single thing happening there, in little bits, into the search index db, causing I/O horseshit. There are 169,742 entries written in Windows-Gather.db from this on my end right now, lol. SystemIndex_Gthr is still loading 20 minutes later; Windows.db is now 1.1 GB.

Procmon will be an absolute blur of reads and writes, registry checks for namespaces, values added to the database for each little part of what pip is doing, etc.

Turn off search indexing.

vladmandic commented 1 year ago

Vlad should add a visual aid for the impatient

I'll see what I can do. Mostly this affects torch installation, as that is the only package that is huge (2+ GB).
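One possible approach, sketched here as a hypothetical helper (`pip_install_streaming` is not SD.Next code): relay pip's own output line by line so a multi-gigabyte torch download shows activity instead of appearing hung:

```python
import subprocess
import sys

# Hedged sketch of a "visual aid" for long installs: instead of capturing
# pip's output silently, stream it to the console as it arrives.
def pip_install_streaming(package_args):
    cmd = [sys.executable, '-m', 'pip', 'install', *package_args.split()]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    for line in proc.stdout:
        print(line, end='')  # relay pip's own progress output live
    return proc.wait()
```

Pip already prints download progress bars by default, so simply not swallowing its stdout is often enough.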

mudapanda2 commented 1 year ago

"This may take time please be patient, like lots of time, go outside and smell a tree."

"This may take time please be patient, like lots of time, go outside and smell a tree."

Just throw that in there. Windows isn't supposed to be indexing this stuff, so no idea why it's happening; I just checked mine and all areas I've got were marked excluded. But still, maybe it's because the venvs end up being 62,725 files or so, and those AREN'T excluded by default afaik.

Soulreaver90 commented 1 year ago

i've just modified installer so you can override torch and xformers using environment variables. by default, installer will try to install:

  • torch torchaudio torchvision --index-url https://download.pytorch.org/whl/cu118
  • xformers==0.0.17

but now, you can uninstall them and install whatever packages you want using:

  • export TORCH_COMMAND="torch torchaudio torchvision --index-url https://download.pytorch.org/whl/cu118"
  • export XFORMERS_PACKAGE="xformers==0.0.17"

and if you set it to none or no or anything like that, it will not try to install them at all - so whatever you installed (or uninstalled) will remain as such. for example:

  • uninstall xformers if you already have them: pip uninstall xformers
  • export TORCH_COMMAND="torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.4.2"
  • export XFORMERS_PACKAGE=none

let me know if this works?

Okay, first things first. The environment variables work: when I set them and relaunch launch.py, it loads up rocm5.4.2 just fine and there are no xformers messages shown as before. If I uninstall xformers manually and try again, I do get the "no xformers found" message as expected, so it "seems" to be fine; however, I could not test image generation.

Not sure if something in the recent commits broke how optimizations are applied. I will add that as much as I like the idea of env variables, it still isn't beginner friendly. It still requires running webui.sh and letting it install cu117, which is a huge waste of time and resources when it should just install rocm5.4.2 from the start. Maybe have a separate setup.py aimed at AMD while a unified version is figured out and implemented. I think it's easier to tell someone "hey, run webui-amd.sh" than to say "run these export commands in such and such" while they look on with a blank stare, lol.

vladmandic commented 1 year ago

Scaled dot product is probably non-functional with ROCm; I've actually never seen anyone mention using it. Not surprising, since it's newer than any version of ROCm.

Change cross optimization to something less aggressive, like Doggettx.

Soulreaver90 commented 1 year ago

Scaled dot product is probably non-functional with ROCm; I've actually never seen anyone mention using it. Not surprising, since it's newer than any version of ROCm.

Change cross optimization to something less aggressive, like Doggettx.

I would, but that would require being able to access the UI to change those settings, correct? Once sdp fails, it just craps out the entire thing. I can run sdp on the previous day's commit with no issue. Does it do anything? No idea, but it runs and the UI is accessible. With the latest commit, it just fails and ends the entire session.

vladmandic commented 1 year ago

As a workaround, you can edit config.json manually to disable SDP.

And yes on the unified installer: I'll add AMD-specific stuff into it directly once we know exactly what combo works. Like I said, I don't have an AMD system, so I rely on people like you to tell me what packages/settings work best.
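The config.json workaround mentioned above could be scripted along these lines. Note the key name `cross_attention_optimization` is an assumption for illustration, not confirmed in this thread; check your own config.json for the exact field:

```python
import json

# Hedged sketch of the manual workaround: rewrite the cross-attention setting
# in config.json so the UI never tries to activate SDP on startup.
# The key name below is hypothetical.
def set_cross_attention(path, value='Doggettx'):
    with open(path) as f:
        cfg = json.load(f)
    cfg['cross_attention_optimization'] = value  # assumed key name
    with open(path, 'w') as f:
        json.dump(cfg, f, indent=2)
```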

Soulreaver90 commented 1 year ago

As a workaround, you can edit config.json manually to disable SDP.

And yes on unified installer, I'll add AMD specific stuff into it directly once we know exactly what combo works. Like I said, I don't have AMD system, so I rely on ppl like you to tell me what packages/settings work best.

Okay, I'll try it out this afternoon. I'm down to test any AMD-related commits. I noticed a speed issue in my last install, so I'll try testing it out; I think I messed up, but we'll see.

Ph0non commented 1 year ago

I have an AMD system with a 6900XT working more or less fine with auto1111. I can post the versions of ROCm and torch for my system later.

Soulreaver90 commented 1 year ago

I have an AMD system with a 6900XT working more or less fine with auto1111. I can post the versions of ROCm and torch for my system later.

Yeah, my card works fine with Auto, but the install process for it was complete ass. Vlad's installer gets significantly farther than Auto's, although it too fails at installing the correct torch. I'll add I haven't tried Auto's installer in months, so I'm not sure if it has improved. It would be nice if the installers could first detect whether the appropriate ROCm drivers are installed (via amdgpu?); a lot of the newb issues I've encountered are from users who simply fired up Linux/Ubuntu and expected it to work out of the box. I also recall an error that only occurs on 22.04 that requires a certain install; I think I have it in my notes.

vladmandic commented 1 year ago

there is good info in #269 - can you confirm before i start making code changes to support it out-of-the-box?

iDeNoh commented 1 year ago

I've got an RX 6700xt running more or less on here; I can provide any info needed as well. So far I'm running with --medvram and doggettx for optimizations. I also enabled upcast sampling, because why not. I'm getting 5-6it/s at 512x512 at 20 samples, which is at least on par with if not better than base a1111, but with fewer command line args.

vladmandic commented 1 year ago

I've got an RX 6700xt running more or less on here, I can provide any info needed as well. So far I'm running with --medvram and doggetx for optimizations. I also enabled upcast sampling because why not. I'm getting 5-6it/s at 512x512 at 20 samples, which is at least on par if not better than base a1111, but with less command line args

great. command line to install torch is the same?

Soulreaver90 commented 1 year ago

@vladmandic The error I was having yesterday with loading the model is fixed, no sdp issues now. I was briefly having major issues an hour ago but saw you pushed new commits that fixed them.

However, a new issue and a quirk. When I ran webui.sh fresh, it installed torch+cu117. I removed both torch and xformers and installed torch+rocm5.4.2. The UI launched with no issues, but I couldn't generate an image and was presented with the "no operator found for 'memory_efficient…'" error; basically, xformers was somehow reinstalled and was STILL setting itself as the main optimizer despite me selecting everything else in settings. I'll add I did not try the args you introduced the other day. I uninstalled xformers again and it looks good so far; I can generate images. Which leads me to problem #2 ..

When I first got ui working two days ago, I was getting speeds similar to auto1111, around 6.5it/s. But since yesterday and now today, I can’t get anything above 1.3it/s. I’ve tried several settings and combinations. Not sure what’s going on now.

vladmandic commented 1 year ago

@Soulreaver90 env options to disable installing xformers should resolve that issue. for the performance, it's most likely because cuda settings were moved to ui settings as of today, so whatever command line you were using before is being ignored.

Soulreaver90 commented 1 year ago

@Soulreaver90 env options to disable installing xformers should resolve that issue. for the performance, its most likely because of cuda settings being moved to ui settings as of today, so whatever command line you were using before is being ignored.

Okay, I'll check it out again. I blew out the folder and am recloning from scratch. I see it is in fact installing rocm5.2 but it still shows as torch+cu117 for whatever reason.

EDIT: Installed. Removed xformers, installed torch+rocm5.2. Ran launch.py, got the SDP error as yesterday. Reran with my "go-to" args:

    export HSA_OVERRIDE_GFX_VERSION=10.3.0
    export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so
    export PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:32

and the model loaded with no SDP issue. Will try again later to see if it's related to the args or random chance. Maybe it's not applying the HSA override by default? Anyway, speeds are back at 6+, so it's perfect now. I do see the settings you mentioned; that explains why --no-half-vae was giving me an error.

[screenshot: 2023-04-20_17-25]

[screenshot: 2023-04-20_17-27]

iDeNoh commented 1 year ago

@Soulreaver90 env options to disable installing xformers should resolve that issue. for the performance, its most likely because of cuda settings being moved to ui settings as of today, so whatever command line you were using before is being ignored.

Okay I’ll check it out again. I blew out the folder and am recloning from scratch. I see it is infact installing rocm5.2 but still shows as torch+cu117 for whatever reason.

EDIT: Installed. Removed xformers, Installed Torch+rocm5.2. Ran launch.py, got the SDP error as yesterday. Reran with my "go-to" args export HSA_OVERRIDE_GFX_VERSION=10.3.0 export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so export PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:32, model loaded with no SDP issue. Will try again later to see if its related to the args or some random chance. Maybe its not applying the HSA override by default? Anyway, speeds are back at 6+ so its perfect now. I do see the settings you mentioned, that explains why --no-half-vae was giving me an error.

2023-04-20_17-25

2023-04-20_17-27

fwiw this is what I'm using in my script, it seems to work perfectly for me.

    export TORCH_COMMAND="torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.4.2"
    export XFORMERS_PACKAGE=none

Soulreaver90 commented 1 year ago

@Soulreaver90 env options to disable installing xformers should resolve that issue. for the performance, its most likely because of cuda settings being moved to ui settings as of today, so whatever command line you were using before is being ignored.

Okay I’ll check it out again. I blew out the folder and am recloning from scratch. I see it is infact installing rocm5.2 but still shows as torch+cu117 for whatever reason. EDIT: Installed. Removed xformers, Installed Torch+rocm5.2. Ran launch.py, got the SDP error as yesterday. Reran with my "go-to" args export HSA_OVERRIDE_GFX_VERSION=10.3.0 export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so export PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:32, model loaded with no SDP issue. Will try again later to see if its related to the args or some random chance. Maybe its not applying the HSA override by default? Anyway, speeds are back at 6+ so its perfect now. I do see the settings you mentioned, that explains why --no-half-vae was giving me an error. 2023-04-20_17-25 2023-04-20_17-27

fwiw this is what I'm using in my script, it seems to work perfectly for me.

    export TORCH_COMMAND="torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.4.2"
    export XFORMERS_PACKAGE=none

I’m aware and had tested it yesterday with no issues. It’s just not newbie friendly, this is more of a bandaid until the installer can properly install the correct packages by default. At the very least, we confirmed all things work with AMD cards once the install is done properly.

Ph0non commented 1 year ago

there is good info in #269 - can you confirm before i start making code changes to support it out-of-the-box?

installed packages on ubuntu 22.04 lts:

  • rocm-core5.2.5 - version 5.2.5.50205-186
  • rocm-dbgapi - version 0.65.1.50205-186
  • rocm-gdb - version 11.2.50.200-65
  • rocm-hip-runtime5.2.5 - version 5.2.5.50205-186
  • rocm-language-runtime5.2.5 - version 5.2.5.50205-186
  • rocm-llvm5.2.5 - version 14.0.0.22324.50205-186
  • rocm-ocl-icd5.2.5 - version 2.0.0.50205-186
  • rocm-opencl - version 1.2.0-2018111340 (maybe upgrade to version 2)
  • rocm-opencl-dev - version 1.2.0-2018111340 (maybe upgrade to version 2)
  • rocm-opencl-runtime - version 5.2.5.50205-186
  • rocm-opencl5.2.5 - version 2.0.0.50205-186
  • rocminfo5.2.5 - version 1.0.0.50205-186
  • hip-runtime-amd5.2.5 - version 5.2.21153-50205-186

There were some bugs with uninstallable packages (rocm-opencl and rocm-opencl-dev). I believe this was the relevant thread: https://github.com/RadeonOpenCompute/ROCm/issues/1713

vladmandic commented 1 year ago

i've just added this to setup:

    if shutil.which('nvidia-smi') is not None:
        log.info('nVidia toolkit detected')
        torch_command = os.environ.get('TORCH_COMMAND', 'torch torchaudio torchvision --index-url https://download.pytorch.org/whl/cu118')
        xformers_package = os.environ.get('XFORMERS_PACKAGE', 'xformers==0.0.17')
    elif shutil.which('rocm-smi') is not None:
        log.info('AMD toolkit detected')
        torch_command = os.environ.get('TORCH_COMMAND', 'torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.4.2')
        xformers_package = os.environ.get('XFORMERS_PACKAGE', 'none')
    else:
        log.info('Using CPU-only Torch')
        torch_command = os.environ.get('TORCH_COMMAND', 'torch torchaudio torchvision')
        xformers_package = os.environ.get('XFORMERS_PACKAGE', 'none')

if you can test and let me know? if it works, then we can move on to the next stage - what is the ideal cross-optimization for amd? i've heard different things...

iDeNoh commented 1 year ago

i've just added this to setup:

    if shutil.which('nvidia-smi') is not None:
        log.info('nVidia toolkit detected')
        torch_command = os.environ.get('TORCH_COMMAND', 'torch torchaudio torchvision --index-url https://download.pytorch.org/whl/cu118')
        xformers_package = os.environ.get('XFORMERS_PACKAGE', 'xformers==0.0.17')
    elif shutil.which('rocm-smi') is not None:
        log.info('AMD toolkit detected')
        torch_command = os.environ.get('TORCH_COMMAND', 'torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.4.2')
        xformers_package = os.environ.get('XFORMERS_PACKAGE', 'none')
    else:
        log.info('Using CPU-only Torch')
        torch_command = os.environ.get('TORCH_COMMAND', 'torch torchaudio torchvision')
        xformers_package = os.environ.get('XFORMERS_PACKAGE', 'none')

if you can test and let me know? if it works, then we can move on to next stage - what is ideal cross-optimization for amd? i've heard different things...

I did a test last night and, strangely enough, I was getting my best performance with sdp/doggettx. I was under the impression that sdp wasn't supposed to benefit users at all, though.

vladmandic commented 1 year ago

sdp is only available for torch 2.0, so if you have torch 1.13, even if you select it, it will not activate. you'll see in the console log on startup which cross-optimization is activated; it's also shown in the system info tab. also, sdp doesn't benefit users of low-end gpus compared to xformers due to the cpu<->gpu workload split, but if the gpu is semi-decent, sdp is no worse.
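The version gate described here can be expressed as a tiny helper (hypothetical, just illustrating the rule that SDP requires torch 2.0, i.e. `torch.nn.functional.scaled_dot_product_attention` only exists from 2.0 onward):

```python
# Hedged sketch: decide from a torch version string whether SDP
# (scaled dot product attention) can activate at all.
def supports_sdp(torch_version):
    base = torch_version.split('+')[0]  # drop build tag, e.g. "+rocm5.4.2"
    major = int(base.split('.')[0])
    return major >= 2
```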

Soulreaver90 commented 1 year ago

i've just added this to setup:

    if shutil.which('nvidia-smi') is not None:
        log.info('nVidia toolkit detected')
        torch_command = os.environ.get('TORCH_COMMAND', 'torch torchaudio torchvision --index-url https://download.pytorch.org/whl/cu118')
        xformers_package = os.environ.get('XFORMERS_PACKAGE', 'xformers==0.0.17')
    elif shutil.which('rocm-smi') is not None:
        log.info('AMD toolkit detected')
        torch_command = os.environ.get('TORCH_COMMAND', 'torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.4.2')
        xformers_package = os.environ.get('XFORMERS_PACKAGE', 'none')
    else:
        log.info('Using CPU-only Torch')
        torch_command = os.environ.get('TORCH_COMMAND', 'torch torchaudio torchvision')
        xformers_package = os.environ.get('XFORMERS_PACKAGE', 'none')

if you can test and let me know? if it works, then we can move on to next stage - what is ideal cross-optimization for amd? i've heard different things...

It detects AMD and sets torch to the correct path, but torch.version still reports cu117. I am also getting a ton of AssertionErrors at start and a ton of AttributeErrors after install. I couldn't even generate an image; I got a bunch of RuntimeErrors: "LayerNormKernelImpl" not implemented for 'Half'.

When I rerun launch.py manually, it reinstalled torch rocm5.2 and fixed itself. Seems like there is still something in webui.sh that is pushing cu117. This reinstall did fix the RuntimeErrors so I could generate images. However, I am getting the slow 1it/s speed I experienced last night; not sure what causes it to go that slow when it should hit 6it/s. Baby steps.

Edit: I noticed the initial install shows --extra-index-url, while the fixed version drops the "extra".

[screenshot: 2023-04-21_09-32]

[screenshot: 2023-04-21_09-50]

vladmandic commented 1 year ago

yeah, there was leftover code in webui.sh that did that; i've removed it. any installation should be done by setup.py, not the old webui.sh

Soulreaver90 commented 1 year ago

yeah, there was leftover code in webui.sh that did that; i've removed it. any installation should be done by setup.py, not the old webui.sh

Bingo! That did the trick and installed the correct torch! That is great progress. I'm still getting assertion errors at start (AssertionError: Couldn't find Stable Diffusion in any of:), and the terminal crashes at the very end with "Applying scaled dot product cross attention optimization" followed by "Segmentation fault (core dumped)" again. It's on and off with this.

EDIT: When you cleaned up webui.sh, did you check if setup.py or launch.py adds export HSA_OVERRIDE_GFX_VERSION=10.3.0? I ran it and now the sdp error above is fixed. That will be needed for AMD.

Soulreaver90 commented 1 year ago

@vladmandic Following up on the above. You removed the entire GPU prerequisite section for AMD instead of just the torch install portion. Those prerequisites are required for AMD cards. I added just the following back to webui.sh and now everything installs and works out of the box, no scaled dot errors. You can still move this code over to setup.py, but it just needs to live somewhere and run once.

    gpu_info=$(lspci 2>/dev/null | grep VGA)
    case "$gpu_info" in
        *"Navi 1"*|*"Navi 2"*) export HSA_OVERRIDE_GFX_VERSION=10.3.0
        ;;
        *"Renoir"*) export HSA_OVERRIDE_GFX_VERSION=9.0.0
            printf "\n%s\n" "${delimiter}"
            printf "Experimental support for Renoir: make sure to have at least 4GB of VRAM and 10GB of RAM or enable cpu mode: --use-cpu all --no-half"
            printf "\n%s\n" "${delimiter}"
        ;;
        *)
        ;;
    esac

iDeNoh commented 1 year ago

I'm not sure if this is a problem with my system specifically or what, but with the current method of detecting the hardware, my system is defaulting to CPU only. After digging around I found that rocm-smi doesn't seem to be a valid command on my system; changing line 179 to "elif shutil.which('rocminfo') is not None:" does work, though I'm not sure if that's the best way to do it.

After a bit more digging I found that I can access rocm-smi if I use the full path (/opt/rocm/bin/rocm-smi), so I'm not sure what's going on.
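The detection order discussed in this thread can be sketched as follows (hypothetical helper, not the exact setup.py code): prefer nvidia-smi for CUDA wheels, then rocminfo for ROCm wheels (since rocm-smi is often only at /opt/rocm/bin and not on PATH), else fall back to CPU-only torch:

```python
import os
import shutil

# Hedged sketch of GPU toolkit detection for choosing the torch wheel index.
def detect_backend():
    if shutil.which('nvidia-smi'):
        return 'cuda'   # -> https://download.pytorch.org/whl/cu118
    if shutil.which('rocminfo') or os.path.exists('/opt/rocm/bin/rocm-smi'):
        return 'rocm'   # -> https://download.pytorch.org/whl/rocm5.4.2
    return 'cpu'        # plain PyPI wheels
```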

vladmandic commented 1 year ago

ok, i've switched from rocm-smi to rocminfo - that's why i asked the community what's the best and always-present binary.

regarding setup of the env variable:

    gpu_info=$(lspci 2>/dev/null | grep VGA)
    case "$gpu_info" in
        *"Navi 1"*|*"Navi 2"*) export HSA_OVERRIDE_GFX_VERSION=10.3.0 ;;
        *"Renoir"*) export HSA_OVERRIDE_GFX_VERSION=9.0.0 ;;
    esac

running lspci is really bad: it can segfault/fail on some virtualized platforms, especially cloud ones, and it's not going to work as expected unless it's a bare-metal linux install. need to find a better way to determine which HSA_OVERRIDE_GFX_VERSION to set.

since Navi is more common nowadays, perhaps set that as default and for Renoir leave it as documentation note?

again, i need community help for that :)

    elif shutil.which('rocminfo') is not None:
        log.info('AMD toolkit detected')
        os.environ.setdefault('HSA_OVERRIDE_GFX_VERSION', '10.3.0')
        torch_command = os.environ.get('TORCH_COMMAND', 'torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.4.2')
        xformers_package = os.environ.get('XFORMERS_PACKAGE', 'none')

Soulreaver90 commented 1 year ago

I tried rocminfo in the terminal and it shows info for both my AMD GPU and CPU; that might cause false positives for the CPU-only folks.

vladmandic commented 1 year ago

@Soulreaver90 the point is that rocminfo itself will not exist unless you have ROCm system.

Soulreaver90 commented 1 year ago

@Soulreaver90 the point is that rocminfo itself will not exist unless you have ROCm system.

You are right. Had a brain fart moment lol.

iDeNoh commented 1 year ago

Could lshw work? You could do "gpu_info=$(lshw -short | grep display)", however it would throw a comment about sudo every time you launch, and unfortunately it doesn't come pre-installed on all systems afaik.

As an alternative, I've been attempting to get glxinfo to work, but I've yet to find a solution for that option.

vladmandic commented 1 year ago

ok, so all community suggestions on what to do for defaults on ROCm setups have been added, and I haven't seen any further updates on this thread, so I'll close it. if there are any remaining issues or further tuning needed, let's start a new thread, as there is a lot of history here.

MightyPork commented 1 year ago

I'm trying to get this tool working after using Easy Diffusion for a while without problems - using export HSA_OVERRIDE_GFX_VERSION=10.3.0.

By itself it says nVidia CUDA toolkit detected, despite there being no nVidia and no CUDA packages (though I used to have an nVidia card).

Everything ROCm is installed.

I tried to force it using flags; sometimes there are random errors, xformers is removed and other times installed again, but either way it never uses GPU acceleration.

I added rembg and xformers to requirements.txt, thinking that would help. rembg helped; there was a crash when it couldn't be found. xformers probably confused something.

/opt/sdnext (git)-[master] % ./webui.sh --experimental --reinstall --use-rocm
Create and activate python venv
Launching launch.py...
00:41:12-951991 INFO     Running extension preloading                                                                 
00:41:12-956514 INFO     Starting SD.Next                                                                             
00:41:12-957464 INFO     Python 3.11.3 on Linux                                                                       
00:41:12-967243 INFO     Version: 5f2bdba8 Fri Jun 2 12:56:44 2023 -0400                                              
00:41:13-231622 INFO     Setting environment tuning                                                                   
00:41:13-232921 INFO     Forcing reinstall of all packages                                                            
00:41:13-233952 INFO     AMD ROCm toolkit detected                                                                    
00:41:13-234645 INFO     Installing package: torch==2.0.0 torchvision==0.15.1 --index-url                             
                         https://download.pytorch.org/whl/rocm5.4.2                                                   
00:41:14-672970 ERROR    Error running pip: install --upgrade torch==2.0.0 torchvision==0.15.1 --index-url            
                         https://download.pytorch.org/whl/rocm5.4.2                                                   
00:41:15-820805 INFO     Torch 2.0.1+cu118                                                                            
00:41:15-914591 INFO     Installing package: tensorflow==2.12.0                                                       
00:41:19-090670 INFO     Verifying requirements                                                                       
00:41:19-093038 INFO     Installing package: addict                                                                   
00:41:21-546644 INFO     Installing package: aenum                                                                    
00:41:23-982640 INFO     Installing package: aiohttp            
...
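For context, the installer log above suggests the ROCm branch boils down to detecting the toolkit and then choosing a torch wheel index. This is a hedged sketch of that decision, NOT SD.Next's actual installer code; the function name and the `rocminfo`-on-PATH heuristic are assumptions for illustration:

```python
import shutil

# Illustrative sketch, not SD.Next's real code: pick the torch wheel index
# based on whether the ROCm toolkit appears to be installed.
ROCM_INDEX = "https://download.pytorch.org/whl/rocm5.4.2"
CUDA_INDEX = "https://download.pytorch.org/whl/cu118"

def pick_torch_index(have_rocm: bool) -> str:
    """Return the pip index URL matching the detected accelerator toolkit."""
    return ROCM_INDEX if have_rocm else CUDA_INDEX

# Treat "rocminfo is on PATH" as the detection signal (an assumption).
index = pick_torch_index(shutil.which("rocminfo") is not None)
print(f"pip install torch==2.0.0 torchvision==0.15.1 --index-url {index}")
```

The log shows detection succeeding ("AMD ROCm toolkit detected") but the subsequent pip install erroring, which is why the cu118 build stays behind.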

On the next run, some packages get uninstalled:

% ./webui.sh --experimental --use-rocm
Create and activate python venv
Launching launch.py...
00:45:12-052053 INFO     Running extension preloading                                                                 
00:45:12-056847 INFO     Starting SD.Next                                                                             
00:45:12-057842 INFO     Python 3.11.3 on Linux                                                                       
00:45:12-067872 INFO     Version: 5f2bdba8 Fri Jun 2 12:56:44 2023 -0400                                              
00:45:12-342813 INFO     Setting environment tuning                                                                   
00:45:12-346087 INFO     AMD ROCm toolkit detected                                                                    
00:45:13-516102 INFO     Torch 2.0.1+cu118                                                                            
00:45:13-597207 WARNING  Not used, uninstalling: xformers 0.0.20                                                      
00:45:13-598729 INFO     Installing package: un xformers --yes --quiet                                                
00:45:14-307632 INFO     Verifying requirements                                                                       
00:45:14-344259 WARNING  Package wrong version: numpy 1.24.3 required 1.23.5                                          
00:45:14-345265 INFO     Installing package: numpy==1.23.5 

Everything looks happy, but the GPU is not detected.
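Since the log still reports `Torch 2.0.1+cu118`, it can help to check which backend the installed torch build actually targets. A minimal diagnostic sketch; the version-string parsing is an assumption based on PyTorch's wheel naming convention (e.g. "2.0.1+cu118", "2.0.0+rocm5.4.2"):

```python
def torch_backend(version: str) -> str:
    """Classify a torch version string by its local build tag (assumption:
    PyTorch wheels encode the backend after '+', e.g. '+cu118', '+rocm5.4.2')."""
    _, _, local = version.partition("+")
    if local.startswith("rocm"):
        return "rocm"
    if local.startswith("cu"):
        return "cuda"
    return "cpu"

if __name__ == "__main__":
    try:
        import torch
        print(torch.__version__, "->", torch_backend(torch.__version__))
        # On a ROCm build, torch.version.hip is a version string; on a CUDA
        # build it is None.
        print("hip:", getattr(torch.version, "hip", None))
    except ImportError:
        print("torch not installed")
```

If this still reports a `+cu118` build, the earlier comment's approach applies: fully uninstall torch/torchvision in the venv and reinstall from the rocm5.4.2 index URL.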

rocminfo:

ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 5 1600 Six-Core Processor
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 5 1600 Six-Core Processor
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3200                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            12                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    32795612(0x1f46bdc) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    32795612(0x1f46bdc) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    32795612(0x1f46bdc) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx1030                            
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon RX 6400                 
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      1024(0x400) KB                     
    L3:                      16384(0x4000) KB                   
  Chip ID:                 29759(0x743f)                      
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2320                               
  BDFID:                   2560                               
  Internal Node ID:        1                                  
  Compute Unit:            12                                 
  SIMDs per CU:            2                                  
  Shader Engines:          2                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    4177920(0x3fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1030         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***    

Any ideas what else to try?
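One avenue worth checking: `rocminfo` reports the `gfx1030` ISA, and consumer RDNA2 cards are often not on ROCm's officially supported list. A common community workaround is to set `HSA_OVERRIDE_GFX_VERSION` (a real ROCm environment variable) before the HIP runtime initializes. The sketch below derives the dotted override value from an ISA name; the mapping logic is an illustrative assumption, not an official ROCm table, and whether the override is needed for this particular card is also an assumption:

```python
import os
from typing import Optional

def gfx_to_override(isa_name: str) -> Optional[str]:
    """Translate an ISA name like 'gfx1030' into the dotted form used by
    HSA_OVERRIDE_GFX_VERSION (e.g. '10.3.0'). Mapping is an assumption."""
    if not isa_name.startswith("gfx"):
        return None
    digits = isa_name[3:]
    if len(digits) < 3 or not digits.isalnum():
        return None
    # gfx1030 -> major 10, minor 3, step 0 (minor/step are hex digits)
    major, minor, step = digits[:-2], digits[-2], digits[-1]
    return f"{int(major)}.{int(minor, 16)}.{int(step, 16)}"

override = gfx_to_override("gfx1030")
if override:
    # Must be exported before torch/HIP initializes the device,
    # e.g. HSA_OVERRIDE_GFX_VERSION=10.3.0 ./webui.sh --use-rocm
    os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", override)
print(override)
```

In practice this is usually just `export HSA_OVERRIDE_GFX_VERSION=10.3.0` in the shell before launching; the Python form only illustrates where the value comes from.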

vladmandic commented 1 year ago

@MightyPork don't post a new issue on an already closed thread (and one which deals with a different issue to start with) - I cannot help here.

MightyPork commented 1 year ago

Sorry, but I didn't want to create a new issue for what is likely the same problem: the ROCm GPU is not detected/used.