invoke-ai / InvokeAI

InvokeAI is a leading creative engine for Stable Diffusion models, empowering professionals, artists, and enthusiasts to generate and create visual media using the latest AI-driven technologies. The solution offers an industry-leading WebUI, supports terminal use through a CLI, and serves as the foundation for multiple commercial products.
https://invoke-ai.github.io/InvokeAI/
Apache License 2.0

[bug]: Segmentation fault on image generation start (AMD) #3967

Open redhelling21 opened 1 year ago

redhelling21 commented 1 year ago

Is there an existing issue for this?

OS

Linux

GPU

AMD

VRAM

8GB

What version did you experience this issue on?

3.0.0

What happened?

I tried both the automated installer and a manual installation. No matter what I try, when I click the "Invoke" button in the web GUI, I get a segmentation fault:

$ invokeai --web
[2023-07-24 23:32:06,280]::[InvokeAI]::INFO --> Patchmatch initialized
/home/hellong/.venv/lib/python3.10/site-packages/torchvision/transforms/functional_tensor.py:5: UserWarning: The torchvision.transforms.functional_tensor module is deprecated in 0.15 and will be removed in 0.17. Please don't rely on it. You probably just need to use APIs in torchvision.transforms.functional or in torchvision.transforms.v2.functional.
  warnings.warn(
INFO: Started server process [18287]
INFO: Waiting for application startup.
[2023-07-24 23:32:06,661]::[InvokeAI]::INFO --> InvokeAI version 3.0.0
[2023-07-24 23:32:06,661]::[InvokeAI]::INFO --> Root directory = /home/hellong/invokeai
[2023-07-24 23:32:06,662]::[InvokeAI]::INFO --> GPU device = cuda AMD Radeon RX 6700 XT
[2023-07-24 23:32:06,664]::[InvokeAI]::INFO --> Scanning /home/hellong/invokeai/models for new models
[2023-07-24 23:32:06,857]::[InvokeAI]::INFO --> Scanned 5 files and directories, imported 0 models
[2023-07-24 23:32:06,859]::[InvokeAI]::INFO --> Model manager service initialized
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:9090 (Press CTRL+C to quit)
INFO: 127.0.0.1:35052 - "GET /socket.io/?EIO=4&transport=polling&t=Oc9qHwH HTTP/1.1" 200 OK
INFO: 127.0.0.1:35052 - "POST /socket.io/?EIO=4&transport=polling&t=Oc9qHwJ&sid=ZXwRuIab-6GgOo1cAAAA HTTP/1.1" 200 OK
INFO: 127.0.0.1:35052 - "GET /socket.io/?EIO=4&transport=polling&t=Oc9qHwK&sid=ZXwRuIab-6GgOo1cAAAA HTTP/1.1" 200 OK
INFO: ('127.0.0.1', 35066) - "WebSocket /socket.io/?EIO=4&transport=websocket&sid=ZXwRuIab-6GgOo1cAAAA" [accepted]
INFO: connection open
INFO: 127.0.0.1:35052 - "GET /socket.io/?EIO=4&transport=polling&t=Oc9qHwM&sid=ZXwRuIab-6GgOo1cAAAA HTTP/1.1" 200 OK
INFO: 127.0.0.1:35052 - "POST /socket.io/?EIO=4&transport=polling&t=Oc9qHwW&sid=ZXwRuIab-6GgOo1cAAAA HTTP/1.1" 200 OK
INFO: 127.0.0.1:35052 - "GET /socket.io/?EIO=4&transport=polling&t=Oc9qHx3&sid=ZXwRuIab-6GgOo1cAAAA HTTP/1.1" 200 OK
INFO: 127.0.0.1:35052 - "GET /socket.io/?EIO=4&transport=polling&t=Oc9qHx5&sid=ZXwRuIab-6GgOo1cAAAA HTTP/1.1" 200 OK
INFO: 127.0.0.1:35052 - "POST /api/v1/sessions/ HTTP/1.1" 200 OK
INFO: 127.0.0.1:35052 - "PUT /api/v1/sessions/50d99cec-2fc6-4e59-9219-f7e9d0dbf159/invoke?all=true HTTP/1.1" 202 Accepted
[2023-07-24 23:32:13,517]::[InvokeAI]::INFO --> Loading model /home/hellong/invokeai/models/sd-1/main/stable-diffusion-v1-5, type sd-1:main:tokenizer
[2023-07-24 23:32:13,747]::[InvokeAI]::INFO --> Loading model /home/hellong/invokeai/models/sd-1/main/stable-diffusion-v1-5, type sd-1:main:text_encoder
Segmentation fault (core dumped)

Screenshots

No response

Additional context

Using ROCm 5.4.2, as recommended by the official PyTorch website. GPU: AMD Radeon 6700 XT

Contact Details

No response

puresick commented 1 year ago

Same happening to me with an AMD Radeon 5500 XT with 8GB of VRAM.

Something similar also happened to me pre-3.0, but that issue has been closed since the open issues were reset with the 3.0 release: https://github.com/invoke-ai/InvokeAI/issues/2894#issuecomment-1594544291

tokenwizard commented 1 year ago

I'm also having this issue. When you click the Invoke button, about 5-10 seconds later the console shows the Seg Fault.

Freshly installed using the install script on Linux and using the Analog-Diffusion model.

System Specs are below.

Here is potentially relevant dmesg output:

[Wed Jul 26 08:23:30 2023] invokeai-web[1009479]: segfault at 20 ip 00007f4e27ab40a7 sp 00007f4b5dff9290 error 4 in libamdhip64.so[7f4e27a00000+3f3000] likely on CPU 12 (core 4, socket 0)
[Wed Jul 26 08:23:30 2023] Code: 8d 15 5d 6d 25 00 48 8d 3d f6 6c 25 00 be 32 00 00 00 e8 dc ed 1f 00 e8 c7 ed 1f 00 48 8b 45 b8 48 8b 50 28 4c 8b 24 da 31 c0 <41> 80 7c 24 20 00 74 11 48 8d 65 d8 5b 41 5c 41 5d 41 5e 41 5f 5d
[Wed Jul 26 08:59:06 2023] invokeai-web[1012831]: segfault at 20 ip 00007fbf1fcb40a7 sp 00007fbc55ff9290 error 4 in libamdhip64.so[7fbf1fc00000+3f3000] likely on CPU 9 (core 1, socket 0)
[Wed Jul 26 08:59:06 2023] Code: 8d 15 5d 6d 25 00 48 8d 3d f6 6c 25 00 be 32 00 00 00 e8 dc ed 1f 00 e8 c7 ed 1f 00 48 8b 45 b8 48 8b 50 28 4c 8b 24 da 31 c0 <41> 80 7c 24 20 00 74 11 48 8d 65 d8 5b 41 5c 41 5d 41 5e 41 5f 5d

arvenig commented 1 year ago

I appear to have been experiencing this issue too: Linux, Radeon 6900 XT. A hopefully relevant detail: I was able to work around it by using torch 1.13.1+rocm5.2 and the corresponding torchvision 0.14.1+rocm5.2 that I still had from my working Invoke 2.3.5 install. After replacing torch 2.0 and torchvision with those older versions, Invoke 3.0 now seems to work as expected for me.
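A sketch of that downgrade inside an existing InvokeAI venv (the version pins are the ones quoted above; the --extra-index-url is PyTorch's standard ROCm wheel index and is my assumption, not something stated in this thread):

```shell
# Swap the torch 2.0 wheels for the last rocm5.2 builds (run inside the InvokeAI venv).
pip uninstall -y torch torchvision
pip install torch==1.13.1+rocm5.2 torchvision==0.14.1+rocm5.2 \
    --extra-index-url https://download.pytorch.org/whl/rocm5.2
```

After reinstalling, pip show torch should report 1.13.1+rocm5.2 before you relaunch invokeai-web.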

Alex9001 commented 1 year ago

I have the same problem.

OS: Artix Linux x86_64
GPU: AMD ATI Radeon RX 6600/6600 XT/6600M
CPU: AMD Ryzen 7 5800H

arvenig commented 1 year ago

Was experiencing this issue on my Ryzen 7950X / Radeon 6900XT desktop system running Arch Linux. I seem to have worked around it by disabling the 7950X's iGPU in BIOS. The GPU device reported by invokeai-web at startup both with and without the iGPU enabled is 'cuda AMD Radeon RX 6900 XT', but for whatever reason having the iGPU enabled seems to have been causing an issue. This issue has been present for me in all versions of invoke since the update to torch 2.0. Tested on a fresh InvokeAI 3.0.1post3 install.

Godd67 commented 1 year ago

Yep, same issue for me - the 2.3 version worked perfectly, 3.0.1post3 (fresh install) fails with a segfault. RX 6600, Ubuntu 22.04, ROCm 5.6

[2023-08-05 19:00:28,561]::[uvicorn.access]::INFO --> 127.0.0.1:35828 - "PUT /api/v1/sessions/f3756076-a290-4c92-af83-28ccd8e881d4/invoke?all=true HTTP/1.1" 202
[2023-08-05 19:00:28,575]::[InvokeAI]::INFO --> Loading model /media/olegus/Extra/InvokeAi/models/sd-1/main/stable-diffusion-v1-5, type sd-1:main:tokenizer
[2023-08-05 19:00:28,873]::[InvokeAI]::INFO --> Loading model /media/olegus/Extra/InvokeAi/models/sd-1/main/stable-diffusion-v1-5, type sd-1:main:text_encoder
./invoke.sh: line 51: 99206 Segmentation fault (core dumped) invokeai-web $PARAMS

Millu commented 1 year ago

Hey! Another person had similar issues with torch, and a fix seems to be setting up the Python environment with a lower torch version (similar to what @arvenig said!):

https://github.com/invoke-ai/InvokeAI/issues/4041#issuecomment-1654738252

Godd67 commented 1 year ago

Hey! Another person had similar issues with torch, and a fix seems to be setting up the Python environment with a lower torch version (similar to what @arvenig said!):

#4041 (comment)

Can someone explain in simple words how to achieve this? BTW, I use Python 3.10, as was suggested for the previous InvokeAI version.

YabbaYabbaYabba commented 1 year ago

I have the same issue - invoke.sh: line 51: 8792 Segmentation fault (core dumped) invokeai-web $PARAMS

Jeremi360 commented 1 year ago

I have the same issue - ./invoke.sh: line 51: 4167 Segmentation fault (core dumped) invokeai-web $PARAMS

Godd67 commented 1 year ago

Made it work with ROCm 5.4.2, an RX 6600, and kernel 5.19. I followed this guide - https://phazertech.com/tutorials/rocm.html - starting from the Other Requirements section; I already had ROCm installed, so I can't comment on that part. It seems the only difference from my previous attempts was this: sudo apt install nvidia-cuda-toolkit

YabbaYabbaYabba commented 1 year ago

Thank you!

archer31 commented 11 months ago

Unfortunately none of the posted solutions work to resolve the segfault. What I have tried:

ROCm version: 5.4.3
GPU: Radeon RX 7900 XTX
InvokeAI version: 3.2.0 (same also happens in 3.3.0RC1)

Edit: This appears to be an issue with ROCm support for the 7000 series of AMD GPUs. Not sure why these are still unsupported 9 months after they came out. Guess I'll just return this card and get an NVIDIA GPU :(.

adeliktas commented 11 months ago

I just installed InvokeAI 3.3.0 in a Python 3.11 venv with ROCm for an AMD 6600 XT and encountered the same issue when pressing the "Invoke" button on the web UI.

segfault at 20 ip 00007fd2142b40a7 sp 00007fcecfe91470 error 4 in libamdhip64.so[7fd214200000+3f3000]

pytorch-triton-rocm 2.0.2
torch 2.0.1+rocm5.4.2
torchvision 0.15.2+rocm5.4.2

.../InvokeAI/.venv/lib/python3.11/site-packages/triton/third_party/rocm/lib/libamdhip64.so
.../InvokeAI/.venv/lib/python3.11/site-packages/torch/lib/libamdhip64.so

gdb last traces


[#6] 0x7fffad3c93e4 → hipLaunchKernel()
[#7] 0x7fffaf7b3a3b → at::native::index_select_out_cuda(at::Tensor const&, long, at::Tensor const&, at::Tensor&)::{lambda()#2}::operator()() const()
[#8] 0x7fffaf791d5a → at::native::index_select_out_cuda(at::Tensor const&, long, at::Tensor const&, at::Tensor&)()
[#9] 0x7fffaf7c947b → at::native::index_select_cuda(at::Tensor const&, long, at::Tensor const&)()

takov751 commented 10 months ago

Unfortunately none of the posted solutions work to resolve the segfault. What I have tried:

  • Downgrading torch and torchvision

    • This just results in the gpu not being detected anymore
  • Upgrading torch and torchvision

    • Same as above
  • Applying HSA_OVERRIDE_GFX_VERSION=10.3.0 to my profile

    • No appreciable changes

ROCM version 5.4.3

GPU: Radeon RX 7900 XTX

InvokeAI version: 3.2.0 (same also happens in 3.3.0RC1)

Edit: This appears to be an issue with ROCm support for the 7000 series of AMD GPUs. Not sure why these are still unsupported 9 months after they came out. Guess I'll just return this card and get an NVIDIA GPU :(.

adeliktas commented 10 months ago

Unfortunately none of the posted solutions work to resolve the segfault. What I have tried:

  • Downgrading torch and torchvision

    • This just results in the gpu not being detected anymore
  • Upgrading torch and torchvision

    • Same as above
  • Applying HSA_OVERRIDE_GFX_VERSION=10.3.0 to my profile

    • No appreciable changes

ROCM version 5.4.3
GPU: Radeon RX 7900 XTX
InvokeAI version: 3.2.0 (same also happens in 3.3.0RC1)
Edit: This appears to be an issue with ROCM support for the 7000 series of AMD GPUs. not sure why these are still unsupported 9 months after they came out. guess ill just return this card and get an nvidia gpu :(.

* In your case it should be HSA_OVERRIDE_GFX_VERSION=11.0.0

Setting the gfx override made invokeai run for my 6600 XT, but generating the image bugs out and returns an invalid image. https://github.com/invoke-ai/InvokeAI/issues/4278 https://github.com/invoke-ai/InvokeAI/issues/4211

CUDA_VERSION=gfx1030 HSA_OVERRIDE_GFX_VERSION=10.3.0 invokeai-web

https://gist.github.com/adeliktas/669812e64fd356afc4648ba847c61133
torch version = 2.0.1+rocm5.4.2
cuda available = True
cuda version = None
device count = 1
cudart = <module 'torch._C._cudart'>
device = 0
capability = (10, 3)
name = AMD Radeon RX 6600 XT

hchasens commented 6 months ago

I'm seeing this with my 7900 XTX.

hchasens commented 6 months ago

So I figured it out. When using ROCm it tries to select your first GPU which is your integrated graphics. There's not enough VRAM so you get a segmentation fault. There's an environment variable you can use to disable the visibility of the iGPU.

export HIP_VISIBLE_DEVICES="0"

I found the best place to put it is in invokeai.sh right after the start of the venv.

. .venv/bin/activate

export INVOKEAI_ROOT="$scriptdir"
PARAMS=$@

export HIP_VISIBLE_DEVICES="0"

# Check to see if dialog is installed (it seems to be fairly standard, but good to check regardless) and if the user has passed the --no-tui argument to disable the dialog TUI

This fixed my issue. I've found other programs that have the same issue: Autogen and Text-gen-webui both have the same problem and solution.

Hope this has helped! It's a lot easier than phazertech's guide imo.
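To illustrate why the placement in the launch script matters, here is a minimal sketch (the helper function is made up for illustration; the key fact, from the fix above, is that the mask must be in the environment before torch/HIP initializes):

```python
import os

# HIP_VISIBLE_DEVICES must be set before torch initializes the HIP runtime,
# which is why exporting it in invoke.sh before invokeai-web starts works.
# Setting it here, before the first `import torch`, has the same effect.
os.environ["HIP_VISIBLE_DEVICES"] = "0"  # expose only device 0, hiding the iGPU

def visible_hip_devices():
    """Hypothetical helper: parse the mask the HIP runtime would see."""
    mask = os.environ.get("HIP_VISIBLE_DEVICES")
    if mask is None:
        return None  # no mask set -> all devices visible
    return [d for d in mask.split(",") if d]

print(visible_hip_devices())  # ['0'] -> only the discrete GPU is enumerated
```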

Alex9001 commented 6 months ago

So I figured it out. When using ROCm it tries to select your first GPU which is your integrated graphics. There's not enough VRAM so you get a segmentation fault. There's an environment variable you can use to disable the visibility of the iGPU.

export HIP_VISIBLE_DEVICES="0"

I found the best place to put it is in invokeai.sh right after the start of the venv.

. .venv/bin/activate

export INVOKEAI_ROOT="$scriptdir"
PARAMS=$@

export HIP_VISIBLE_DEVICES="0"

# Check to see if dialog is installed (it seems to be fairly standard, but good to check regardless) and if the user has passed the --no-tui argument to disable the dialog TUI

This fixed my issue. I've found other programs that have the same issue: Autogen and Text-gen-webui both have the same problem and solution.

Hope this has helped! It's a lot easier than phazertech's guide imo.

Very based.

adeliktas commented 6 months ago

After almost half a year, I decided to give it another try, and I was able to find my issue after writing this. I tried working with different env vars like HIP_VISIBLE_DEVICES="0" and ran two test scripts:

https://gist.github.com/adeliktas/669812e64fd356afc4648ba847c61133 https://gist.github.com/damico/484f7b0a148a0c5f707054cf9c0a0533

torch version = 2.2.1+rocm5.7
cuda available = True
cuda version = None
device count = 1
cudart = <module 'torch._C._cudart'>
device = 0
capability = (10, 3)
name = AMD Radeon RX 6600 XT
...
Everything fine! You can run PyTorch code inside of: 
--->  AMD Ryzen 9 3950X 16-Core Processor  
--->  gfx1032

I printed all env vars with the env command and surprisingly found that HSA_OVERRIDE_GFX_VERSION wasn't listed, even though echo $HSA_OVERRIDE_GFX_VERSION prints 10.3.0. That's because I had set it universally with set -U HSA_OVERRIDE_GFX_VERSION 10.3.0 in fish, which doesn't export it to bash and only shares it within fish. A simple export HSA_OVERRIDE_GFX_VERSION=10.3.0 solved that.

PWD=/home/adeliktas/ai/invokeai_projects/InvokeAI
HSA_OVERRIDE_GFX_VERSION=10.3.0
INVOKEAI_ROOT=/home/adeliktas/ai/invokeai_projects/InvokeAI
HIP_VISIBLE_DEVICES=0
VIRTUAL_ENV_PROMPT=(InvokeAI)
_OLD_FISH_PROMPT_OVERRIDE=/home/adeliktas/ai/invokeai_projects/InvokeAI/.venv
VIRTUAL_ENV=/home/adeliktas/ai/invokeai_projects/InvokeAI/.venv
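The fish-vs-POSIX export difference described above can be demonstrated in a couple of lines (a sketch; 10.3.0 is the value used in this thread):

```shell
# fish's `set -U VAR value` records a universal variable but does NOT export
# it to child processes, so invokeai-web never sees it. A POSIX `export`
# (or `set -Ux` in fish) puts it in the environment that children inherit.
export HSA_OVERRIDE_GFX_VERSION=10.3.0
sh -c 'echo "child process sees: $HSA_OVERRIDE_GFX_VERSION"'
```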
upstream InvokeAI version 4.0.0rc2 faa1ffb06fd4974c43be14a2119a1aab12b63038

Developer-42 commented 5 months ago

So I figured it out. When using ROCm it tries to select your first GPU which is your integrated graphics. There's not enough VRAM so you get a segmentation fault. There's an environment variable you can use to disable the visibility of the iGPU.

export HIP_VISIBLE_DEVICES="0"

I found the best place to put it is in invokeai.sh right after the start of the venv.

. .venv/bin/activate

export INVOKEAI_ROOT="$scriptdir"
PARAMS=$@

export HIP_VISIBLE_DEVICES="0"

# Check to see if dialog is installed (it seems to be fairly standard, but good to check regardless) and if the user has passed the --no-tui argument to disable the dialog TUI

This fixed my issue. I've found other programs that have the same issue: Autogen and Text-gen-webui both have the same problem and solution.

Hope this has helped! It's a lot easier than phazertech's guide imo.

Sadly, this doesn't work for me with my AMD Radeon RX 7800 XT. Also, the file name is invoke.sh, not invokeai.sh.

takov751 commented 5 months ago

So I figured it out. When using ROCm it tries to select your first GPU which is your integrated graphics. There's not enough VRAM so you get a segmentation fault. There's an environment variable you can use to disable the visibility of the iGPU. export HIP_VISIBLE_DEVICES="0" I found the best place to put it is in invokeai.sh right after the start of the venv.

. .venv/bin/activate

export INVOKEAI_ROOT="$scriptdir"
PARAMS=$@

export HIP_VISIBLE_DEVICES="0"

# Check to see if dialog is installed (it seems to be fairly standard, but good to check regardless) and if the user has passed the --no-tui argument to disable the dialog TUI

This fixed my issue. I've found other programs that have the same issue: Autogen and Text-gen-webui both have the same problem and solution. Hope this has helped! It's a lot easier than phazertech's guide imo.

Sadly, this doesn't work for me with my AMD Radeon RX 7800 XT. Also, the file name is invoke.sh, not invokeai.sh.

Have you specified HSA_OVERRIDE_GFX_VERSION=11.0.0, since your GPU is a 7XXX-series card?
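For reference, the override values reported in this thread map to cards roughly as follows (a sketch assembled only from the comments above; the gfx targets in the comments are my assumption, not an official AMD table):

```python
# HSA_OVERRIDE_GFX_VERSION values reported working in this thread, keyed by
# card: RDNA2 cards override to 10.3.0 (gfx1030 target), RDNA3 to 11.0.0
# (gfx1100 target).
THREAD_REPORTED_OVERRIDES = {
    "AMD Radeon RX 6600 XT": "10.3.0",
    "AMD Radeon RX 7800 XT": "11.0.0",
    "AMD Radeon RX 7900 XTX": "11.0.0",
}

def suggested_override(gpu_name):
    """Look up the override this thread suggests for a given card name."""
    return THREAD_REPORTED_OVERRIDES.get(gpu_name)

print(suggested_override("AMD Radeon RX 7800 XT"))  # 11.0.0
```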

Alex9001 commented 5 months ago

I finally got around to trying export HIP_VISIBLE_DEVICES="0" ... and nothing happened. Just as before,

::[uvicorn.access]::INFO --> 127.0.0.1:38998 - "GET /api/v1/queue/default/list HTTP/1.1" 200 ./invoke.sh: line 56: 9079 Segmentation fault invokeai-web $PARAMS

hchasens commented 5 months ago

@Alex9001 This error message makes me think it might not be a ROCm issue. Nevertheless, it might be worth double-checking that your ROCm HIP runtime is up to date. I'm assuming the ROCm runtime is in your /opt/rocm/ folder? It might be worth checking that, along with your package manager, to see if there are any updates. Use some of the tools AMD ships with the runtime to make sure it's communicating with your hardware properly (maybe using rocminfo or the like). If your GPU is supported, you should see it listed.
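A sketch of that sanity check (the path assumes the default /opt/rocm install prefix):

```shell
ROCM_BIN=/opt/rocm/bin
# If rocminfo exists, list the agents and their gfx targets; a supported GPU
# should show up alongside the CPU. If the tool is missing, the ROCm runtime
# itself isn't installed and InvokeAI can't use the card.
if [ -x "$ROCM_BIN/rocminfo" ]; then
    "$ROCM_BIN/rocminfo" | grep -E 'Marketing Name|gfx'
else
    echo "rocminfo not found under $ROCM_BIN - install or repair ROCm first"
fi
```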

Serpentian commented 1 month ago

Placing export HSA_OVERRIDE_GFX_VERSION=11.0.0 right after venv activation in invoke.sh fixed the issue with my AMD Radeon RX 7800 XT. Here's the source: https://github.com/invoke-ai/InvokeAI/issues/4211#issuecomment-1886423884