nktice / AMD-AI

AMD (Radeon GPU) ROCm based setup for popular AI tools on Ubuntu 22.04 / 23.04 / 23.10 / 24.04

Start gen image stuck Kubuntu 24.04 7900XTX (solved) #4

Open tomasmark79 opened 1 month ago

tomasmark79 commented 1 month ago

Hi, first I want to say a BIG thanks for this awesome list of commands, which saved me some time. Thank you.

I am facing an issue with my PC setup.

I followed all the commands, everything installed, and the web server started as expected. Unfortunately, when I start generating a picture from text, the GPU power draw goes up in NVTop, but after 1-2 seconds the GPU fails.

Do you know what I could try to avoid this fatal issue?

Thank you

tomasmark79 commented 1 month ago

Fixed with:

Start-Date: 2024-05-27  13:41:39
Commandline: apt install linux-image-generic-hwe-24.04
Requested-By: tomas (1000)
Install: linux-image-generic-hwe-24.04:amd64 (6.8.0-31.31)
End-Date: 2024-05-27  13:41:40

tomasmark79 commented 1 month ago

Update:

The kernel from my previous comment was not the culprit after all. To be able to use txt2img:

  1. I first have to use img2img once with any picture.
  2. Then I can switch to txt2img and generate images.

Crazy. But it works.

[Attached image: Awesome image]

nktice commented 1 month ago

I will note that the spike you're seeing in NVTop is GPU use. If you want to see other GPU info, it's in the settings. For example, I turn off the process list so it's just the chart, and then on the chart I turn on temperature and fan speed so I can see those. So from your screenshot, it does look like it's 'working', at least for a while. I get bumps like that when I'm rendering something, so that looks normal.

As for your issue: is this problem with Stable Diffusion? I find it's sometimes quirky and needs reloading, but it generally works.

tomasmark79 commented 1 month ago

I already did that. I use the desktop as a server and connect to Stable Diffusion over HTTP in the browser, or via XRDP or SSH. I hope this saves as much GPU RAM as possible.

Despite all these efforts, I have to be very careful with the first render after starting the web server, because in two out of five attempts the computer resets.

I am a newbie and need more time to learn the background of this topic.

My original plan was to find out how to get TTS working with my setup.

As a side effect, I got an awesome Stable Diffusion AI :-).

My goal is a TTS solution that I can train on my voice for the Czech language. We have no free TTS with a nice voice yet.

Thank you.

nktice commented 1 month ago

Have you considered raising SD's log level to make it more verbose? For example, call it with the following command when you start it up:

./webui.sh --log-level INFO 

Watching the console may give you some hints about what causes the crashes. The ecosystem has evolved a lot, so lots of stuff breaks.

Also, have you tried different versions of Stable Diffusion?

tomasmark79 commented 1 month ago

Launching Web UI with arguments: --listen --log-level INFO --api

launch.py: error: unrecognized arguments: --log-level INFO

This parameter is not recognized.

nktice commented 1 month ago

My apologies... I double-checked the options, and there was a typo...

./webui.sh --loglevel INFO 

You can also see all of the parameters with the following ...

./webui.sh --help 

That may be useful in case INFO is too much output and you'd like to try one of the other verbosity settings for the log level.
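If the machine resets or the terminal scrolls away, the console output is gone, so it can help to keep a copy on disk. A minimal sketch, assuming a POSIX shell; `run_logged` and the file name `webui.log` are hypothetical names chosen for illustration and are not part of the web UI:

```shell
# run_logged LOGFILE COMMAND [ARGS...]
# Runs COMMAND, shows its output on the console, and appends a copy
# (stdout and stderr) to LOGFILE.
run_logged() {
    log_file=$1
    shift
    "$@" 2>&1 | tee -a "$log_file"
}

# Example (flags taken from this thread):
#   run_logged webui.log ./webui.sh --listen --api --loglevel INFO
```

After a crash, `tail -n 50 webui.log` shows the last lines that reached the file. On a hard reset the final lines may not have been written to disk yet, so treat the file as a best-effort record.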

tomasmark79 commented 1 month ago

Ah, ok, thank you. I have found the culprit:

To create a public link, set `share=True` in `launch()`.
Startup time: 6.6s (prepare environment: 2.0s, import torch: 2.2s, import gradio: 0.5s, setup paths: 0.7s, other imports: 0.3s, load scripts: 0.1s, create ui: 0.4s, add APIs: 0.2s).
Applying attention optimization: Doggettx... done.
Model loaded in 3.3s (load weights from disk: 0.2s, create model: 0.6s, apply weights to model: 2.1s, calculate empty prompt: 0.3s).
2024-05-30 22:03:26 INFO [httpx] HTTP Request: POST http://localhost:7860/api/predict "HTTP/1.1 200 OK"
2024-05-30 22:03:26 INFO [httpx] HTTP Request: POST http://localhost:7860/api/predict "HTTP/1.1 200 OK"
2024-05-30 22:03:26 INFO [httpx] HTTP Request: POST http://localhost:7860/reset "HTTP/1.1 200 OK"
2024-05-30 22:03:26 INFO [httpx] HTTP Request: POST http://localhost:7860/reset "HTTP/1.1 200 OK"
2024-05-30 22:03:26 INFO [httpx] HTTP Request: POST http://localhost:7860/api/predict "HTTP/1.1 200 OK"
2024-05-30 22:03:26 INFO [httpx] HTTP Request: POST http://localhost:7860/api/predict "HTTP/1.1 200 OK"
2024-05-30 22:03:26 INFO [httpx] HTTP Request: POST http://localhost:7860/api/predict "HTTP/1.1 200 OK"
2024-05-30 22:03:26 INFO [httpx] HTTP Request: POST http://localhost:7860/reset "HTTP/1.1 200 OK"
2024-05-30 22:03:26 INFO [httpx] HTTP Request: POST http://localhost:7860/reset "HTTP/1.1 200 OK"
2024-05-30 22:03:26 INFO [httpx] HTTP Request: POST http://localhost:7860/reset "HTTP/1.1 200 OK"
2024-05-30 22:03:36 INFO [modules.shared_state] Starting job task(gjvrs77zv4a94s6)
 65%|██████████████████████████████████████████████████████████████▍                                 | 13/20 [00:00<00:00, 17.65it/s]

Memory access fault by GPU node-1 (Agent handle: 0x194c680) on address 0x761647200000. Reason: Page not present or supervisor privilege.

./webui.sh: line 292:  3844 Aborted (SIGABRT)        (core dumped) "${python_cmd}" -u "${LAUNCH_SCRIPT}" "$@"
tomas@pc-kortex:~/stable-diffusion-webui$ 

This error:

Memory access fault by GPU node-1 (Agent handle: 0x194c680) on address 0x761647200000. Reason: Page not present or supervisor privilege.

happens when I have not reloaded any checkpoint model.

Here is some more information: https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/8139
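To check how often this fault shows up across runs, the matching lines can be pulled out of a saved console log. A minimal sketch, assuming the console output was saved to a file; the function name `gpu_faults` and the file name `webui.log` are hypothetical:

```shell
# gpu_faults LOGFILE
# Prints every line of LOGFILE that reports a GPU memory access fault,
# prefixed with its line number. Returns non-zero if none are found.
gpu_faults() {
    grep -n -F 'Memory access fault by GPU' "$1"
}

# Example:
#   gpu_faults webui.log
```

If persistent journaling is enabled, the kernel log from the previous boot (`journalctl -k -b -1`) may also contain amdgpu messages from around the time of a reset.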

nktice commented 4 weeks ago

Does the same happen when running different versions of the SD software? For example, I've included these commands in the current guide,

git checkout bef51ae
git reset --hard

as newer versions don't use the same API, which breaks TGW... Similarly, there is a commit code for every version of SD's software.
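To confirm which version is actually checked out after switching, the commit can be printed. A minimal sketch, assuming `git` is installed and the web UI directory is a git checkout; the function name `pin_commit` is hypothetical, and `bef51ae` is the commit from the commands above:

```shell
# pin_commit REPO_DIR COMMIT
# Checks out COMMIT in REPO_DIR (detached HEAD) and prints the short
# hash that is now checked out, so the result can be verified.
pin_commit() {
    git -C "$1" checkout --quiet "$2" || return 1
    git -C "$1" rev-parse --short HEAD
}

# Example:
#   pin_commit ~/stable-diffusion-webui bef51ae
```

Returning to the newest code later would be a matter of checking out the branch again, for example `git checkout master` followed by `git pull`, assuming the default branch is `master`.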

Occasionally I get a crash when loading Stable Diffusion. From what I can tell, it happens when it tries to update components it depends on... so I run it again, and then it works until the next update. Note that SD, while it has versions of its own, has lots of sub-tools, some of which it maintains and updates on its own, and that may cause issues. So while one might think it is static software, it is dynamic.