oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

Memory access fault #5890

Orion-zhen closed this issue 6 months ago.

Orion-zhen commented 6 months ago

Describe the bug

When I tried to chat with my LLM through the OpenAI-compatible API, the server crashed with a core dump: Memory access fault by GPU node-1 (Agent handle: 0x64e3ca9e2010) on address 0x75e681099000. Reason: Page not present or supervisor privilege.

Update: chatting in the web UI itself crashes as well.

Reproduction

  1. Start the web UI with a custom API port: python server.py --api --api-port 11451, and load a model.
  2. Point another chat front end, such as Sider or NextChat, at the API server and start chatting (see the example request after this list).
  3. The error occurs.
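
A request like the following should exercise the same code path without an external front end (assuming the standard /v1/chat/completions route exposed by the "openai" extension):

# Minimal chat request against the OpenAI-compatible endpoint
curl http://127.0.0.1:11451/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'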

Screenshot

The API server crashed with the terminal message below:

[screenshot: terminal output]

The web UI crashed as well:

[screenshot: web UI error]

with similar terminal output:

[screenshot: terminal output]

Logs

The first crash, using the API server:

❯ python server.py --api --api-port 11451
12:55:07-606488 INFO     Starting Text generation web UI                                                                                                                                                                           
12:55:07-608075 INFO     Loading the extension "openai"                                                                                                                                                                            
12:55:07-649839 INFO     OpenAI-compatible API URL:                                                                                                                                                                                

                         http://127.0.0.1:11451                                                                                                                                                                                    

Running on local URL:  http://127.0.0.1:7860

12:55:16-379990 INFO     Loading "Qwen-32B"                                                                                                                                                                                        
12:55:16-707760 WARNING  You are running ExLlamaV2 without flash-attention. This will cause the VRAM usage to be a lot higher than it could be.                                                                                    
                         Try installing flash-attention following the instructions here: https://github.com/Dao-AILab/flash-attention#installation-and-features                                                                    
12:55:49-579044 INFO     LOADER: "ExLlamav2"                                                                                                                                                                                       
12:55:49-580220 INFO     TRUNCATION LENGTH: 4096                                                                                                                                                                                   
12:55:49-580571 INFO     INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"                                                                                                                                             
12:55:49-580940 INFO     Loaded the model in 33.20 seconds.                                                                                                                                                                        
Memory access fault by GPU node-1 (Agent handle: 0x64e3ca9e2010) on address 0x75e681099000. Reason: Page not present or supervisor privilege.
[1]    616392 IOT instruction (core dumped)  python server.py --api --api-port 11451

The second crash, using the web UI itself:

❯ python server.py --api --listen --model Qwen-32B
13:14:08-576771 INFO     Starting Text generation web UI                                                                                                                                                                           
13:14:08-578185 WARNING                                                                                                                                                                                                            
                         You are potentially exposing the web UI to the entire internet without any access password.                                                                                                               
                         You can create one with the "--gradio-auth" flag like this:                                                                                                                                               

                         --gradio-auth username:password                                                                                                                                                                           

                         Make sure to replace username:password with your own.                                                                                                                                                     
13:14:08-581204 INFO     Loading "Qwen-32B"                                                                                                                                                                                        
13:14:08-932613 WARNING  You are running ExLlamaV2 without flash-attention. This will cause the VRAM usage to be a lot higher than it could be.                                                                                    
                         Try installing flash-attention following the instructions here: https://github.com/Dao-AILab/flash-attention#installation-and-features                                                                    
13:14:33-197401 INFO     LOADER: "ExLlamav2"                                                                                                                                                                                       
13:14:33-198636 INFO     TRUNCATION LENGTH: 4096                                                                                                                                                                                   
13:14:33-198966 INFO     INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"                                                                                                                                             
13:14:33-199306 INFO     Loaded the model in 24.62 seconds.                                                                                                                                                                        
13:14:33-199649 INFO     Loading the extension "openai"                                                                                                                                                                            
13:14:33-265414 INFO     OpenAI-compatible API URL:                                                                                                                                                                                

                         http://0.0.0.0:5000                                                                                                                                                                                       

Running on local URL:  http://0.0.0.0:7860

Memory access fault by GPU node-1 (Agent handle: 0x6452407b78b0) on address 0x75980082f000. Reason: Page not present or supervisor privilege.
[1]    630061 IOT instruction (core dumped)  python server.py --api --listen --model Qwen-32B

System Info

Beinsezii commented 6 months ago

This is likely an old ROCm issue. Even if your system ROCm is 6.0.3, PyTorch bundles its own compiler, which gets used for exllama's C extensions, and the current webui still pins prebuilt torch wheels built against ROCm 5.6.
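
You can confirm which ROCm your installed torch wheel was actually built against; torch.version.hip reports the bundled HIP version on ROCm builds:

# Prints the torch build plus the HIP/ROCm version it was compiled against
python -c "import torch; print(torch.__version__, torch.version.hip)"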

I found the XTX is much more stable if you use the PyTorch 2.3.0 release candidate with ROCm 6.0 and then recompile exllamav2 to match:

# Currently 2.3.0 RC-final
pip install -U torch torchvision torchaudio --index-url https://download.pytorch.org/whl/test/rocm6.0
# v0.0.19 commit hash
pip install -U git+https://github.com/turboderp/exllamav2.git@ad8691c6d1aab2d1ddbdcbe9341c7c7a96e59f2f

You can also save the compiled exllamav2 to a wheel with pip wheel instead of pip install, so you can reuse it.
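
A sketch of that workflow, using the same commit as above (the ./wheels directory name is arbitrary):

# Build the wheel once...
pip wheel git+https://github.com/turboderp/exllamav2.git@ad8691c6d1aab2d1ddbdcbe9341c7c7a96e59f2f -w ./wheels
# ...then install it from disk whenever you rebuild the environment
pip install ./wheels/exllamav2-*.whl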

If you're using the native Arch Linux python/rocm, my own wheel might work for you: exllamav2-0.0.19-cp311-cp311-linux_x86_64.zip

ROCm 5.6 and especially 5.7 are extremely unstable on my XTX: frequent unrecoverable page faults, including the exact error you're seeing. If the ROCm 6 exllamav2 method works, I'd also recompile llama-cpp-python and any other compiled extensions to match for a rock-solid experience.
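
For llama-cpp-python that would look something like the line below; note the CMake flag name has changed across releases (newer builds use GGML_HIPBLAS), so check the README of the version you're installing:

# Force a from-source rebuild against the system ROCm
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install --no-cache-dir --force-reinstall llama-cpp-python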

Orion-zhen commented 6 months ago

Many thanks!

Upgrading PyTorch and recompiling exllamav2 solved the problem.

(It will take some time to see whether it stays stable, though.)