This is likely an old ROCm issue. Even if your system ROCm is 6.0.3, PyTorch bundles its own compiler, which is what builds exllama's C extensions, and the current webui still specifies prebuilt Torch wheels for ROCm 5.6.
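You can check which ROCm runtime your installed Torch wheel actually carries, independent of the system install:

```bash
# torch.version.hip reports the ROCm/HIP version the wheel was built
# against; it is None on CUDA or CPU-only builds
python -c "import torch; print(torch.__version__, torch.version.hip)"
```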
I found the XTX is way more stable if you use the PyTorch 2.3.0 release candidate with ROCm 6.0, then recompile exllamav2 to match.
```bash
# Currently 2.3.0 RC-final
pip install -U torch torchvision torchaudio --index-url https://download.pytorch.org/whl/test/rocm6.0
# v0.0.19 commit hash
pip install -U git+https://github.com/turboderp/exllamav2.git@ad8691c6d1aab2d1ddbdcbe9341c7c7a96e59f2f
```
You can also save the compiled exllamav2 to a wheel with `pip wheel` instead of `pip install`, so you can reuse it.
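For example, a minimal sketch (the `./wheels` output directory name is arbitrary):

```bash
# Build the wheel once and cache it in ./wheels
pip wheel -w ./wheels git+https://github.com/turboderp/exllamav2.git@ad8691c6d1aab2d1ddbdcbe9341c7c7a96e59f2f
# Reinstall later without recompiling
pip install ./wheels/exllamav2-*.whl
```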
If you're using the native Arch Linux Python/ROCm, my own wheel might work for you: exllamav2-0.0.19-cp311-cp311-linux_x86_64.zip
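A wheel is itself a zip archive, so if the attachment is simply a renamed wheel (an assumption; GitHub doesn't accept `.whl` uploads, so check the archive's contents first), installing it would look like:

```bash
# Assumption: the .zip is the wheel itself, renamed so GitHub accepts
# the upload; restore the .whl extension and install
mv exllamav2-0.0.19-cp311-cp311-linux_x86_64.zip exllamav2-0.0.19-cp311-cp311-linux_x86_64.whl
pip install exllamav2-0.0.19-cp311-cp311-linux_x86_64.whl
```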
ROCm 5.6, and especially 5.7, are extremely unstable on my XTX: frequent unrecoverable page faults, including the exact error you see. If the ROCm 6 exllamav2 method works, I'd also recompile llama-cpp-python and any other compiled extensions to match, for a rock-solid experience.
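For llama-cpp-python, a rebuild against the system ROCm might look like this (a sketch: `LLAMA_HIPBLAS` was the hipBLAS build flag in releases of this era; newer versions renamed it `GGML_HIPBLAS`):

```bash
# Force a from-source rebuild of llama-cpp-python with hipBLAS enabled
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install --no-cache-dir --force-reinstall llama-cpp-python
```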
Many thanks!
Upgrading PyTorch and recompiling exllamav2 solved the problem.
(However, it will take time to see whether it's stable or not.)
### Describe the bug
When I tried to chat with my LLM through the OpenAI API, it ran into a core dump:

```
Memory access fault by GPU node-1 (Agent handle: 0x64e3ca9e2010) on address 0x75e681099000. Reason: Page not present or supervisor privilege.
```
Update: the chat also crashes in tg-webui itself.
### Is there an existing issue for this?
### Reproduction
Run `python server.py --api --api-port 11451`, load a model, and chat through the API.
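A request like the following triggers the crash (a sketch: host, port, and message are placeholders matching the command above; tg-webui's OpenAI extension serves an OpenAI-compatible `/v1/chat/completions` endpoint):

```bash
# Any chat completion request against the loaded model reproduces it
curl http://127.0.0.1:11451/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'
```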
### Screenshot
The API server crashed with the terminal message below; the webui crashed with similar terminal output.
### Logs
The first crash, using the API server:

The second crash, using tg-webui:
### System Info