nomic-ai / gpt4all

GPT4All: Chat with Local LLMs on Any Device
https://gpt4all.io
MIT License

v2.8.0 crashes and disappears when using CUDA (incompatible PTX) #2378

Open · dsjlee opened this issue 1 month ago

dsjlee commented 1 month ago

Bug Report

GPT4All crashes and disappears when using CUDA.

Steps to Reproduce

  1. Go to Application General Settings.
  2. Choose CUDA: [your GPU name] in Device dropdown.
  3. Load model and submit prompt in chat window.

Expected Behavior

A response is generated using the GPU and the response text is shown in the chat window.

Your Environment

Discord discussion shows other users also reporting the crash when using CUDA. I can see the model is loaded into the GPU's VRAM, but GPT4All crashes and disappears nonetheless after I submit a prompt.

PedzacyKapec commented 1 month ago

Same here: Win10, RTX 4060

cebtenzzre commented 1 month ago

Could both of you confirm what model of CPU you have? Not important, actually.

cebtenzzre commented 1 month ago

I was able to reproduce this issue - it's related to OOM. If OOM happens early enough, we handle it correctly:

ggml_backend_cuda_buffer_type_alloc_buffer: allocating 256.00 MiB on device 1: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA1 buffer of size 268435456
llama_kv_cache_init: failed to allocate buffer for kv cache
llama_new_context_with_model: llama_kv_cache_init() failed for self-attention cache
LLAMA ERROR: failed to init context for model /mnt/nobackup/text-ai-models/gpt4all/Meta-Llama-3-8B-Instruct.Q4_0.gguf

But if it happens a little later (smaller margin), we crash:

ggml_backend_sched_alloc_splits: failed to allocate graph, reserving
CUDA error: out of memory
  current device: 1, in function alloc at /home/jared/src/forks/gpt4all/gpt4all-backend/llama.cpp-mainline/ggml-cuda.cu:312
  cuMemCreate(&handle, reserve_size, &prop, 0)
GGML_ASSERT: /home/jared/src/forks/gpt4all/gpt4all-backend/llama.cpp-mainline/ggml-cuda.cu:62: !"CUDA error"
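Roughly what is going on (a minimal standalone sketch, not the actual ggml/llama.cpp source): the KV cache allocation checks the result and reports the failure back up to the caller, while the later graph allocation routes any CUDA error through an assert-style macro that aborts the process - which is why the app just disappears. Compiled with nvcc, something like:

// Minimal sketch (not the actual ggml source) of why one OOM is recoverable
// and the other aborts the process.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Assert-style macro, similar in spirit to GGML_ASSERT: any CUDA error aborts.
#define CUDA_CHECK_ABORT(call)                                             \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); \
            abort(); /* the process "crashes and disappears" */            \
        }                                                                  \
    } while (0)

// Early path (KV cache init): the failure is returned to the caller, which
// prints "failed to allocate buffer for kv cache" and backs out cleanly.
static bool alloc_kv_cache(void ** buf, size_t size) {
    if (cudaMalloc(buf, size) != cudaSuccess) {
        fprintf(stderr, "failed to allocate buffer for kv cache\n");
        return false; // handled: model load fails gracefully
    }
    return true;
}

// Later path (graph allocation at prompt time): the same OOM goes through
// the assert macro and takes the whole application down.
static void alloc_graph(void ** buf, size_t size) {
    CUDA_CHECK_ABORT(cudaMalloc(buf, size));
}

int main() {
    void * p = nullptr;
    const size_t huge = (size_t)1 << 40; // 1 TiB, guaranteed to OOM
    if (!alloc_kv_cache(&p, huge)) {
        fprintf(stderr, "LLAMA ERROR: failed to init context (recoverable)\n");
    }
    alloc_graph(&p, huge); // aborts instead of returning an error
    return 0;
}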

dsjlee commented 1 month ago

I was able to reproduce this issue - it's related to OOM. If OOM happens early enough, we handle it correctly:

Regarding OOM, the cause of my issue may be different. All the models I mentioned run fine in Vulkan mode; Phi-3 Instruct, for example, occupies 3 GB out of 6 GB of VRAM. I can see GPT4All loading the model into VRAM in CUDA mode, but it just crashes when the prompt is submitted. Is there an error log that GPT4All saves somewhere?

Alex-work-1 commented 1 month ago

CPU crashes too

It crashes on CPU too. I have a 2017 MacBook Air and it worked fine on CPU, but after a recent update it crashes on long prompts and clears the clipboard (copied text) from RAM. I suspect it stopped using swap memory and crashes when RAM runs out.

How to reproduce:

Laptop specifications:

cebtenzzre commented 1 month ago

I have MacBook Air 2017

This issue is specifically related to an out-of-memory condition on NVIDIA graphics cards. Since you do not have an NVIDIA graphics card, this is not your issue - please open a new one.

Is it possible to roll back an update without uninstalling and reinstalling GPT4All?

You can install v2.7.5 from here but it has to be installed to a clean directory - there is no one-step rollback.

It'd be best if you kept the latest version around so there's a better chance we can find the issue and fix it :P

cebtenzzre commented 1 month ago

Is there error log that GPT4All is saving somewhere?

If it's hitting a GGML_ASSERT then something is at least logged to stderr - but the Windows version of GPT4All has no console unless you build it from source with this line commented out.

I'm going to fix the known crash first (which on the surface has the exact same symptoms), and if it still crashes for you then we can try and diagnose your exact issue.

cebtenzzre commented 1 month ago

@dsjlee You can try the linked PR and see if it fixes your issue. I'm building an offline installer for it now; when it's done, you will see it under the artifacts tab here.

Another way to see if your issue is the same as the crash I found is to reduce the GPU Layers setting to 1. If you no longer see a crash, then the issue is certainly OOM related.
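For context, the GPU Layers setting maps to llama.cpp's n_gpu_layers model parameter, so a value of 1 offloads only a single layer and needs very little VRAM. A minimal sketch of the equivalent through the llama.cpp C API (names as of mid-2024; the model filename is just a placeholder):

// Sketch: load a model with only 1 layer offloaded to the GPU, analogous to
// setting "GPU Layers" to 1 in the GPT4All settings. Assumes the llama.cpp
// C API circa mid-2024; the model filename is a placeholder.
#include <cstdio>
#include "llama.h"

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 1; // offload a single layer -> tiny VRAM footprint

    llama_model * model = llama_load_model_from_file(
        "Meta-Llama-3-8B-Instruct.Q4_0.gguf", mparams);
    if (model == nullptr) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    // ... create a context, evaluate the prompt, etc. ...

    llama_free_model(model);
    llama_backend_free();
    return 0;
}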

dsjlee commented 1 month ago

Another way to see if your issue is the same as the crash I found is to reduce the GPU Layers setting to 1. If you no longer see a crash, then the issue is certainly OOM related.

I set GPU Layers to 1 and saw GPU VRAM utilization of 0.4 GB out of 6 GB. It crashed nonetheless. I'm not looking for a resolution of this issue, as I have several other ways of running LLMs on my machines. Feel free to close it. I only opened it because other people were reporting the same thing on Discord just after the v2.8.0 release.

cebtenzzre commented 1 month ago

It crashed nonetheless.

This is a console build of GPT4All. If you run it from a command prompt (%USERPROFILE%\gpt4all\bin\chat), it will log any CUDA errors on stderr. That way we'll at least know what CUDA is tripping on, if not OOM.

dsjlee commented 1 month ago

This is a console build of GPT4All. If you run it from a command prompt (%USERPROFILE%\gpt4all\bin\chat), it will log any CUDA errors on stderr. That way we'll at least know what CUDA is tripping on, if not OOM.

I installed and ran v2.8.1, and it was able to finish generating a response with the Phi-3 Instruct model with the CUDA GPU selected in settings. I didn't see any error messages in the console, so whatever changed in v2.8.1 works, I guess. GPU utilization did show near 100%. The only minor thing is that "Device:" is blank underneath where it displays tokens/sec.

cebtenzzre commented 4 weeks ago

Console output from raptoreum ("Wizz") on the Discord:

ggml_cuda_compute_forward: RMS_NORM failed
CUDA error: the provided PTX was compiled with an unsupported toolchain.
  current device: 0, in function ggml_cuda_compute_forward at C:\Users\circleci\project\gpt4all-backend\llama.cpp-mainline\ggml-cuda.cu:2313
  err
GGML_ASSERT: C:\Users\circleci\project\gpt4all-backend\llama.cpp-mainline\ggml-cuda.cu:62: !"CUDA error"

Possibly related to the driver or CUDA version (since, foolishly, we unconditionally prefer the system CUDA). Waiting for a reply to find out what they have installed.

Occam also sees this on current llama.cpp, and has success if he explicitly adds "86" to CMAKE_CUDA_ARCHITECTURES. AFAIK, this shouldn't be necessary because CMake also builds PTX for all architectures by default.

cebtenzzre commented 3 weeks ago

Basically, this issue happens when the CUDA version reported by nvidia-smi for your given driver version is lower than the CUDA version used to build GPT4All. The online installers of GPT4All v2.8.0 were built with CUDA 12.5.

For most Maxwell and Pascal GPUs we build a binary kernel, so there is no issue. For newer GPUs, you should either update to the latest NVIDIA driver (555.x) or use the offline installer, which was built with CUDA 12.4. Linux users with newer GPUs should use the offline installer, as NVIDIA 555.x is not yet available there.
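If anyone wants to check whether their machine is in this situation, a small standalone diagnostic (not something GPT4All ships) can compare the CUDA version supported by the installed driver against the CUDA runtime the binary was built with. If the driver reports an older version than the toolkit, PTX produced by that toolkit cannot be JIT-compiled, which is exactly the "provided PTX was compiled with an unsupported toolchain" error above:

// Sketch: detect a driver-vs-toolkit CUDA version mismatch that would make
// PTX JIT compilation fail. Standalone diagnostic, not part of GPT4All.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int driver_ver = 0, runtime_ver = 0;
    cudaDriverGetVersion(&driver_ver);   // highest CUDA version the installed driver supports
    cudaRuntimeGetVersion(&runtime_ver); // CUDA runtime this binary was built/linked against

    printf("driver supports CUDA %d.%d, binary built with CUDA %d.%d\n",
           driver_ver / 1000, (driver_ver % 1000) / 10,
           runtime_ver / 1000, (runtime_ver % 1000) / 10);

    if (driver_ver < runtime_ver) {
        printf("driver is older than the toolkit: PTX from this binary may fail to JIT;\n"
               "update the NVIDIA driver or use a build made with an older CUDA toolkit.\n");
        return 1;
    }
    return 0;
}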

StaMic01 commented 3 weeks ago

Thanks, it works! Updated to R555 U1 (555.99) on Win 10 + RTX 6000 Ada Generation.