nomic-ai / gpt4all

GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.
https://nomic.ai/gpt4all
MIT License

Linux AMD GPU crashes Display-Server if more than 2 Threads on GPU. Sysrq out of it. #1862

Chris2000SP closed this issue 9 months ago

Chris2000SP commented 9 months ago

System Info

Arch Linux
AMD Ryzen 5800X3D
AMD Radeon RX 6800 XT
GPT4All 2.6.1 from AUR ("aur/gpt4all-chat")

vulkaninfo.txt

Reproduction

With the default of 4 threads and Device set to Auto (or with any setting above 2 threads), GPT4All crashes the display server. I have to use SysRq to get out of it.

Expected behavior

The display server and the app should not crash; the app should show an error for this setting instead.

Note: I had the same problem with ROCm and leela-zero. ROCm has since fixed it and now prints an error message. As far as I know, GPT4All uses Vulkan.

cebtenzzre commented 9 months ago

This setting controls the number of CPU threads used for non-GPU operations, so I think it's really odd that it crashes your display server. Did you check the log of Xorg or Wayland after it crashes (Ctrl-Alt-F2 to another tty first) to see what happened - or does the whole GPU driver crash (no response from the display at all)?

danisztls commented 9 months ago

@cebtenzzre I had a similar crash that caused GDM (Gnome) to restart. I was switching back and forth between Instruct and OpenOrca. I had radeontop opened and I overflowed the GTT (~16GB) before overflowing the VRAM (8GB).
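GTT is the part of system RAM that the driver sets aside for the GPU, so it can fill up separately from VRAM. Besides watching radeontop, the same counters can be read from sysfs. The sketch below assumes the amdgpu driver exposes `mem_info_vram_used`, `mem_info_vram_total`, `mem_info_gtt_used` and `mem_info_gtt_total` (in bytes) under the card's device directory, and that the GPU is `card0`; the function names are made up for illustration.

```python
from pathlib import Path

# Counters exposed by the amdgpu kernel driver, in bytes.
FIELDS = ("vram_used", "vram_total", "gtt_used", "gtt_total")


def read_amdgpu_memory(device_dir):
    """Return {field: bytes} for the amdgpu counters found in device_dir."""
    base = Path(device_dir)
    stats = {}
    for field in FIELDS:
        path = base / f"mem_info_{field}"
        if path.is_file():
            stats[field] = int(path.read_text().strip())
    return stats


def usage_ratio(stats, pool):
    """Fraction of the pool ('vram' or 'gtt') in use, or None if unknown."""
    used = stats.get(f"{pool}_used")
    total = stats.get(f"{pool}_total")
    if used is None or not total:
        return None
    return used / total


if __name__ == "__main__":
    # Assumption: the GPU is card0; on multi-GPU systems it may be card1, etc.
    stats = read_amdgpu_memory("/sys/class/drm/card0/device")
    for pool in ("vram", "gtt"):
        ratio = usage_ratio(stats, pool)
        print(pool, "unknown" if ratio is None else f"{ratio:.0%}")
```

Running this in a loop while switching models should show which of the two pools fills up first.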

Chris2000SP commented 9 months ago

The CPU fan of my Noctua cooler goes to 100% and the display freezes. I definitely have to use SysRq or hard-reset the PC. The kernel does not panic, and a VoIP session (Mumble) that was running kept working during the freeze. I could have tried to SSH into the PC from my laptop if sshd had been running, but it wasn't. I haven't checked the log files, though.

EDIT: @danisztls I think the VRAM filling up has nothing to do with the crash, if you mean switching models in GPT4All. I tried that. For me, the crash really is caused by the thread setting with the GPU. Please reevaluate that.

Chris2000SP commented 9 months ago

OK, I checked the journal:

journalctl -b #

Jan 21 23:28:43 pc kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jan 21 23:28:43 pc kernel: [drm:amdgpu_cs_parser_bos.isra.0 [amdgpu]] *ERROR* amdgpu_vm_validate_pt_bos() failed.
Jan 21 23:28:43 pc kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jan 21 23:28:43 pc kernel: [drm:amdgpu_cs_parser_bos.isra.0 [amdgpu]] *ERROR* amdgpu_vm_validate_pt_bos() failed.
Jan 21 23:28:43 pc kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jan 21 23:28:43 pc kernel: [drm:amdgpu_cs_parser_bos.isra.0 [amdgpu]] *ERROR* amdgpu_vm_validate_pt_bos() failed.
Jan 21 23:28:43 pc kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jan 21 23:28:43 pc kernel: [drm:amdgpu_cs_parser_bos.isra.0 [amdgpu]] *ERROR* amdgpu_vm_validate_pt_bos() failed.
Jan 21 23:28:43 pc kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jan 21 23:28:43 pc kernel: [drm:amdgpu_cs_parser_bos.isra.0 [amdgpu]] *ERROR* amdgpu_vm_validate_pt_bos() failed.
Jan 21 23:28:43 pc kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jan 21 23:28:43 pc kernel: [drm:amdgpu_cs_parser_bos.isra.0 [amdgpu]] *ERROR* amdgpu_vm_validate_pt_bos() failed.
Jan 21 23:28:43 pc kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jan 21 23:28:43 pc kernel: [drm:amdgpu_cs_parser_bos.isra.0 [amdgpu]] *ERROR* amdgpu_vm_validate_pt_bos() failed.
Jan 21 23:28:43 pc kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jan 21 23:28:43 pc kernel: [drm:amdgpu_cs_parser_bos.isra.0 [amdgpu]] *ERROR* amdgpu_vm_validate_pt_bos() failed.
Jan 21 23:28:43 pc kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jan 21 23:28:43 pc kernel: [drm:amdgpu_cs_parser_bos.isra.0 [amdgpu]] *ERROR* amdgpu_vm_validate_pt_bos() failed.
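Since the same two kernel messages repeat many times, it can help to collapse a journal excerpt into counts per message. This is a minimal sketch, assuming the lines have the format shown above; the regular expression and the function name are illustrative and not part of any tool.

```python
import re
from collections import Counter

# Matches lines such as:
#   Jan 21 23:28:43 pc kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory ...
AMDGPU_ERROR = re.compile(
    r"kernel: \[drm:(?P<func>[\w.]+) \[amdgpu\]\] \*ERROR\* (?P<msg>.+)$"
)


def summarize_amdgpu_errors(lines):
    """Count amdgpu *ERROR* messages, keyed by (kernel function, message)."""
    counts = Counter()
    for line in lines:
        match = AMDGPU_ERROR.search(line.rstrip())
        if match:
            counts[(match["func"], match["msg"])] += 1
    return counts
```

The input could be the saved output of `journalctl -k -b -1` (kernel messages from the previous boot), passed in as a list of lines.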

About 2 minutes later, the journal ends with a lot of processes being killed and this:

Jan 21 23:28:53 pc kernel: Out of memory: Killed process 1496 (plasmashell) total-vm:9754260kB, anon-rss:276404kB, file-rss:476kB, shmem-rss:36kB, UID:1000 pgtables:2696kB oom_score_adj:200
EDIT: I also found this:

Jan 21 23:27:21 pc plasmashell[15379]: [Warning] (Sun Jan 21 23:27:21 2024): "Could not convert argument 0 at"
Jan 21 23:27:21 pc plasmashell[15379]: [Warning] (Sun Jan 21 23:27:21 2024):          "expression for globalPoint@qrc:/gpt4all/main.qml:860"
Jan 21 23:27:21 pc plasmashell[15379]: [Warning] (Sun Jan 21 23:27:21 2024): "Could not convert argument 0 at"
Jan 21 23:27:21 pc plasmashell[15379]: [Warning] (Sun Jan 21 23:27:21 2024):          "expression for globalPoint@qrc:/gpt4all/main.qml:860"
Jan 21 23:27:21 pc plasmashell[15379]: [Warning] (Sun Jan 21 23:27:21 2024): "Could not convert argument 0 at"
Jan 21 23:27:21 pc plasmashell[15379]: [Warning] (Sun Jan 21 23:27:21 2024):          "expression for globalPoint@qrc:/gpt4all/main.qml:860"
Jan 21 23:27:21 pc plasmashell[15379]: [Warning] (Sun Jan 21 23:27:21 2024): qrc:/gpt4all/main.qml:860: TypeError: Passing incompatible arguments to C++ functions from JavaScript is not allowed.
Jan 21 23:27:21 pc plasmashell[15379]: [Warning] (Sun Jan 21 23:27:21 2024): "Could not convert argument 0 at"
Jan 21 23:27:21 pc plasmashell[15379]: [Warning] (Sun Jan 21 23:27:21 2024):          "expression for globalPoint@qrc:/gpt4all/main.qml:860"
Jan 21 23:27:21 pc plasmashell[15379]: [Warning] (Sun Jan 21 23:27:21 2024): "Could not convert argument 0 at"
Jan 21 23:27:21 pc plasmashell[15379]: [Warning] (Sun Jan 21 23:27:21 2024):          "expression for globalPoint@qrc:/gpt4all/main.qml:860"
Jan 21 23:27:21 pc plasmashell[15379]: [Warning] (Sun Jan 21 23:27:21 2024): "Could not convert argument 0 at"
Jan 21 23:27:21 pc plasmashell[15379]: [Warning] (Sun Jan 21 23:27:21 2024):          "expression for globalPoint@qrc:/gpt4all/main.qml:860"
Jan 21 23:27:21 pc plasmashell[15379]: [Warning] (Sun Jan 21 23:27:21 2024): qrc:/gpt4all/main.qml:860: TypeError: Passing incompatible arguments to C++ functions from JavaScript is not allowed.
Did that do something wrong?

cebtenzzre commented 9 months ago

This really isn't a GPT4All bug - you are running out of either system RAM or GPU VRAM. Try a smaller model.

Linux does tend to freeze when it runs out of system RAM instead of killing the process, as it has pathological swapping behavior in some cases. This is more of a kernel bug than an app bug. I use some kernel patches similar to this to prevent this from happening with other programs: https://github.com/hakavlad/le9-patch
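One way to avoid reaching the point where the system starts swapping heavily is to compare the model size against `MemAvailable` from `/proc/meminfo` before loading it. The sketch below is only an illustration: the function names and the 2 GiB headroom are assumptions, not something GPT4All does.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {key: int}; most values are in kB."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def fits_in_ram(model_bytes, meminfo, headroom_bytes=2 * 1024**3):
    """True if MemAvailable covers the model plus a safety margin.

    headroom_bytes is a hypothetical margin for the rest of the system.
    """
    available = meminfo.get("MemAvailable", 0) * 1024  # kB -> bytes
    return model_bytes + headroom_bytes <= available
```

On Linux the input would come from `open("/proc/meminfo").read()`. The check only covers system RAM; memory the GPU driver takes for VRAM or GTT is a separate matter.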

danisztls commented 9 months ago

> @danisztls I think the VRAM filling has nothing to do with the crash if you mean switching the Models in gpt4all. I tried that. It really is the Threads problem on GPU for the crash for me. Please reevaluate that.

It was the GTT. Not RAM or VRAM but rather an allocated area in the RAM for the GPU to use.

> This really isn't a GPT4All bug - you are running out of either system RAM or GPU VRAM. Try a smaller model.

8GB VRAM is supposed to handle a 3.8GB model. The problem might be unwanted threading.

cebtenzzre commented 9 months ago

> I had a similar crash that caused GDM (Gnome) to restart. I was switching back and forth between Instruct and OpenOrca. I had radeontop opened and I overflowed the GTT (~16GB) before overflowing the VRAM (8GB).

This is #1840, unless you can also get it to happen by only changing the number of CPU threads without switching models.