Vulkan (Kompute) Crashes on Windows

cosmic-snow commented 4 months ago

Note: I'm putting a few notes down now, but I feel like it might have something to do with my system. I will probably expand on this after investigating a bit more.

Bug Report

Two crashes related to the Vulkan GPU backend. Loading a model once and chatting with it works, though.

...

Steps to Reproduce

Crash 1

Have Device set to Vulkan
Open the application.
(Don't load a model)
Close the application.

Crash 2

Have Device set to Vulkan
Load a model
Reload the model with the button next to the dropdown

Expected Behavior

No crashes.

Your Environment

GPT4All version: v3.1.1 offline installed & current main, built myself (e45685b)
Operating System: Windows 10; NVIDIA driver 560.70
Chat model used (if applicable): does not look like it matters

Notes:

I'm not entirely sure whether these two crashes are related, but pretty sure they're related to the backend update and accompanying changes, although I haven't tested everything on an old version yet.

Things I've tried which had no impact

Different versions of the Vulkan SDK (only crash 1)
Different versions of the NVIDIA driver (only crash 1)
Turns out, OBS Studio puts a layer in Vulkan even when it isn't started at all. Tried to disable the layer, then later uninstalled OBS completely.

What works so far

Load a model exactly once
Set LLMODEL_KOMPUTE=OFF
Env var DISABLE_LAYER_NV_OPTIMUS_1=1 (only crash 1)

Misc.

First crash may be related to "We now free the device and Vulkan instance as late as possible ..." in #2694.

I've also built it with kompute logging set to debug, which came with some other issues itself; however, one thing I've seen: Looks like before the second crash the Manager gets destroyed. Not sure about that yet, however.

cebtenzzre commented 4 months ago

I have so far been unable to reproduce this issue with an NVIDIA GPU on Windows. I believe this is specific to having an Intel GPU in a Windows machine (even if GPT4All does not use it), but I do not have a Windows machine with an Intel GPU conveniently available to test with.

cosmic-snow commented 4 months ago

Yes, I have not forgotten about this and I still have some ideas I want to try at some later point.

However, I'm currently set up for a CPU-only, single backend build; it simplifies some things and this issue doesn't get in the way.

cosmic-snow commented 3 months ago

At least the first one seems to be Optimus related. Setting the environment variable DISABLE_LAYER_NV_OPTIMUS_1=1 prevented it from crashing on shutdown.

I've also tried:

__VK_LAYER_NV_optimus=NVIDIA_only
__NV_PRIME_RENDER_OFFLOAD=1
__VK_LAYER_NV_optimus=non_NVIDIA_only

but these didn't seem to change anything

mshakirDr commented 3 months ago

I have just installed GPT4All 3.2.1. Intel 13th Gen i7 (integrated UHD graphics) and nvidia discrete Ada 2000 GPU (Windows 11, Lenovo P16 gen2). For the first time ever I have seen the 3D (shader?) part of my intel graphics card show activity (30-50%) when a chat related application is running. The activity starts as soon as GPT4All is run, and stops as soon the application is closed. The program uses nvidia GPU for embedding and chat model inference as per my observation. So the activity in the intel graphics card appears to be a bug or irrelevant processing. Maybe it is helpful for further investigation.

cosmic-snow commented 3 months ago

Qt itself does use GPU resources to draw the GUI, so that's probably where that activity on your system is coming from.

For me, that part is also running on the other GPU.

But thanks for the info, it can't be excluded that there's a connection. I might want to try changing how the workload is split at some point, too.

(Figuring these out is kind of low-priority for me because it doesn't look like this is widespread and they don't occur with typical usage of the application, either.)

cosmic-snow commented 3 months ago

So for the second one, it gets kind of tricky, because it does not occur in a debug build of the version I'm currently on (3ba9c6344d3610ff8b2d54b650f97eb288f6d6d1). Additionally, from what I've gathered before, it's some kind of heap corruption, so that makes it extra elusive.

However, I've made a RelWithDebugInfo build, ran that with the env var above so the first one doesn't occur, then activated all 'Basics' checks in Application Verifier. Here's the log of the first break:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<avrf:logfile xmlns:avrf="Application Verifier">
    <avrf:logSession TimeStarted="2024-08-16 : 08:54:00" PID="13520" Version="2">
        <avrf:logEntry Time="2024-08-16 : 08:55:00" LayerName="Leak" StopCode="0x901" Severity="Error">
            <avrf:message>A HANDLE was leaked.</avrf:message>
            <avrf:parameter1>10e4 - Value of the leaked handle. Run !htrace &lt;handle&gt; to get additional information about the handle if handle tracing is enabled.</avrf:parameter1>
            <avrf:parameter2>14977fd7060 - Address to the allocation stack trace. Run dps &lt;address&gt; to view the allocation stack.</avrf:parameter2>
            <avrf:parameter3>1497a42bfe2 - Address of the owner dll name. Run du &lt;address&gt; to read the dll name.</avrf:parameter3>
            <avrf:parameter4>7fff52350000 - Base of the owner dll. Run .reload &lt;dll_name&gt; = &lt;address&gt; to reload the owner dll. Use &apos;lm&apos; to get more information about the loaded and unloaded modules.</avrf:parameter4>
            <avrf:stackTrace>
                <avrf:trace>vfbasics!+7fff60948ad4 ( @ 0)</avrf:trace>
                <avrf:trace>vfbasics!+7fff60949725 ( @ 0)</avrf:trace>
                <avrf:trace>ntdll!RtlRemoveVectoredContinueHandler+190 ( @ 0)</avrf:trace>
                <avrf:trace>ntdll!memset+1b680 ( @ 0)</avrf:trace>
                <avrf:trace>ntdll!LdrUnloadDll+11a ( @ 0)</avrf:trace>
                <avrf:trace>ntdll!LdrUnloadDll+94 ( @ 0)</avrf:trace>
                <avrf:trace>KERNELBASE!FreeLibrary+1e ( @ 0)</avrf:trace>
                <avrf:trace>vulkan-1!vkResetEvent+3f123 ( @ 0)</avrf:trace>
                <avrf:trace>vulkan-1!vkResetEvent+168f8 ( @ 0)</avrf:trace>
                <avrf:trace>vulkan-1!vkResetEvent+4bd4e ( @ 0)</avrf:trace>
                <avrf:trace>llamamodel-mainline-kompute!vk::Instance::destroy&lt;vk::DispatchLoaderDynamic&gt;+b (D:\dev\wrka\tmp5\gpt4all\build-gpt4all-chat-Desktop_Qt_6_5_1_MSVC2019_64bit-RelWithDebInfo\_deps\vulkan_header-src\include\vulkan\vulkan_funcs.hpp @ 81)</avrf:trace>
                <avrf:trace>llamamodel-mainline-kompute!kp::Manager::~Manager+4f (D:\dev\wrka\tmp5\gpt4all\gpt4all-backend\llama.cpp-mainline\ggml\src\kompute\src\Manager.cpp @ 80)</avrf:trace>
                <avrf:trace>llamamodel-mainline-kompute!ggml_backend_kompute_device_unref+59 (D:\dev\wrka\tmp5\gpt4all\gpt4all-backend\llama.cpp-mainline\ggml\src\ggml-kompute.cpp @ 1961)</avrf:trace>
                <avrf:trace>llamamodel-mainline-kompute!ggml_backend_buffer_free+18 (D:\dev\wrka\tmp5\gpt4all\gpt4all-backend\llama.cpp-mainline\ggml\src\ggml-backend.c @ 86)</avrf:trace>
                <avrf:trace>llamamodel-mainline-kompute!llama_model::~llama_model+58 (D:\dev\wrka\tmp5\gpt4all\gpt4all-backend\llama.cpp-mainline\src\llama.cpp @ 2739)</avrf:trace>
                <avrf:trace>llamamodel-mainline-kompute!llama_free_model+12 (D:\dev\wrka\tmp5\gpt4all\gpt4all-backend\llama.cpp-mainline\src\llama.cpp @ 19086)</avrf:trace>
                <avrf:trace>llamamodel-mainline-kompute!LLamaModel::~LLamaModel+36 (D:\dev\wrka\tmp5\gpt4all\gpt4all-backend\llamamodel.cpp @ 516)</avrf:trace>
                <avrf:trace>llamamodel-mainline-kompute!LLamaModel::`scalar deleting destructor&apos;+14 ( @ 0)</avrf:trace>
                <avrf:trace>chat!LLModelInfo::resetModel+25 (D:\dev\wrka\tmp5\gpt4all\gpt4all-chat\chatllm.cpp @ 97)</avrf:trace>
                <avrf:trace>chat!ChatLLM::unloadModel+a8 (D:\dev\wrka\tmp5\gpt4all\gpt4all-chat\chatllm.cpp @ 857)</avrf:trace>
                <avrf:trace>Qt6Core!QMetaCallEvent::placeMetaCall+3b (C:\Users\qt\work\qt\qtbase\src\corelib\kernel\qobject.cpp @ 628)</avrf:trace>
                <avrf:trace>Qt6Core!QObject::event+156 (C:\Users\qt\work\qt\qtbase\src\corelib\kernel\qobject.cpp @ 1391)</avrf:trace>
                <avrf:trace>Qt6Core!QCoreApplication::notify+67 (C:\Users\qt\work\qt\qtbase\src\corelib\kernel\qcoreapplication.cpp @ 1195)</avrf:trace>
                <avrf:trace>Qt6Core!QCoreApplication::notifyInternal2+c5 (C:\Users\qt\work\qt\qtbase\src\corelib\kernel\qcoreapplication.cpp @ 1115)</avrf:trace>
                <avrf:trace>Qt6Core!QCoreApplicationPrivate::sendPostedEvents+225 (C:\Users\qt\work\qt\qtbase\src\corelib\kernel\qcoreapplication.cpp @ 1895)</avrf:trace>
                <avrf:trace>Qt6Core!QEventDispatcherWin32::processEvents+90 (C:\Users\qt\work\qt\qtbase\src\corelib\kernel\qeventdispatcher_win.cpp @ 464)</avrf:trace>
                <avrf:trace>Qt6Core!QEventLoop::exec+194 (C:\Users\qt\work\qt\qtbase\src\corelib\kernel\qeventloop.cpp @ 182)</avrf:trace>
                <avrf:trace>Qt6Core!QThread::exec+bd (C:\Users\qt\work\qt\qtbase\src\corelib\thread\qthread.cpp @ 578)</avrf:trace>
                <avrf:trace>Qt6Core!QThreadPrivate::start+131 (C:\Users\qt\work\qt\qtbase\src\corelib\thread\qthread_win.cpp @ 292)</avrf:trace>
                <avrf:trace>vfbasics!+7fff6095752e ( @ 0)</avrf:trace>
                <avrf:trace>KERNEL32!BaseThreadInitThunk+14 ( @ 0)</avrf:trace>
                <avrf:trace>ntdll!RtlUserThreadStart+21 ( @ 0)</avrf:trace>
            </avrf:stackTrace>
        </avrf:logEntry>
    </avrf:logSession>
</avrf:logfile>

Not sure about the handle leak there, that can't be all of it. The stack shows where it usually goes awry, though.

Attached is the full log (repeating the steps for 2) and running to the crash. chat.exe.dat.xml.zip

I think there is some kind of race condition.

cebtenzzre commented 3 months ago

Additionally, from what I've gathered before, it's some kind of heap corruption, so that makes it extra elusive.

I'm not familiar with Application Verifier. Maybe you could try building with Address Sanitizer enabled? It's a good way to check for heap corruption.

The handle leak is interesting, as it may point to a resource leak that I suspected to be internal to the NVIDIA drivers when I was debugging why I couldn't vkDestroyInstance/vkCreateInstance in a loop without crashing—perhaps I missed some bug in Kompute. But it wouldn't cause a crash without reloading the model quite a few times.

The heap corruption report would explain the crash, but the stack trace is all over the place and doesn't really make any sense.

cosmic-snow commented 3 months ago

The heap corruption report would explain the crash, but the stack trace is all over the place and doesn't really make any sense.

Not sure why you say that, it's consistently in the clean-up routine of the Kompute Manager where it detects something amiss.

Also note: It's 3 handle leaks followed by 2 virtual reservation leaks followed by a crash.

By the way, I don't understand much about Vulkan, but is there a good reason to repeatedly (= whenever a model is un/loaded) create and destroy that Vulkan Instance? From what I've seen in the meantime, shouldn't it just be once when Vulkan is initialised (= I guess the first time a model is loaded through Vulkan) and then clean-up at application exit?

But it wouldn't cause a crash without reloading the model quite a few times.

I did reload it several times when I had this enabled. I guess I should've written down every step.

If I repeat it right now without Application Verifier:

Load Phi-3 Mini Instruct
Eject
Reload
Eject => crash

cosmic-snow commented 3 months ago

With ASan (but without Application Verifier) I'm getting yet another error:

At least with this kind of build, I'm seeing the locals instead of it telling me they've been optimised away. The library it tries to unload is indeed nvoglv64.dll. The second one in the list to unload would've been igvk64.dll.

Also, I had to skip many complaints when I attached the debugger at the start, I'm not sure if they're relevant (probably not). They were in sgemm.cpp:

#if defined(__F16C__)
template <> inline __m256 load(const ggml_fp16_t *p) {
    return _mm256_cvtph_ps(_mm_loadu_si128((const __m128i *)p));
}
#endif // __F16C__

Although I think I've skipped something in ggml.c by accident, too. These were before even loading a model, however. I guess I'll repeat this again. Edit: I think these were all just due to it trying to embed in the background. But I have this set to CPU for the time being, anyway.

Other than that, ASan didn't complain, so I guess that maybe points to the driver? Or it's because of what it complains about there (see log). Although from what I've read, ASan isn't a panacea, either.

Attaching ASan output: asan-log-01.txt

Edit: Only tangentially related, but turns out I can get a few more symbols & sources:

~~Nvidia symbol server: https://driver-symbols.nvidia.com/~~ Don't bother, it doesn't include PDBs
PDB for vulkan-1.dll in (at least that's the version in my System32): https://sdk.lunarg.com/sdk/download/1.3.280.0/windows/VulkanRT-1.3.280.0-Components.zip

source for that:

git clone https://github.com/KhronosGroup/Vulkan-Loader.git
cd Vulkan-Loader
git checkout vulkan-sdk-1.3.280.0

cebtenzzre commented 3 months ago

Not sure why you say that, it's consistently in the clean-up routine of the Kompute Manager where it detects something amiss.

vk::Instance::destroy is clear enough, but vkResetEvent -> FreeLibrary -> LdrUnloadDll -> memset -> RtlRemoveVectoredContinueHandler makes no sense. It seems like it is having trouble resolving the function names, and simply picking something nearby to provide the offset.

At least with this kind of build, I'm seeing the locals instead of it telling me they've been optimised away. The library it tries to unload is indeed nvoglv64.dll. The second one in the list to unload would've been igvk64.dll.

This is really difficult, since the crash is happening internally to the NVIDIA Vulkan driver. If it was better written, it would complain if GPT4All was doing something incorrect instead of trying to free a null pointer. But possibly we are invoking undefined behavior somewhere, and we are lucky there aren't demons flying out of our noses. It's hard to say since the driver in question is proprietary, and as far as I know we are using its API exactly as intended.

Other than that, ASan didn't complain, so I guess that maybe points to the driver? Or it's because of what it complains about there (see log). Although from what I've read, ASan isn't a panacea, either.

It can't do bounds checking inside of dependent library code that wasn't compiled with it enabled, but I have never seen a false positive from it.

By the way, I don't understand much about Vulkan, but is there a good reason to repeatedly (= whenever a model is un/loaded) create and destroy that Vulkan Instance? From what I've seen in the meantime, shouldn't it just be once when Vulkan is initialised (= I guess the first time a model is loaded through Vulkan) and then clean-up at application exit?

The difficult part is that llama.cpp's ggml-backend interface, which is basically modeled after what CUDA and Metal need, has no well-defined concept of global program initialization/cleanup. The test suite certainly doesn't call anything like that, so to get it to pass without leaking the Vulkan instance and all of its associated state requires cleaning it up once the model and context are both freed.

You might think - why not just use a destructor on a global static object? The problem with this is that the dynamically loaded DLLs for the Vulkan driver also use a similar mechanism to do cleanup when the program exits. Unfortunately, we need a functioning Vulkan driver in order to call vkDestroyInstance, so we tend to crash during exit if we do it this way. See also #2843.

cosmic-snow commented 3 months ago

vk::Instance::destroy is clear enough, but vkResetEvent -> FreeLibrary -> LdrUnloadDll -> memset -> RtlRemoveVectoredContinueHandler makes no sense. It seems like it is having trouble resolving the function names, and simply picking something nearby to provide the offset.

Ah right, I didn't bother with what it said at that point in the stack. Can only get as far as they let you look with this kind of thing, I guess, and I'm not up to hunting down objects & allocations without proper tools or go through assembly. I don't even really know enough about the existing codebase, anyway.

It can't do bounds checking inside of dependent library code that wasn't compiled with it enabled, but I have never seen a false positive from it.

I meant that it sometimes can't catch where it goes off the rails.

The difficult part is that llama.cpp's ggml-backend interface, which is basically modeled after what CUDA and Metal need, has no well-defined concept of global program initialization/cleanup. ...

So what's the architectural difference (if any) between this one and the other Vulkan backend? Sounds like that might have the same problem then? Maybe I should try Vulkan backends with some other llama.cpp based application at some point and see if that fails somewhere, too.

So anyway, with how flaky this looks, and I've said earlier:

I think there is some kind of race condition.

Here's another idea: Maybe put a lock around both instantiation and clean-up of that Instance (and maybe other resources) if it's not possible to have it as a global resource? If there isn't one already in place, that is. I don't know how all of this is coordinated and I don't know if there are also special primitives for that in Vulkan, but if possible, it can't hurt to try.

Setting DISABLE_LAYER_NV_OPTIMUS_1=1 env var works for one of these cases, I've set it on/off quite a number of times now and this works, so I guess these two devices just don't play all that well together on this system. Optimus is supposed to "... seamlessly switch between two graphics adapters within a computer system ...". So I think it's possible the heap corruption could be due to some unsynchronised access somewhere.

Speculation on my behalf, but it's not unreasonable to assume the other case is due to some interplay between the two devices, either. Especially also because the GUI uses some GPU resources itself (although that's probably through DirectX).

mshakirDr commented 3 months ago

Qt itself does use GPU resources to draw the GUI, so that's probably where that activity on your system is coming from.

For me, that part is also running on the other GPU.

But thanks for the info, it can't be excluded that there's a connection. I might want to try changing how the workload is split at some point, too.

(Figuring these out is kind of low-priority for me because it doesn't look like this is widespread and they don't occur with typical usage of the application, either.)

I am running another Qt based application Audacious medial player. It does not use the integrated GPU that frequently. This behaviour is only specific to GPT4All as far as I can observe. I hope it can be confirmed further.

cosmic-snow commented 3 months ago

I am running another Qt based application Audacious medial player. It does not use the integrated GPU that frequently. This behaviour is only specific to GPT4All as far as I can observe. I hope it can be confirmed further.

Well, this issue is for crashes and for the Vulkan backend, specifically. If it doesn't actually crash for you, maybe have a look at #2538? That has not been solved yet.

nomic-ai / gpt4all