gnif / LookingGlass

An extremely low-latency KVMFR (KVM FrameRelay) implementation for guests with VGA PCI passthrough.
GNU General Public License v2.0

Host: Consistent divide-by-zero exception when using NVFBC after video driver recovery #1106

Closed: rennr closed this 8 months ago

rennr commented 8 months ago

Immediately after the NVIDIA video driver recovered from an error (event log warning: "Display driver nvlddmkm stopped responding and has successfully recovered."), looking-glass-host.exe started crashing repeatedly, every few seconds, with the error below. Oddly, I can still connect to the host during this time and use it without any noticeable side effects.

However, the crashes also fill the taskbar with hundreds of tray icons, without end, and flood the event log with crash reports.

00:00:00.872 [I]               app.c:408  | captureStart                   | ==== [ Capture Start ] ====
00:00:01.880 [E]             crash.c:88   | exception_filter               | ==== FATAL CRASH (B6-233-e5a9c024) ====
00:00:01.881 [E]             crash.c:89   | exception_filter               | exception 0xc0000094 (INT_DIVIDE_BY_ZERO), address is 00007ff7b8ea15f0
00:00:01.984 [E]             crash.c:131  | exception_filter               | [trace]:  1: C:\Program Files\Looking Glass (host)\looking-glass-host.exe:nvfbc_waitFrame+0xc0 (\home\builder\repos\gnif\LookingGlass\build\platform\Windows\capture\NVFBC\host\platform\Windows\capture\NVFBC\src\nvfbc.c:2b0+0x0)
00:00:01.985 [E]             crash.c:131  | exception_filter               | [trace]:  2: C:\Program Files\Looking Glass (host)\looking-glass-host.exe:sendFrame+0x67 (\home\builder\repos\gnif\LookingGlass\build\host\src\app.c:c5+0x0)
00:00:01.986 [E]             crash.c:131  | exception_filter               | [trace]:  3: C:\Program Files\Looking Glass (host)\looking-glass-host.exe:app_main+0x55c (\home\builder\repos\gnif\LookingGlass\build\host\src\app.c:3de+0x5)
00:00:01.987 [E]             crash.c:131  | exception_filter               | [trace]:  4: C:\Program Files\Looking Glass (host)\looking-glass-host.exe:appThread+0x30 (\home\builder\repos\gnif\LookingGlass\build\platform\Windows\host\platform\Windows\src\platform.c:119+0x0)
00:00:01.987 [E]             crash.c:131  | exception_filter               | [trace]:  5: C:\Program Files\Looking Glass (host)\looking-glass-host.exe:threadWrapper+0xf (\home\builder\repos\gnif\LookingGlass\build\common\src\platform\windows\common\src\platform\windows\thread.c:29+0x0)
00:00:01.988 [E]             crash.c:135  | exception_filter               | [trace]:  6: C:\Windows\System32\KERNEL32.DLL:BaseThreadInitThunk+0x1d
00:00:01.989 [E]             crash.c:135  | exception_filter               | [trace]:  7: C:\Windows\SYSTEM32\ntdll.dll:RtlUserThreadStart+0x28
00:00:00.010 [I]              time.c:85   | windowsSetTimerResolution      | System timer resolution: 500.0 μs
00:00:00.011 [I]               app.c:809  | app_main                       | Looking Glass Host (B6-233-e5a9c024)
00:00:00.012 [I]           cpuinfo.c:38   | cpuInfo_log                    | CPU Model: AMD Ryzen Threadripper 3960X 24-Core Processor
00:00:00.012 [I]           cpuinfo.c:39   | cpuInfo_log                    | CPU: 1 sockets, 12 cores, 24 threads
00:00:00.014 [I]           ivshmem.c:132  | ivshmemInit                    | IVSHMEM 0* on bus 0x0, device 0x4, function 0x0
00:00:00.026 [I]               app.c:826  | app_main                       | IVSHMEM Size     : 256 MiB
00:00:00.026 [I]               app.c:827  | app_main                       | IVSHMEM Address  : 0x2080B320000
00:00:00.026 [I]               app.c:828  | app_main                       | Max Pointer Size : 1024 KiB
00:00:00.027 [I]               app.c:829  | app_main                       | KVMFR Version    : 20
00:00:00.027 [I]               app.c:848  | app_main                       | Trying           : NVFBC
00:00:00.031 [I]         wrapper.cpp:94   | NvFBCInit                      | NvFBC SDK Version: 112
00:00:00.523 [I]             nvfbc.c:346  | nvfbc_init                     | DiffMap block    : 128x128
00:00:00.524 [I]             nvfbc.c:347  | nvfbc_init                     | Cursor mode      : decoupled
00:00:00.524 [I]               app.c:873  | app_main                       | Using            : NVFBC
00:00:00.524 [I]               app.c:874  | app_main                       | Capture Method   : Synchronous
00:00:00.525 [I]               app.c:725  | lgmpSetup                      | Max Frame Size   : 126 MiB
00:00:00.525 [I]               app.c:414  | captureStop                    | ==== [ Capture Stop ] ====

(And so on...)

Logs from the client:

renn@Renn-LX:~/Sources/looking-glass/client/build$ ./looking-glass-client -f /dev/kvmfr0 'win:title=VM View' win:size=1440x1080 audio:micDefault=allow
00:00:00.000 [I]              main.c:1859 | main                           | Looking Glass (B6-233-e5a9c024)
00:00:00.000 [I]              main.c:1860 | main                           | Locking Method: Atomic
00:00:00.001 [I]           cpuinfo.c:38   | cpuInfo_log                    | CPU Model: AMD Ryzen Threadripper 3960X 24-Core Processor
00:00:00.001 [I]           cpuinfo.c:39   | cpuInfo_log                    | CPU: 1 sockets, 24 cores, 48 threads
00:00:00.035 [I]              main.c:1185 | lg_run                         | Using font: /usr/share/fonts/truetype/dejavu/DejaVuSansMono.ttf
00:00:00.036 [I]           ivshmem.c:128  | ivshmemOpenDev                 | KVMFR Device     : /dev/kvmfr0
00:00:00.052 [I]             audio.c:159  | audio_init                     | Using AudioDev: PipeWire
00:00:00.052 [I]                ps.c:245  | purespice_connect              | Connecting to socket 127.0.0.1:5900
00:00:00.059 [I]               rsa.c:178  | rsa_encryptPassword            | Using Nettle
00:00:00.060 [I]                ps.c:268  | purespice_connect              | Connected
00:00:00.060 [I]             agent.c:103  | agent_connect                  | Connected to the spice guest agent
00:00:00.060 [I]      channel_main.c:167  | onMessage_mainName             | Guest name: win11
00:00:00.060 [I]      channel_main.c:183  | onMessage_mainUUID             | Guest UUID: ab39206f-5ab9-46f6-8998-5e1d69fdd57e
00:00:00.069 [I]                ps.c:644  | ps_connectChannel              | RECORD channel connected
00:00:00.084 [I]                ps.c:644  | ps_connectChannel              | PLAYBACK channel connected
00:00:00.103 [I]                ps.c:644  | ps_connectChannel              | INPUTS channel connected
00:00:00.103 [I]           channel.c:323  | onMessage_notify               | [notify] keyboard channel is insecure
00:00:00.103 [I]               egl.c:289  | egl_initialize                 | Double buffering is off
00:00:00.103 [I]              main.c:1139 | tryRenderer                    | Using Renderer: EGL
00:00:00.104 [I]           wayland.c:120  | waylandInit                    | Compositor: kwin_wayland_wr
00:00:00.104 [I]           wayland.c:130  | waylandInit                    | Selected  : xdg
00:00:00.134 [I]                gl.c:58   | waylandGetEGLDisplay           | Using eglGetPlatformDisplay
00:00:00.169 [I]               egl.c:856  | egl_renderStartup              | Single buffer mode
00:00:00.173 [I]               egl.c:883  | egl_renderStartup              | EGL     : 1.5
00:00:00.173 [I]               egl.c:884  | egl_renderStartup              | Vendor  : Intel
00:00:00.173 [I]               egl.c:885  | egl_renderStartup              | Renderer: Mesa Intel(R) Arc(tm) A750 Graphics (DG2)
00:00:00.173 [I]               egl.c:886  | egl_renderStartup              | Version : OpenGL ES 3.2 Mesa 23.2.1-1ubuntu3.1
00:00:00.173 [I]               egl.c:887  | egl_renderStartup              | EGL APIs: OpenGL OpenGL_ES 
00:00:00.173 [I]               egl.c:958  | egl_renderStartup              | Debug messages disabled, enable with egl:debug=true
00:00:00.208 [I]           eglutil.c:35   | swapWithDamageInit             | Using EGL_KHR_swap_buffers_with_damage
00:00:00.283 [I]              main.c:1590 | lg_run                         | Guest Information:
00:00:00.283 [I]              main.c:1591 | lg_run                         | Version  : B6-233-e5a9c024
00:00:00.283 [I]              main.c:1612 | lg_run                         | UUID     : ab39206f-5ab9-46f6-8998-5e1d69fdd57e
00:00:00.283 [I]              main.c:1621 | lg_run                         | CPU Model: AMD Ryzen Threadripper 3960X 24-Core Processor
00:00:00.283 [I]              main.c:1622 | lg_run                         | CPU      : 1 sockets, 12 cores, 24 threads
00:00:00.283 [I]              main.c:1624 | lg_run                         | Using    : NVFBC
00:00:00.284 [I]              main.c:1710 | lg_run                         | OS       : Windows
00:00:00.284 [I]              main.c:1712 | lg_run                         | OS Name  : Windows 10 Enterprise (Build: 22631) 
00:00:00.284 [I]              main.c:1734 | lg_run                         | Starting session
00:00:00.285 [I]              main.c:553  | main_frameThread               | Using DMA buffer support
00:00:00.679 [I]              main.c:710  | main_frameThread               | Format: FRAME_TYPE_RGBA10 3840x2160 (3840x2160) stride:3840 pitch:15360 rotation:0 hdr:0 pq:0
00:00:00.680 [E]    texture_dmabuf.c:247  | egl_texDMABUFUpdate            | Failed to create EGLImage for DMA transfer (EGL_BAD_ALLOC)
00:00:00.680 [W]           desktop.c:388  | egl_desktopUpdate              | DMA update failed, disabling DMABUF imports
00:00:03.620 [I]              main.c:710  | main_frameThread               | Format: FRAME_TYPE_RGBA10 3840x2160 (3840x2160) stride:3840 pitch:15360 rotation:0 hdr:1 pq:1

gnif commented 8 months ago

This is not an LG fault: the NVIDIA device has not recovered properly and is returning an invalid dwBufferWidth, which causes the divide-by-zero error. There is nothing we can do about this, as the bug is external to LG.

this->grabStride = this->grabInfo.dwBufferWidth;
this->shmStride = ALIGN_PAD(this->grabStride, 64);
const unsigned int maxHeight = maxFrameSize / (this->shmStride * this->bpp);