CannibalVox opened 1 year ago
You need to use LockOSThread on Windows since you can only render on the main thread.
The linked code does no rendering. Additionally, as I mention in the description above, that isn't true of vulkan.
cc @golang/runtime
This repro has 100 goroutines that concurrently call C.start and C.end, which mutate the globals instance and device. I'm not familiar with Vulkan, but that kind of race condition seems unlikely to be safe. Does this still reproduce if you remove the race condition? (Either by using a single goroutine, locking the globals, or creating a unique instance/device per goroutine.)
They do not call it concurrently; I use a channel to ensure that only one runs at a time.
Doh, apologies.
SUCCEED - Manjaro 6.2.9.1 - GeForce RTX 3060 mobile v530.410
In triage, we think that trying to reproduce with LockOSThread would give us some useful information, even though in theory it shouldn't be necessary. It might also be helpful to try to reproduce with the GC disabled (GOGC=off) to see what happens.
CC @qmuntal for some Windows expertise, maybe?
The difficulty for us in reproducing it is that we don't have a readily available Windows box with the Vulkan library installed on it. (Is that necessary? I assume so, but maybe I'm missing something.)
In triage, we think that trying to reproduce with LockOSThread would give us some useful information, even though in theory it shouldn't be necessary. Also might be helpful to try and reproduce with the GC disabled (GOGC=off) to see what happens.
It cannot be reproduced with LockOSThread, as I mention in the bug description, and I turn the GC off in the linked code using debug.SetGCPercent(-1). Running the GC will repro this, by the way, since it spins up goroutines and waits on them; that's what originally caused the problem and led me to start investigating this.
I don't think that Vulkan is a vital part of the issue, but I don't know what about Vulkan is causing it to trigger; the couple of simple things I tried to repro without Vulkan didn't work.
I have walked other users through setting up Vulkan to repro this issue (I had to, in order to get the AMD repro), so I can do so with others who have a Windows machine if they have the time. It involves installing mingw.
CC @qmuntal for some Windows expertise, maybe?
Can't reproduce this issue using an NVIDIA Quadro T1000 and the latest Vulkan SDK.
@CannibalVox could you share the output of vulkaninfo --summary? Here is mine:
vulkaninfo --summary
==========
VULKANINFO
==========
Vulkan Instance Version: 1.3.224
Instance Extensions: count = 17
-------------------------------
VK_EXT_debug_report : extension revision 10
VK_EXT_debug_utils : extension revision 2
VK_EXT_direct_mode_display : extension revision 1
VK_EXT_swapchain_colorspace : extension revision 4
VK_KHR_device_group_creation : extension revision 1
VK_KHR_display : extension revision 23
VK_KHR_external_fence_capabilities : extension revision 1
VK_KHR_external_memory_capabilities : extension revision 1
VK_KHR_external_semaphore_capabilities : extension revision 1
VK_KHR_get_display_properties2 : extension revision 1
VK_KHR_get_physical_device_properties2 : extension revision 2
VK_KHR_get_surface_capabilities2 : extension revision 1
VK_KHR_portability_enumeration : extension revision 1
VK_KHR_surface : extension revision 25
VK_KHR_surface_protected_capabilities : extension revision 1
VK_KHR_win32_surface : extension revision 6
VK_NV_external_memory_capabilities : extension revision 1
Instance Layers: count = 8
--------------------------
VK_LAYER_KHRONOS_profiles Khronos Profiles layer 1.3.243 version 1
VK_LAYER_KHRONOS_synchronization2 Khronos Synchronization2 layer 1.3.243 version 1
VK_LAYER_KHRONOS_validation Khronos Validation Layer 1.3.243 version 1
VK_LAYER_LUNARG_api_dump LunarG API dump layer 1.3.243 version 2
VK_LAYER_LUNARG_gfxreconstruct GFXReconstruct Capture Layer Version 0.9.19 1.3.243 version 36883
VK_LAYER_LUNARG_monitor Execution Monitoring Layer 1.3.243 version 1
VK_LAYER_LUNARG_screenshot LunarG image capture layer 1.3.243 version 1
VK_LAYER_NV_optimus NVIDIA Optimus layer 1.3.224 version 1
Devices:
========
GPU0:
apiVersion = 4206816 (1.3.224)
driverVersion = 2216148992 (0x8417c000)
vendorID = 0x10de
deviceID = 0x1fb9
deviceType = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
deviceName = Quadro T1000 with Max-Q Design
driverID = DRIVER_ID_NVIDIA_PROPRIETARY
driverName = NVIDIA
driverInfo = 528.95
conformanceVersion = 1.3.3.1
deviceUUID = 4b0115d4-e3bd-e983-c3b0-92f9a06fd48c
driverUUID = fcf93a5f-db54-5d85-a8b0-b89bd3de837b
Here's mine:
VULKANINFO
==========
Vulkan Instance Version: 1.3.236
Instance Extensions: count = 14
-------------------------------
VK_EXT_debug_report : extension revision 10
VK_EXT_debug_utils : extension revision 2
VK_EXT_swapchain_colorspace : extension revision 4
VK_KHR_device_group_creation : extension revision 1
VK_KHR_external_fence_capabilities : extension revision 1
VK_KHR_external_memory_capabilities : extension revision 1
VK_KHR_external_semaphore_capabilities : extension revision 1
VK_KHR_get_physical_device_properties2 : extension revision 2
VK_KHR_get_surface_capabilities2 : extension revision 1
VK_KHR_portability_enumeration : extension revision 1
VK_KHR_surface : extension revision 25
VK_KHR_surface_protected_capabilities : extension revision 1
VK_KHR_win32_surface : extension revision 6
VK_NV_external_memory_capabilities : extension revision 1
Instance Layers: count = 16
---------------------------
VK_LAYER_EOS_Overlay Vulkan overlay layer for Epic Online Services 1.2.136 version 1
VK_LAYER_EOS_Overlay Vulkan overlay layer for Epic Online Services 1.2.136 version 1
VK_LAYER_EOS_Overlay Vulkan overlay layer for Epic Online Services 1.2.136 version 1
VK_LAYER_KHRONOS_synchronization2 Khronos Synchronization2 layer 1.2.182 version 1
VK_LAYER_KHRONOS_validation Khronos Validation Layer 1.2.182 version 1
VK_LAYER_LUNARG_api_dump LunarG API dump layer 1.2.182 version 2
VK_LAYER_LUNARG_device_simulation LunarG device simulation layer 1.2.182 version 1
VK_LAYER_LUNARG_gfxreconstruct GFXReconstruct Capture Layer Version 0.9.8 1.2.182 version 36872
VK_LAYER_LUNARG_monitor Execution Monitoring Layer 1.2.182 version 1
VK_LAYER_LUNARG_screenshot LunarG image capture layer 1.2.182 version 1
VK_LAYER_NV_GPU_Trace_release_public_2022_1_1 NVIDIA Nsight Graphics GPU Trace interception layer 1.3.202 version 1
VK_LAYER_NV_nomad_release_public_2022_1_1 NVIDIA Nsight Graphics interception layer 1.3.202 version 1
VK_LAYER_NV_optimus NVIDIA Optimus layer 1.3.236 version 1
VK_LAYER_RENDERDOC_Capture Debugging capture layer for RenderDoc 1.3.131 version 18
VK_LAYER_VALVE_steam_fossilize Steam Pipeline Caching Layer 1.3.207 version 1
VK_LAYER_VALVE_steam_overlay Steam Overlay Layer 1.3.207 version 1
Devices:
========
GPU0:
apiVersion = 1.3.236
driverVersion = 531.61.0.0
vendorID = 0x10de
deviceID = 0x2484
deviceType = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
deviceName = NVIDIA GeForce RTX 3070
driverID = DRIVER_ID_NVIDIA_PROPRIETARY
driverName = NVIDIA
driverInfo = 531.61
conformanceVersion = 1.3.3.1
deviceUUID = 4a0c1b9d-6e15-fb16-a86d-34f7b0b034bb
driverUUID = 19d654b8-3b11-5754-9ade-bfbc8881d00e
[vulkaninfo (1).txt](https://github.com/golang/go/files/11399410/vulkaninfo.1.txt)
Attached is the full vulkaninfo for the AMD machine that repro'd the issue.
You may have to run it 4 or 5 times to reproduce, if you haven't done that
Hm, both of us have steam & epic overlay layers, it's possible they're at fault.
One of the linux users who tested for us had the steam overlay, but none had the epic overlay so this may not be an OS thing.
OK, I uninstalled the Epic launcher and can no longer repro, so the layers it installs are a piece of the puzzle here. I still think Go is part of the formula, because of the number of things you can do in Go that aren't visible to EOS and that don't repro. However, I'm definitely more willing to believe that Epic is doing something fundamentally wrong than Vulkan. There's a strong possibility that Windows isn't part of the formula either; I'm going to see what my options are for getting EOS installed on a non-Windows OS for testing.
OK, the layers are Windows-only, so this seems like it's probably not actually a problem with Go's Windows support, if it is a problem with Go at all. I'm going to do additional investigation to see if I can tease out the various possibilities.
Thanks. It sounds like Epic might do something that requires (possibly undocumented) things to run on the same thread, and you may want to actually use runtime.LockOSThread when interacting with it.
Well, no, that's not it either. As I mentioned in the task description, there are all kinds of ways you can interact on different threads that don't repro the issue. That's why this is so confusing. Epic can't know about some of the things that are required to repro the issue. If it's not an issue with Go, then Epic is modifying some Go memory, or maybe the system stack, in a way that causes issues but only when you do this exact set of things.
Does spinning up a new goroutine interact with the system stack at all? I'm kind of being drawn toward the "epic modifies the system stack" theory but I don't understand why you have to spin up goroutines after calling into epic for it to repro.
Other question: is there a good way to compare the system stack before vs. after a call into epic to verify whether epic is modifying it?
Sorry, you did indeed already answer both of my questions in the original bug. Do you have any example crashes to share?
Just throwing something out there: I wonder if maybe some Go-specific TLS data is being clobbered causing really weird crashes. (I think if such state were broken, you'd still eventually fail even with LockOSThread. And maybe that does happen, it just takes a lot longer to fail?)
I wonder if maybe some Go-specific TLS data is being clobbered causing really weird crashes. (I think if such state were broken, you'd still eventually fail even with LockOSThread. And maybe that does happen, it just takes a lot longer to fail?)
It might only be getting clobbered when you call from different threads or system stacks. But we end up with the same situation of "why do you have to spin up new goroutines, and only after the create call, for it to fail". It might be like you said, it just takes a whole lot longer?
You may have to run it 4 or 5 times to reproduce, if you haven't done that
Yep, I had it running several hours without failing. Well, now it's clear why: I don't have the offending extension. @CannibalVox do you know where I can get it?
If you install the epic launcher from https://store.epicgames.com/en-US/download it will install it as part of the update process. When you see a login modal, that should mean it's installed.
I'm afraid I can't reproduce the issue even though Epic overlay seems to be correctly installed.
@CannibalVox you have a bunch of other layers that I don't have. Could it be that Epic is interacting wrongly with one of those?
Well, I uninstalled everything but epic and it's still reproducing for me:
$ vulkaninfo --summary
==========
VULKANINFO
==========
Vulkan Instance Version: 1.3.236
Instance Extensions: count = 14
-------------------------------
VK_EXT_debug_report : extension revision 10
VK_EXT_debug_utils : extension revision 2
VK_EXT_swapchain_colorspace : extension revision 4
VK_KHR_device_group_creation : extension revision 1
VK_KHR_external_fence_capabilities : extension revision 1
VK_KHR_external_memory_capabilities : extension revision 1
VK_KHR_external_semaphore_capabilities : extension revision 1
VK_KHR_get_physical_device_properties2 : extension revision 2
VK_KHR_get_surface_capabilities2 : extension revision 1
VK_KHR_portability_enumeration : extension revision 1
VK_KHR_surface : extension revision 25
VK_KHR_surface_protected_capabilities : extension revision 1
VK_KHR_win32_surface : extension revision 6
VK_NV_external_memory_capabilities : extension revision 1
Instance Layers: count = 10
---------------------------
VK_LAYER_EOS_Overlay Vulkan overlay layer for Epic Online Services 1.2.136 version 1
VK_LAYER_EOS_Overlay Vulkan overlay layer for Epic Online Services 1.2.136 version 1
VK_LAYER_KHRONOS_synchronization2 Khronos Synchronization2 layer 1.2.182 version 1
VK_LAYER_KHRONOS_validation Khronos Validation Layer 1.2.182 version 1
VK_LAYER_LUNARG_api_dump LunarG API dump layer 1.2.182 version 2
VK_LAYER_LUNARG_device_simulation LunarG device simulation layer 1.2.182 version 1
VK_LAYER_LUNARG_gfxreconstruct GFXReconstruct Capture Layer Version 0.9.8 1.2.182 version 36872
VK_LAYER_LUNARG_monitor Execution Monitoring Layer 1.2.182 version 1
VK_LAYER_LUNARG_screenshot LunarG image capture layer 1.2.182 version 1
VK_LAYER_NV_optimus NVIDIA Optimus layer 1.3.236 version 1
Devices:
========
GPU0:
apiVersion = 1.3.236
driverVersion = 531.61.0.0
vendorID = 0x10de
deviceID = 0x2484
deviceType = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
deviceName = NVIDIA GeForce RTX 3070
driverID = DRIVER_ID_NVIDIA_PROPRIETARY
driverName = NVIDIA
driverInfo = 531.61
conformanceVersion = 1.3.3.1
deviceUUID = 4a0c1b9d-6e15-fb16-a86d-34f7b0b034bb
driverUUID = 19d654b8-3b11-5754-9ade-bfbc8881d00e
Also, I was previously asked for an example crash, so here it is. This happens when the program is exiting and is the most common form this issue takes: it tries to switch to the system stack to call the exit syscall and explodes with 0xc0000005 (access violation).
99
Exception 0xc0000005 0x8 0x7ff81e1c7370 0x7ff81e1c7370
PC=0x7ff81e1c7370
runtime: g 0: unknown pc 0x7ff81e1c7370
stack: frame={sp:0xe3f0fff768, fp:0x0} stack=[0x0,0xe3f0fffac0)
0x000000e3f0fff668: 0x0000000000000000 0x0000000000000000
0x000000e3f0fff678: 0x00007ff8662f39ce 0x0000000000000000
0x000000e3f0fff688: 0x0000000000000000 0x0000000000000000
0x000000e3f0fff698: 0x0000000000000000 0x0000000000000001
0x000000e3f0fff6a8: 0x0000025248c73748 0x00007ff8689bd2f0
0x000000e3f0fff6b8: 0x0000000000000000 0x000002524b0c26d0
0x000000e3f0fff6c8: 0x00007ff8688747b1 0x000000e3f0fff758
0x000000e3f0fff6d8: 0x000002521ebb0000 0x000000000000000c
0x000000e3f0fff6e8: 0x0000000000000000 0x0000000000000000
0x000000e3f0fff6f8: 0x00007ff8663e0e40 0x000002524b0c2e60
0x000000e3f0fff708: 0x00007ff8662ff05b 0x000002524b0c26d0
0x000000e3f0fff718: 0x0000000000000000 0x0000000000000000
0x000000e3f0fff728: 0x000000e3f0fff748 0x000000000000000c
0x000000e3f0fff738: 0x00007ff86630f96c 0x000002524b0c2a98
0x000000e3f0fff748: 0x0000000000000004 0x0000025200000004
0x000000e3f0fff758: 0x000000e3f0fff740 0x000002521ebbc438
0x000000e3f0fff768: <0x00007ff8688adee5 0x0000025200000000
0x000000e3f0fff778: 0x0000000000000014 0x0000000000000000
0x000000e3f0fff788: 0x0000000000000017 0x0000000000000000
0x000000e3f0fff798: 0x000000c000054000 0x0000000000000000
0x000000e3f0fff7a8: 0x00007ff8688adb9b 0x000000e3eeaab000
0x000000e3f0fff7b8: 0x000000c000115f30 0x000000c000115f60
0x000000e3f0fff7c8: 0x000000e3eea86000 0x0000000000000000
0x000000e3f0fff7d8: 0x0000000000000000 0x0000000000000000
0x000000e3f0fff7e8: 0x0000000000000000 0x00000000006a0068
0x000000e3f0fff7f8: 0x000002521ebb35b2 0x0000000000000000
0x000000e3f0fff808: 0x0000000000000000 0x0000000000000000
0x000000e3f0fff818: 0x0000000000000000 0x0000000000000000
0x000000e3f0fff828: 0x0000000000000000 0x0000000000000000
0x000000e3f0fff838: 0x0000000000000000 0x0000000000000000
0x000000e3f0fff848: 0x0000000000000000 0x0000000000000000
0x000000e3f0fff858: 0x0000000000000000 0x0000000000000000
runtime: g 0: unknown pc 0x7ff81e1c7370
stack: frame={sp:0xe3f0fff768, fp:0x0} stack=[0x0,0xe3f0fffac0)
0x000000e3f0fff668: 0x0000000000000000 0x0000000000000000
0x000000e3f0fff678: 0x00007ff8662f39ce 0x0000000000000000
0x000000e3f0fff688: 0x0000000000000000 0x0000000000000000
0x000000e3f0fff698: 0x0000000000000000 0x0000000000000001
0x000000e3f0fff6a8: 0x0000025248c73748 0x00007ff8689bd2f0
0x000000e3f0fff6b8: 0x0000000000000000 0x000002524b0c26d0
0x000000e3f0fff6c8: 0x00007ff8688747b1 0x000000e3f0fff758
0x000000e3f0fff6d8: 0x000002521ebb0000 0x000000000000000c
0x000000e3f0fff6e8: 0x0000000000000000 0x0000000000000000
0x000000e3f0fff6f8: 0x00007ff8663e0e40 0x000002524b0c2e60
0x000000e3f0fff708: 0x00007ff8662ff05b 0x000002524b0c26d0
0x000000e3f0fff718: 0x0000000000000000 0x0000000000000000
0x000000e3f0fff728: 0x000000e3f0fff748 0x000000000000000c
0x000000e3f0fff738: 0x00007ff86630f96c 0x000002524b0c2a98
0x000000e3f0fff748: 0x0000000000000004 0x0000025200000004
0x000000e3f0fff758: 0x000000e3f0fff740 0x000002521ebbc438
0x000000e3f0fff768: <0x00007ff8688adee5 0x0000025200000000
0x000000e3f0fff778: 0x0000000000000014 0x0000000000000000
0x000000e3f0fff788: 0x0000000000000017 0x0000000000000000
0x000000e3f0fff798: 0x000000c000054000 0x0000000000000000
0x000000e3f0fff7a8: 0x00007ff8688adb9b 0x000000e3eeaab000
0x000000e3f0fff7b8: 0x000000c000115f30 0x000000c000115f60
0x000000e3f0fff7c8: 0x000000e3eea86000 0x0000000000000000
0x000000e3f0fff7d8: 0x0000000000000000 0x0000000000000000
0x000000e3f0fff7e8: 0x0000000000000000 0x00000000006a0068
0x000000e3f0fff7f8: 0x000002521ebb35b2 0x0000000000000000
0x000000e3f0fff808: 0x0000000000000000 0x0000000000000000
0x000000e3f0fff818: 0x0000000000000000 0x0000000000000000
0x000000e3f0fff828: 0x0000000000000000 0x0000000000000000
0x000000e3f0fff838: 0x0000000000000000 0x0000000000000000
0x000000e3f0fff848: 0x0000000000000000 0x0000000000000000
0x000000e3f0fff858: 0x0000000000000000 0x0000000000000000
goroutine 1 [running]:
runtime.systemstack_switch()
C:/Program Files/Go/src/runtime/asm_amd64.s:463 fp=0xc000115f08 sp=0xc000115f00 pc=0x7ff6d79ac5e0
runtime.stdcall(0xc000115f60?)
C:/Program Files/Go/src/runtime/os_windows.go:1074 +0x85 fp=0xc000115f40 sp=0xc000115f08 pc=0x7ff6d79810a5
runtime.stdcall1(0x7ff6d79e5938, 0x0)
C:/Program Files/Go/src/runtime/os_windows.go:1095 +0x5c fp=0xc000115f58 sp=0xc000115f40 pc=0x7ff6d79811bc
runtime.exit(0x0)
C:/Program Files/Go/src/runtime/os_windows.go:703 +0x49 fp=0xc000115f80 sp=0xc000115f58 pc=0x7ff6d797fee9
runtime.main()
C:/Program Files/Go/src/runtime/proc.go:274 +0x253 fp=0xc000115fe0 sp=0xc000115f80 pc=0x7ff6d7986693
runtime.goexit()
C:/Program Files/Go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000115fe8 sp=0xc000115fe0 pc=0x7ff6d79ae8e1
goroutine 2 [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
C:/Program Files/Go/src/runtime/proc.go:381 +0xd6 fp=0xc000057fb0 sp=0xc000057f90 pc=0x7ff6d7986a56
runtime.goparkunlock(...)
C:/Program Files/Go/src/runtime/proc.go:387
runtime.forcegchelper()
C:/Program Files/Go/src/runtime/proc.go:305 +0xb2 fp=0xc000057fe0 sp=0xc000057fb0 pc=0x7ff6d7986872
runtime.goexit()
C:/Program Files/Go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000057fe8 sp=0xc000057fe0 pc=0x7ff6d79ae8e1
created by runtime.init.6
C:/Program Files/Go/src/runtime/proc.go:293 +0x25
goroutine 3 [GC sweep wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
C:/Program Files/Go/src/runtime/proc.go:381 +0xd6 fp=0xc000059f80 sp=0xc000059f60 pc=0x7ff6d7986a56
runtime.goparkunlock(...)
C:/Program Files/Go/src/runtime/proc.go:387
runtime.bgsweep(0x0?)
C:/Program Files/Go/src/runtime/mgcsweep.go:278 +0x8e fp=0xc000059fc8 sp=0xc000059f80 pc=0x7ff6d79717ae
runtime.gcenable.func1()
C:/Program Files/Go/src/runtime/mgc.go:178 +0x26 fp=0xc000059fe0 sp=0xc000059fc8 pc=0x7ff6d7966a66
runtime.goexit()
C:/Program Files/Go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000059fe8 sp=0xc000059fe0 pc=0x7ff6d79ae8e1
created by runtime.gcenable
C:/Program Files/Go/src/runtime/mgc.go:178 +0x6b
goroutine 4 [GC scavenge wait]:
runtime.gopark(0xc00001a0e0?, 0x7ff6d7a3a8b8?, 0x1?, 0x0?, 0x0?)
C:/Program Files/Go/src/runtime/proc.go:381 +0xd6 fp=0xc000069f70 sp=0xc000069f50 pc=0x7ff6d7986a56
runtime.goparkunlock(...)
C:/Program Files/Go/src/runtime/proc.go:387
runtime.(*scavengerState).park(0x7ff6d7aa3ea0)
C:/Program Files/Go/src/runtime/mgcscavenge.go:400 +0x53 fp=0xc000069fa0 sp=0xc000069f70 pc=0x7ff6d796f6d3
runtime.bgscavenge(0x0?)
C:/Program Files/Go/src/runtime/mgcscavenge.go:628 +0x45 fp=0xc000069fc8 sp=0xc000069fa0 pc=0x7ff6d796fcc5
runtime.gcenable.func2()
C:/Program Files/Go/src/runtime/mgc.go:179 +0x26 fp=0xc000069fe0 sp=0xc000069fc8 pc=0x7ff6d7966a06
runtime.goexit()
C:/Program Files/Go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000069fe8 sp=0xc000069fe0 pc=0x7ff6d79ae8e1
created by runtime.gcenable
C:/Program Files/Go/src/runtime/mgc.go:179 +0xaa
goroutine 5 [finalizer wait]:
runtime.gopark(0x1a0?, 0x7ff6d79e9e48?, 0xa0?, 0x4e?, 0xc00005bf70?)
C:/Program Files/Go/src/runtime/proc.go:381 +0xd6 fp=0xc00005be28 sp=0xc00005be08 pc=0x7ff6d7986a56
runtime.runfinq()
C:/Program Files/Go/src/runtime/mfinal.go:193 +0x107 fp=0xc00005bfe0 sp=0xc00005be28 pc=0x7ff6d7965ac7
runtime.goexit()
C:/Program Files/Go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc00005bfe8 sp=0xc00005bfe0 pc=0x7ff6d79ae8e1
created by runtime.createfing
C:/Program Files/Go/src/runtime/mfinal.go:163 +0x45
rax 0x7ff81e1c7370
rbx 0x2524b0c6e30
rcx 0x2524b9a0000
rdi 0x2521ebbc438
rsi 0x17
rbp 0x9
rsp 0xe3f0fff768
r8 0xffffffff
r9 0x1
r10 0x0
r11 0xe3f0fff6a0
r12 0x0
r13 0x7ff8689bd2f0
r14 0x25248c73760
r15 0x1
rip 0x7ff81e1c7370
rflags 0x10206
cs 0x33
fs 0x53
gs 0x2b
@qmuntal I'm running via mingw, maybe that's part of the equation, I assume you're using powershell?
Yes, I do use powershell. My mingw version is pretty new (see below). Which one do you use?
$ gcc --version
gcc.exe (MinGW-W64 x86_64-ucrt-posix-seh, built by Brecht Sanders) 12.2.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ gcc --version
gcc.exe (Rev5, Built by MSYS2 project) 10.3.0
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
it tries to switch to the system stack to call the exitcode syscall and it explodes with 0xc0000005 (access violation)
Does your program register any callbacks that are called when the program exits?
My code does not, but I can't be certain that the Epic overlay isn't doing that.
Timed out in state WaitingForInfo. Closing.
(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)
Why was this in "waiting for info"? I provided everything I was asked for. @qmuntal @cherrymui
Apologies, that looks like a mistake to me.
What version of Go are you using (go version)?
I also tried this with 1.8, 1.12, and 1.18
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
go env Output
What did you do?
Ran this program a few times in a row: https://github.com/CannibalVox/heapcorruptrepro
What did you expect to see?
Successful completion
What did you see instead?
It often succeeds, but it fails maybe a third to half of the time on Windows, and no other operating system. The nature of the failure is different at different times. Often, I'll see crashes on exit when attempting to run the exit syscall because the system stack has been corrupted. Sometimes the program will exit prematurely with exit code 0xc0000374 (corrupted heap). At other times, I will see access violation panics when calling C methods.
Because of the involvement of Vulkan (and I can't tell exactly what aspect of Vulkan triggers the issue, so I can't reproduce it with a different library), it's easy to point the finger at Vulkan. However, I do not believe Vulkan per se is responsible:
This issue will only repro if the following four things all happen on the same goroutine. Moving any of them elsewhere or doing them in a different order works properly.
This is not simply a case of Vulkan using thread context (for one thing, it doesn't): we can perform create and destroy operations on arbitrary goroutines all day long if we want to, as long as we don't follow the above instructions to the letter. Likewise, we can do the above on Linux without difficulty.
Here are the 5 scenarios that I was able to try:
SUCCEED - Ubuntu 22.04.2 LTS - GeForce RTX 4090 v525.105
SUCCEED - Ubuntu 22.10 - Intel(R) UHD Graphics v22.2.5
SUCCEED - Ubuntu 22.10 - GeForce RTX 3070 v525.105.17
FAIL - Windows 10 - GeForce RTX 3070 v531.61
FAIL - Windows 10 - Radeon 6800M v21.20.01.24
It's difficult to reproduce (given that I can only figure out how to trigger it with Vulkan), but I believe this is an issue with the Go runtime. Vulkan would not be able to tell the difference between one goroutine on thread A creating objects while another goroutine on thread B destroys them, versus one goroutine creating objects, switching to thread B, and then destroying them. And it certainly can't tell whether Go has spun up goroutines performing unrelated tasks between the two points. I'm concerned that this may indicate deeper issues with cgo on Windows in the Go runtime.