rust-windowing / winit

Window handling library in pure Rust
https://docs.rs/winit/
Apache License 2.0

crash in NSView autorelease #3090

Closed lunixbochs closed 4 months ago

lunixbochs commented 1 year ago

On a build of Talon with this winit branch: https://github.com/talonvoice/winit/commits/0.28 and macOS 13.5 (arm64), a user reported this crash:

Exception Type:        EXC_BAD_ACCESS (SIGSEGV)
Exception Codes:       KERN_INVALID_ADDRESS at 0x00003c72007685d0
Exception Codes:       0x0000000000000001, 0x00003c72007685d0

Termination Reason:    Namespace SIGNAL, Code 11 Segmentation fault: 11
Terminating Process:   exc handler [19725]

VM Region Info: 0x3c72007685d0 is not in any region.  Bytes after previous region: 65979295368657  Bytes before following region: 39092784560688
      REGION TYPE                    START - END         [ VSIZE] PRT/MAX SHRMOD  REGION DETAIL
      commpage (reserved)        1000000000-7000000000   [384.0G] ---/--- SM=NUL  ...(unallocated)
--->  GAP OF 0x5f9000000000 BYTES
      MALLOC_NANO              600000000000-600008000000 [128.0M] rw-/rwx SM=PRV  

Kernel Triage:
VM - (arg = 0x0) pmap_enter retried due to resource shortage

Thread 0 Crashed::  Dispatch queue: com.apple.main-thread
0   libobjc.A.dylib                        0x19dabdc20 objc_msgSend + 32
1   AppKit                                 0x1a119ce30 -[NSView _finalize] + 300
2   AppKit                                 0x1a119cbec -[NSView dealloc] + 128
3   libobjc.A.dylib                        0x19dac40b4 AutoreleasePoolPage::releaseUntil(objc_object**) + 196
4   libobjc.A.dylib                        0x19dac0b7c objc_autoreleasePoolPop + 256
5   CoreFoundation                         0x19def659c _CFAutoreleasePoolPop + 32
6   CoreFoundation                         0x19e009c40 __CFRunLoopPerCalloutARPEnd + 48
7   CoreFoundation                         0x19df35904 __CFRunLoopDoObservers + 572
8   CoreFoundation                         0x19df35010 __CFRunLoopRun + 1028
9   CoreFoundation                         0x19df344b8 CFRunLoopRunSpecific + 612
10  HIToolbox                              0x1a7786df0 RunCurrentEventLoopInMode + 292
11  HIToolbox                              0x1a7786c2c ReceiveNextEventCommon + 648
12  HIToolbox                              0x1a7786984 _BlockUntilNextEventMatchingListInModeWithFilter + 76
13  AppKit                                 0x1a115b97c _DPSNextEvent + 636
14  AppKit                                 0x1a115ab18 -[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 716
15  AppKit                                 0x1a114ef7c -[NSApplication run] + 464
16  Talon                                  0x104d199a0 0x104b78000 + 1710496
17  dyld                                   0x19dafff28 start + 2236

To my knowledge the only NSViews being created in my app are by winit and softbuffer. I also use glutin which seems to interact with NSViews via icrate, but it doesn't seem to create an NSView itself.

I interact with winit from a single thread only. This crash happened repeatedly for the user, and stopped after a macOS restart. They have a monitor connected via a displaylink dock.

I don't remember the specifics, but iirc objc sometimes messes with pointer bits after freeing objects, which could explain the weird pointer?

cc @madsmtm not sure if you have any ideas

madsmtm commented 1 year ago

> On a build of Talon with this winit branch: https://github.com/talonvoice/winit/commits/0.28 and macOS 13.5 (arm64), a user reported this crash

Was the user running macOS 13.5, or was that only where the build happened?

> I also use glutin

Are you using both softbuffer and glutin in the same application? May I ask why?

> I don't remember the specifics, but iirc objc sometimes messes with pointer bits after freeing objects, which could explain the weird pointer?

Hmm, not that I know of? If anything, it'll just be that since the object is freed, the space would've been reclaimed by some other part of the system, and hence the pointer would be some other unrelated data, not a pointer any more.

> This crash happened repeatedly for the user, and stopped after a macOS restart.

Yikes! That makes it basically impossible to debug, as we can no longer reproduce the issue; otherwise I'd have suggested rerunning with malloc scribbling and such enabled:

DYLD_INSERT_LIBRARIES=/usr/lib/libgmalloc.dylib MallocStackLogging=YES NSZombieEnabled=YES MallocGuardEdges=YES MallocScribble=YES ./target/debug/my_binary

Honestly though, this sounds more like some weird macOS bug?

lunixbochs commented 1 year ago

> Was the user using macOS 13.5

Yes

> Are you using both softbuffer and glutin in the same application? May I ask why?

Maybe 1% of my users don't have a working gpu/driver/opengl, so I automatically fall back to software rendering. This happens sometimes on macos with displaylink monitors, on windows if their gpu driver is broken, and honestly just a lot of Linux users are running a weird env with no gpu. (This is more work than you think! I render with Skia, even for egui, and fall back to Skia's software renderer. I also need to manually create my opengl context in a way where I can recover when the user doesn't have it. I have Metal support too, but it's disabled because it still has memory leaks rust-side I haven't tracked down since the rust port. Softbuffer also unfortunately doesn't support the alpha channel yet, which I use)
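The fallback described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Talon's actual code: `Backend`, `GpuUnavailable`, and `pick_backend` are hypothetical names, and the `try_gl` closure stands in for the real context creation (for example through glutin), which is not shown.

```rust
use std::fmt;

/// Which rendering path ended up being used (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    OpenGl,
    Software,
}

/// Hypothetical error returned when a GPU context cannot be created.
#[derive(Debug)]
pub struct GpuUnavailable(pub String);

impl fmt::Display for GpuUnavailable {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "GPU unavailable: {}", self.0)
    }
}

impl std::error::Error for GpuUnavailable {}

/// Try the GPU path first; on any error, fall back to software rendering
/// instead of panicking. The error is returned alongside the chosen backend
/// so the caller can log why the fallback happened.
pub fn pick_backend<F>(try_gl: F) -> (Backend, Option<GpuUnavailable>)
where
    F: FnOnce() -> Result<(), GpuUnavailable>,
{
    match try_gl() {
        Ok(()) => (Backend::OpenGl, None),
        Err(err) => (Backend::Software, Some(err)),
    }
}
```

Calling `pick_backend(|| Err(GpuUnavailable("no driver".into())))` yields `Backend::Software` together with the error. The key design point is that context creation has to report failure as a value; if it panics or aborts, there is nothing to recover from.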

> hence the pointer would be some other unrelated data, not a pointer any more

Ah yeah if that's the case I'll send them an ASAN build if it comes up again.

> rerunning with malloc scribbling and such enabled

Thanks, I'll do this next time it pops back up. My guess is that if I can't repro, another user will hit it in 1-3 months.

> Honestly though, this sounds more like some weird macOS bug?

My gut feeling is that it's not a macos bug. I've run into this sort of thing (very rarely) before the rust port and it was usually due to a misunderstanding about the objc reference counting or an unexpected interaction, e.g. incorrect usage of autoreleasepool.

Even bigger complication is this app is an accessibility client, and can look at its own UI, which can very much confuse UI frameworks due to the way AppKit calls back to itself from a deep call stack in the same thread (though I don't think the user was doing that). Talon is really an edge case factory. I've invested heavily in porting from Qt to Rust because I want more language level guarantees. It's gone well so far besides expect() calls in crates that should have been fallible (the app hard exits, which is really disconcerting for a user who is using it as primary input instead of a keyboard/mouse)
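To illustrate the `expect()` point in general terms, here is a small sketch contrasting a panicking API with a fallible one. The function names and the "scale factor" input are hypothetical and not taken from any crate mentioned in this thread.

```rust
use std::num::ParseIntError;

/// Panicking variant: bad input takes down the whole process.
pub fn parse_scale_or_panic(s: &str) -> u32 {
    s.trim().parse().expect("scale factor must be an integer")
}

/// Fallible variant: the error is handed back and the caller decides
/// how to recover.
pub fn parse_scale(s: &str) -> Result<u32, ParseIntError> {
    s.trim().parse()
}

/// Example caller that keeps running with a default instead of exiting.
pub fn scale_or_default(s: &str) -> u32 {
    parse_scale(s).unwrap_or(1)
}
```

With the fallible variant, an application that serves as someone's primary input method can log the error and carry on with a default, which is not possible once a dependency has already panicked.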

My best guess is that it was caused by DisplayLink, which is basically the number 1 reason people with perfectly good machines+drivers mysteriously don't have a working gpu. It may also explain why a reboot fixed it: their dock supports both DisplayLink and Alt Mode, so my guess is that it switched between them inadvertently. DisplayLink is rare enough that it could explain why more of my users haven't hit it yet. I'll grab a DisplayLink adapter and try to repro.

madsmtm commented 4 months ago

I'm going to close this. I believe we're fairly good at doing reference-counting nowadays (both in Winit and Softbuffer), and since the crash isn't reproducible, it's not really actionable from our side. Feel free to re-open if your user hits this again!