rare deadlock during `objc_send_initialize()`

ERobsham commented 1 year ago

While investigating some intermittent deadlocks that were appearing across multiple processes in an Objective-C based suite of software I work on, I ended up finding a lock order reversal issue in the objc_send_initialize() locking flow.

The Stack Trace

First step in the investigation was gathering stack traces from the different processes, and once I got full debug symbols into the build, I noticed they all shared a similar signature:

Thread A:
#0  0x0000ffff9307cca4 in __GI___clock_nanosleep (..trimmed..) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
#1  0x0000ffff9308228c in __GI___nanosleep (..trimmed..) at nanosleep.c:27
#2  0x0000ffff93082164 in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#3  0x0000ffff9322a9d8 in lock_spinlock (spinlock=0xffff9325622c <spinlocks+3964>) at /libobjc2/9d790b4118/git/spinlock.h:77
#4  referenceListForObject (object=0xffff9391cc90 <._OBJC_METACLASS_GSFileHandle>, create=<optimized out>) at /libobjc2/9d790b4118/git/associate.m:294
#5  0x0000ffff9322aebc in objc_sync_enter (object=0x0) at libobjc2/9d790b4118/git/associate.m:398
#6  0x0000ffff9321b6d0 in objc_send_initialize (object=<optimized out>) at libobjc2/9d790b4118/git/dtable.c:711
#7  0x0000ffff93223b2c in objc_msg_lookup_internal (receiver=0xffffef18e1e8, selector=0xffff939485f0 <objc_selector_class_#160:8>, version=0x0) at /libobjc2/9d790b4118/git/sendmsg2.c:107
#8  objc_msg_lookup_sender (receiver=0xffffef18e1e8, selector=0xffff939485f0 <objc_selector_class_#160:8>, sender=<optimized out>) at /libobjc2/9d790b4118/git/sendmsg2.c:200
#9  0x0000ffff9368f56c in +[NSFileHandle initialize] (self=0xffff938e1b60 <._OBJC_CLASS_NSFileHandle>, _cmd=<optimized out>) at NSFileHandle.m:95
... propritary code ...
#20 0x0000aaaabc9ff818 in main (argc=<optimized out>, argv=<optimized out>) at main.m:58

Thread B:
#0  __lll_lock_wait (futex=0xffff93255020 <runtime_mutex>, private=0) at lowlevellock.c:52
#1  0x0000ffff932619f0 in __GI___pthread_mutex_lock (mutex=0xffff93255020 <runtime_mutex>) at pthread_mutex_lock.c:115
#2  0x0000ffff9322b288 in allocateHiddenClass (superclass=0xffff938c45f0 <._OBJC_CLASS_GSMutableDictionary>) at /libobjc2/9d790b4118/git/associate.m:222
#3  initHiddenClassForObject (obj=0xffff8000dc48) at /libobjc2/9d790b4118/git/associate.m:231
#4  0x0000ffff9322aac0 in referenceListForObject (object=0xffff8000dc48, create=<optimized out>) at /libobjc2/9d790b4118/git/associate.m:317
#5  0x0000ffff9322aebc in objc_sync_enter (object=0xffff93255020 <runtime_mutex>) at /libobjc2/9d790b4118/git/associate.m:398
#6  0x0000ffff943bd260 in -[propritaryClass methodThatInvokesAPropertyAccessor] ...
... propritary code ...
#13 0x0000ffff937aaafc in nsthreadLauncher (thread=<optimized out>) at NSThread.m:1327
#14 0x0000ffff9325f478 in start_thread (arg=0xffffef18e4f6) at pthread_create.c:477
#15 0x0000ffff930ae75c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

(note: this is simplified with all proprietary classes/methods removed. The full trace includes may more threads, with most in a similar state as 'Thread B')

The Deadlock

Thread A:

objc_send_initialize() holds the runtime lock, then tries to acquire the object lock on the metaclass @ dtable.c:711
referenceListForObject() needs to initialize the mutex for the new metaclass, so it tries to lock_spinlock()

Thread B:

referenceListForObject() holds a spinlock for an unrelated object
while running initHiddenClassForObject() -> allocateHiddenClass() @ associate.m:317, we try to acquire the runtime lock.

If the metaclass object pointer hash in 'Thread A' collides with hash for the object pointer in thread B which already has the spinlock, the runtime lock ends up deadlocked forever, causing a cascade deadlocking across many, if not all threads in a process.

A Potential Fix

This seems like it is a fairly straight forward lock order reversal issue, so I made an attempt to resolve this (see details here).

After letting a script run that was previously reproducing this issue after ~10-100 restarts of the software suite I work on, it has reached well over 5k restart cycles without hitting this deadlock while running this change. Will be posting a PR with this potential fix shortly.

davidchisnall commented 1 year ago

Thanks, that's a great report and the fix looks almost right.

ERobsham commented 1 year ago

Closing this issue now where the fix is merged.

gnustep / libobjc2