While investigating some intermittent deadlocks that were appearing across multiple processes in an Objective-C based suite of software I work on, I ended up finding a lock order reversal issue in the objc_send_initialize() locking flow.
The Stack Trace
The first step in the investigation was gathering stack traces from the affected processes. Once I got full debug symbols into the build, I noticed they all shared a similar signature:
Thread A:
#0 0x0000ffff9307cca4 in __GI___clock_nanosleep (..trimmed..) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
#1 0x0000ffff9308228c in __GI___nanosleep (..trimmed..) at nanosleep.c:27
#2 0x0000ffff93082164 in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#3 0x0000ffff9322a9d8 in lock_spinlock (spinlock=0xffff9325622c <spinlocks+3964>) at /libobjc2/9d790b4118/git/spinlock.h:77
#4 referenceListForObject (object=0xffff9391cc90 <._OBJC_METACLASS_GSFileHandle>, create=<optimized out>) at /libobjc2/9d790b4118/git/associate.m:294
#5 0x0000ffff9322aebc in objc_sync_enter (object=0x0) at libobjc2/9d790b4118/git/associate.m:398
#6 0x0000ffff9321b6d0 in objc_send_initialize (object=<optimized out>) at libobjc2/9d790b4118/git/dtable.c:711
#7 0x0000ffff93223b2c in objc_msg_lookup_internal (receiver=0xffffef18e1e8, selector=0xffff939485f0 <objc_selector_class_#160:8>, version=0x0) at /libobjc2/9d790b4118/git/sendmsg2.c:107
#8 objc_msg_lookup_sender (receiver=0xffffef18e1e8, selector=0xffff939485f0 <objc_selector_class_#160:8>, sender=<optimized out>) at /libobjc2/9d790b4118/git/sendmsg2.c:200
#9 0x0000ffff9368f56c in +[NSFileHandle initialize] (self=0xffff938e1b60 <._OBJC_CLASS_NSFileHandle>, _cmd=<optimized out>) at NSFileHandle.m:95
... proprietary code ...
#20 0x0000aaaabc9ff818 in main (argc=<optimized out>, argv=<optimized out>) at main.m:58
Thread B:
#0 __lll_lock_wait (futex=0xffff93255020 <runtime_mutex>, private=0) at lowlevellock.c:52
#1 0x0000ffff932619f0 in __GI___pthread_mutex_lock (mutex=0xffff93255020 <runtime_mutex>) at pthread_mutex_lock.c:115
#2 0x0000ffff9322b288 in allocateHiddenClass (superclass=0xffff938c45f0 <._OBJC_CLASS_GSMutableDictionary>) at /libobjc2/9d790b4118/git/associate.m:222
#3 initHiddenClassForObject (obj=0xffff8000dc48) at /libobjc2/9d790b4118/git/associate.m:231
#4 0x0000ffff9322aac0 in referenceListForObject (object=0xffff8000dc48, create=<optimized out>) at /libobjc2/9d790b4118/git/associate.m:317
#5 0x0000ffff9322aebc in objc_sync_enter (object=0xffff93255020 <runtime_mutex>) at /libobjc2/9d790b4118/git/associate.m:398
#6 0x0000ffff943bd260 in -[proprietaryClass methodThatInvokesAPropertyAccessor] ...
... proprietary code ...
#13 0x0000ffff937aaafc in nsthreadLauncher (thread=<optimized out>) at NSThread.m:1327
#14 0x0000ffff9325f478 in start_thread (arg=0xffffef18e4f6) at pthread_create.c:477
#15 0x0000ffff930ae75c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
(note: this is simplified, with all proprietary classes/methods removed. The full trace includes many more threads, most of them in a state similar to 'Thread B')
The Deadlock
Thread A:
objc_send_initialize() holds the runtime lock, then tries to acquire the object lock on the metaclass @ dtable.c:711
referenceListForObject() needs to initialize the mutex for the new metaclass, so it tries to lock_spinlock()
Thread B:
referenceListForObject() holds a spinlock for an unrelated object
while running initHiddenClassForObject() -> allocateHiddenClass() (called from associate.m:317), it tries to acquire the runtime lock.
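If the hash of the metaclass pointer in 'Thread A' collides with the hash of the object pointer in 'Thread B', both threads map to the same spinlock. Thread A then waits on the spinlock that Thread B already holds, while Thread B waits on the runtime lock that Thread A holds. Neither can make progress, so the runtime lock is never released, and the deadlock cascades across many, if not all, threads in the process.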
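To make the ordering concrete, here is a minimal sketch of the two acquisition paths. It is not libobjc2's code: `runtime_lock`, `slot_locks`, `SLOT_COUNT`, `slot_for()` and `would_deadlock()` are hypothetical names, the hash is an assumed one, and the spinlocks are modelled as plain pthread mutexes. Both paths are replayed on a single thread with `pthread_mutex_trylock()` so the reversal can be observed without actually hanging.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins: the names, table size and hash are assumptions. */
#define SLOT_COUNT 1024

static pthread_mutex_t runtime_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t slot_locks[SLOT_COUNT];
static pthread_once_t slots_once = PTHREAD_ONCE_INIT;

static void init_slots(void)
{
    for (size_t i = 0; i < SLOT_COUNT; i++) {
        pthread_mutex_init(&slot_locks[i], NULL);
    }
}

/* Assumed hash: any many-to-one pointer-to-slot mapping shows the problem. */
static size_t slot_for(const void *obj)
{
    return (size_t)(((uintptr_t)obj >> 4) % SLOT_COUNT);
}

/*
 * Replays the interleaving from the traces and reports whether both
 * "threads" would end up blocked on a lock the other one holds.
 */
static bool would_deadlock(const void *metaclass, const void *other_obj)
{
    pthread_once(&slots_once, init_slots);
    pthread_mutex_t *slot_a = &slot_locks[slot_for(metaclass)];
    pthread_mutex_t *slot_b = &slot_locks[slot_for(other_obj)];

    /* Thread A, step 1: take the runtime lock. */
    int rc = pthread_mutex_lock(&runtime_lock);
    assert(rc == 0);
    /* Thread B, step 1: take the slot lock for its own object. */
    rc = pthread_mutex_lock(slot_b);
    assert(rc == 0);
    (void)rc;

    /* Thread A, step 2: wants the metaclass's slot lock. */
    int a_rc = pthread_mutex_trylock(slot_a);
    if (a_rc == 0) {
        pthread_mutex_unlock(slot_a); /* different slot: A proceeds */
    }
    /* Thread B, step 2: wants the runtime lock, which A still holds. */
    int b_rc = pthread_mutex_trylock(&runtime_lock);

    pthread_mutex_unlock(slot_b);
    pthread_mutex_unlock(&runtime_lock);

    /* Deadlock only when each side is stuck on the other's lock. */
    return a_rc == EBUSY && b_rc == EBUSY;
}
```

With this assumed hash, two pointers that differ by `16 * SLOT_COUNT` bytes share a slot, so `would_deadlock()` returns true for them, while pointers in different slots let Thread A finish and release the runtime lock. This would also explain why the hang is intermittent: it needs both the unlucky interleaving and a slot collision.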
A Potential Fix
This looks like a fairly straightforward lock order reversal issue, so I made an attempt to resolve it (see details here).
A script that previously reproduced this issue within ~10-100 restarts of the software suite I work on has now run for well over 5k restart cycles with this change without hitting the deadlock. I will be posting a PR with this potential fix shortly.