realm / realm-core

Core database component for the Realm Mobile Database SDKs
https://realm.io
Apache License 2.0

Crash due to alignment exception below SharedGroup::do_open on 32-bit iOS devices #1558

Closed bdash closed 8 years ago

bdash commented 8 years ago

With core v0.97.0, released late last week as part of Realm Cocoa v0.98.4, users are seeing crashes below SharedGroup::do_open on 32-bit iOS devices. This was reported to us in realm/realm-cocoa#3321.

The exception looks like this:

Exception Type:  EXC_BAD_ACCESS (SIGBUS)
Exception Subtype: EXC_ARM_DA_ALIGN at 0x0000000002e00114
Triggered by Thread:  0

The backtrace looks like this:

* thread #1: tid = 0x15f283, 0x340381a0 libsystem_platform.dylib`OSAtomicCompareAndSwap64Barrier + 8, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_ARM_DA_ALIGN, address=0x776c09c)
    frame #0: 0x340381a0 libsystem_platform.dylib`OSAtomicCompareAndSwap64Barrier + 8
    frame #1: 0x3403bf1e libsystem_pthread.dylib`_pthread_mutex_lock + 210
    frame #2: 0x007cb5ec Realm`realm::util::RobustMutex::is_valid() + 84
    frame #3: 0x0088144c Realm`realm::SharedGroup::do_open(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, bool, realm::SharedGroup::DurabilityLevel, bool, char const*, bool) + 900
  * frame #4: 0x007b2834 Realm`realm::SharedGroup::open(this=0x163b3c00, repl=0x15dbe080, durability=durability_Full, encryption_key=0x00000000, allow_file_format_upgrade=true) + 284 at group_shared.hpp:810
    frame #5: 0x007b25ba Realm`realm::SharedGroup::SharedGroup(this=0x163b3c00, repl=0x15dbe080, durability=durability_Full, encryption_key=0x00000000, allow_file_format_upgrade=true) + 514 at group_shared.hpp:779
    frame #6: 0x007b23ae Realm`realm::SharedGroup::SharedGroup(this=0x163b3c00, repl=0x15dbe080, durability=durability_Full, encryption_key=0x00000000, allow_file_format_upgrade=true) + 62 at group_shared.hpp:778
    frame #7: 0x007acdee Realm`realm::Realm::open_with_config(realm::Realm::Config const&, std::__1::unique_ptr<realm::Replication, std::__1::default_delete<realm::Replication> >&, std::__1::unique_ptr<realm::SharedGroup, std::__1::default_delete<realm::SharedGroup> >&, std::__1::unique_ptr<realm::Group, std::__1::default_delete<realm::Group> >&) [inlined] std::__1::__unique_if<realm::SharedGroup>::__unique_single std::__1::make_unique<realm::SharedGroup, realm::Replication&, realm::SharedGroup::DurabilityLevel&, char const*, bool>(__args=0x15dbe080, __args=0x003fc4ac, __args=0x003fc4a4, __args=0x003fc4a3) + 100 at memory:3075
    frame #8: 0x007acd8a Realm`realm::Realm::open_with_config(config=0x15dbdfd8, history=0x15dbe020, shared_group=0x15dbe024, read_only_group=0x15dbe028) + 4034 at shared_realm.cpp:86
    frame #9: 0x007abba2 Realm`realm::Realm::Realm(this=0x15dbdfd0) + 470 at shared_realm.cpp:66
    frame #10: 0x007ad424 Realm`realm::Realm::Realm(this=0x15dbdfd0, config=<unavailable>) + 16 at shared_realm.cpp:65
    frame #11: 0x006a8a66 Realm`std::__1::shared_ptr<realm::Realm> std::__1::shared_ptr<realm::Realm>::make_shared<realm::Realm::Config>(realm::Realm::Config&&) [inlined] std::__1::__libcpp_compressed_pair_imp<std::__1::allocator<realm::Realm>, realm::Realm, 1u>::__libcpp_compressed_pair_imp<std::__1::allocator<realm::Realm>&, realm::Realm::Config&&, 0ul, 0ul>(this=0x15dbdfd0, __pc=piecewise_construct_t @ 0x003fd080, __first_args=tuple<std::__1::allocator<realm::Realm> &> @ 0x003fd07c, __second_args=tuple<realm::Realm::Config &&> @ 0x003fd078, (null)=__tuple_indices<0> @ 0x003fd074, (null)=__tuple_indices<0> @ 0x003fd070) + 144 at memory:2128
    frame #12: 0x006a89d6 Realm`std::__1::shared_ptr<realm::Realm> std::__1::shared_ptr<realm::Realm>::make_shared<realm::Realm::Config>(realm::Realm::Config&&) [inlined] std::__1::__compressed_pair<std::__1::allocator<realm::Realm>, realm::Realm>::__compressed_pair<std::__1::allocator<realm::Realm>&, realm::Realm::Config&&>(this=0x15dbdfd0, __pc=piecewise_construct_t @ 0x003fd05c, __first_args=tuple<std::__1::allocator<realm::Realm> &> @ 0x003fd058, __second_args=tuple<realm::Realm::Config &&> @ 0x003fd054) + 48 at memory:2391
    frame #13: 0x006a89a6 Realm`std::__1::shared_ptr<realm::Realm> std::__1::shared_ptr<realm::Realm>::make_shared<realm::Realm::Config>(realm::Realm::Config&&) [inlined] std::__1::__compressed_pair<std::__1::allocator<realm::Realm>, realm::Realm>::__compressed_pair<std::__1::allocator<realm::Realm>&, realm::Realm::Config&&>(this=0x15dbdfd0, __pc=piecewise_construct_t @ 0x003fd038, __first_args=tuple<std::__1::allocator<realm::Realm> &> @ 0x003fd034, __second_args=tuple<realm::Realm::Config &&> @ 0x003fd030) + 24 at memory:2394
    frame #14: 0x006a898e Realm`std::__1::shared_ptr<realm::Realm> std::__1::shared_ptr<realm::Realm>::make_shared<realm::Realm::Config>(realm::Realm::Config&&) [inlined] std::__1::__shared_ptr_emplace<realm::Realm, std::__1::allocator<realm::Realm> >::__shared_ptr_emplace<realm::Realm::Config>(this=0x15dbdfc0, __a=allocator<realm::Realm> @ 0x003fd018, __args=0x003fd8d8) + 330 at memory:3732
    frame #15: 0x006a8844 Realm`std::__1::shared_ptr<realm::Realm> std::__1::shared_ptr<realm::Realm>::make_shared<realm::Realm::Config>(realm::Realm::Config&&) [inlined] std::__1::__shared_ptr_emplace<realm::Realm, std::__1::allocator<realm::Realm> >::__shared_ptr_emplace<realm::Realm::Config>(this=0x15dbdfc0, __a=allocator<realm::Realm> @ 0x003fcff8, __args=0x003fd8d8) + 16 at memory:3733
    frame #16: 0x006a8834 Realm`std::__1::shared_ptr<realm::Realm> std::__1::shared_ptr<realm::Realm>::make_shared<realm::Realm::Config>(__args=0x003fd8d8) + 348 at memory:4308
    frame #17: 0x0069ca58 Realm`realm::_impl::RealmCoordinator::get_realm(realm::Realm::Config) [inlined] std::__1::enable_if<!(is_array<realm::Realm>::value), std::__1::shared_ptr<realm::Realm> >::type std::__1::make_shared<realm::Realm, realm::Realm::Config>(__args=0x003fd8d8) + 22 at memory:4672
    frame #18: 0x0069ca42 Realm`realm::_impl::RealmCoordinator::get_realm(this=0x15e98df0, config=<unavailable>) + 2550 at realm_coordinator.cpp:116
    frame #19: 0x007aebe8 Realm`realm::Realm::get_shared_realm(realm::Realm::Config) + 172 at shared_realm.cpp:183
    frame #20: 0x0078ae9a Realm`+[RLMRealm openSharedRealm:error:](self=0x00956104, _cmd="openSharedRealm:error:", config=0x15e98a88, outError=domain: nil - code: 4186848) + 154 at RLMRealm.mm:279
    frame #21: 0x0078b7a8 Realm`+[RLMRealm realmWithConfiguration:error:](self=0x00956104, _cmd="realmWithConfiguration:error:", configuration=0x15e98a80, error=domain: nil - code: 4186848) + 2096 at RLMRealm.mm:350

I was able to reproduce by doing the following:

  1. cd realm-core
  2. git checkout v0.97.0
  3. REALM_ENABLE_ENCRYPTION=yes REALM_ENABLE_ASSERTIONS=yes sh build.sh config
  4. REALM_COCOA_PLATFORMS="iphone" sh build.sh build-cocoa
  5. cd ../realm-cocoa
  6. git checkout v0.98.4
  7. curl "https://gist.githubusercontent.com/bdash/d7ee7bfb4624c57f689f/raw/6bd1be5f75cc42c52ed94fbe7a0bf0f635c822c4/armv7-only.patch" | patch -p1 (hacks the Xcode projects to build for armv7 only)
  8. open examples/ios/objc/RealmExamples.xcworkspace
  9. Build and run the Migration scheme on an iOS device.

A git bisect pinpointed b78b70438dbe6dba89193c16a4fdc341f27b4f27 as the culprit for the breakage. Nothing in that change jumps out at me as an obvious cause, but I confirmed the bisect result by switching between b78b70438dbe6dba89193c16a4fdc341f27b4f27 and its parent.

/cc @kspangsege

tgoyne commented 8 years ago

The change from placement-new on an mmapped buffer to copying a SharedInfo into the file is not valid, because SharedInfo has pthread_mutex_t members. Since they're unlocked at the time, it's unsurprising that this happens to work on many platforms, but POSIX doesn't require that memcpy on a process-shared mutex work.
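For concreteness, the scheme in question can be sketched as a small standalone program (POSIX system assumed; the Info struct, path, and function name are illustrative, not Realm's actual code): a process-shared mutex is initialized in a stack-local struct, its raw bytes are written to a file, and the mutex is then locked through an mmapped view of that file.

```cpp
#include <cassert>
#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

// Illustrative stand-in for a SharedInfo-like structure.
struct Info {
    pthread_mutex_t mutex;
};

// Returns the result of locking the copied mutex through the mapping
// (0 = success).
int copied_mutex_demo(const char* path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    // Initialize the mutex in a stack-local Info, then write its raw
    // bytes to the file -- the step argued above to be unsanctioned.
    Info local;
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&local.mutex, &attr);
    pthread_mutexattr_destroy(&attr);
    write(fd, &local, sizeof local);

    // Map the file and lock the byte-copied mutex at its new address.
    void* addr = mmap(nullptr, sizeof(Info), PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
        close(fd);
        return -1;
    }
    Info* mapped = static_cast<Info*>(addr);
    int rc = pthread_mutex_lock(&mapped->mutex);
    if (rc == 0)
        pthread_mutex_unlock(&mapped->mutex);

    munmap(addr, sizeof(Info));
    close(fd);
    unlink(path);
    return rc;
}
```

On typical desktop Linux this succeeds, which matches the observation that the scheme "happens to work on many platforms" even though POSIX does not promise it.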

kspangsege commented 8 years ago

@tgoyne, I have a hard time making sense of that.

First of all, do we agree that the following scenario is supposed to work?

  1. Create file with shared mutex the way core used to do it (by placement new on mapped file).
  2. Reboot computer.
  3. Map file and lock mutex.

If that works, why would the new scheme used by SharedGroup::do_open() not work? How can step 3 tell the difference between a mutex in a file prepared before a reboot, and one that was prepared by copying some bytes into a file? What is the significant difference?

kspangsege commented 8 years ago

@tgoyne, here is another scenario that is supposed to work as far as I know:

  1. Run program to initialize mutex in file A.
  2. Copy file A to file B.
  3. Run another program that maps file B and locks the mutex.

Again, if this works, why not the scheme used in SharedGroup::do_open()?
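The file-copy scenario above can be sketched as a single program playing all three roles (POSIX system assumed; file names and function names are illustrative). On Linux this typically succeeds, which is the point of the scenario; whether it is guaranteed to succeed is the question under debate.

```cpp
#include <cassert>
#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

// Step 1: initialize a process-shared mutex inside file `path` via an
// mmapped view (the placement-style scheme core used to use).
void init_mutex_file(const char* path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    ftruncate(fd, sizeof(pthread_mutex_t));
    void* addr = mmap(nullptr, sizeof(pthread_mutex_t),
                      PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(static_cast<pthread_mutex_t*>(addr), &attr);
    pthread_mutexattr_destroy(&attr);
    munmap(addr, sizeof(pthread_mutex_t));
    close(fd);
}

// Step 2: byte-for-byte copy of file `src` to `dst`.
void copy_file(const char* src, const char* dst)
{
    char buf[sizeof(pthread_mutex_t)];
    int in = open(src, O_RDONLY);
    read(in, buf, sizeof buf);
    close(in);
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    write(out, buf, sizeof buf);
    close(out);
}

// Step 3: map `path` and attempt to lock the mutex found there;
// returns the pthread_mutex_lock() result (0 = success).
int lock_mutex_in_file(const char* path)
{
    int fd = open(path, O_RDWR);
    void* addr = mmap(nullptr, sizeof(pthread_mutex_t),
                      PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    pthread_mutex_t* m = static_cast<pthread_mutex_t*>(addr);
    int rc = pthread_mutex_lock(m);
    if (rc == 0)
        pthread_mutex_unlock(m);
    munmap(addr, sizeof(pthread_mutex_t));
    close(fd);
    return rc;
}
```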

kspangsege commented 8 years ago

As far as I can tell, POSIX only requires that a process shared mutex remains mapped at a fixed address from the point of view of process A while it is locked by process A.

In particular, when the mutex is not locked by anybody, its state is self-contained, and can therefore be cloned and copied around as desired.

kspangsege commented 8 years ago

@tgoyne, I will be grateful if you can just hint at the kind of implementation of process shared mutexes that would invalidate the scheme currently used in SharedGroup::do_open().

bdash commented 8 years ago

This doesn't reproduce in the iOS simulator, only on an iOS device. Any iOS device capable of running iOS 9 should be able to reproduce this when following the steps I outlined above.

tgoyne commented 8 years ago

First of all, do we agree that the following scenario is supposed to work?

No, I do not see anything in POSIX 2013 that even hints at either of those being a supported scenario. pthread_mutex_t is explicitly an opaque type that can only be manipulated with the provided functions, and PTHREAD_PROCESS_SHARED merely makes it so that the mutex can be used by any thread which has access to it, even if the mutex is created in shared memory. Mmapping a copy of the file would not give you access to the same memory.

kspangsege commented 8 years ago

Mmapping a copy of the file would not give you access to the same memory.

Of course not. I only meant to imply that the copy was itself a valid mutex that can be mapped and then locked.

kspangsege commented 8 years ago

No, I do not see anything in POSIX 2013 that even hints at either of those being a supported scenario.

I agree that POSIX is not explicit about it, but from the man pages it is clear that the following scenario is valid:

  1. Program A creates a file and initializes a process shared mutex in that file.
  2. Program A terminates.
  3. Time passes while nobody has the file open.
  4. Program B opens the file, maps it, and locks the process shared mutex.

So what you are saying is that this scenario is only valid if the system is not rebooted during step 3 (time passes).

I would find that very odd for several reasons:

kspangsege commented 8 years ago

And:

kspangsege commented 8 years ago

All I am saying is that, as far as I can see, one can deduce from the POSIX man pages that a process shared mutex has to have a self-contained state (no critical information stored outside the mutex object) while it is not locked by anybody (step 3, time passes).

It is of course a separate question whether iOS adheres to that rule or not.

kspangsege commented 8 years ago

All I am saying is that, as far as I can see, one can deduce from the POSIX man pages that a process shared mutex has to have a self-contained state (no critical information stored outside the mutex object) while it is not locked by anybody (step 3, time passes).

@tgoyne, please let me know if you know anything specific about the iOS implementation of process shared mutexes that makes it not adhere to this rule.

kspangsege commented 8 years ago

@tgoyne, please see "Process Shared Memory and Synchronization" in the man page of pthread_mutexattr_init. Note especially the semaphore example, and this remark:

In particular, these processes may exist beyond the lifetime of the initializing process.

kspangsege commented 8 years ago

Here is a little extra analysis from me:

One can imagine that iOS allocates part of the mutex state in a slot of memory somewhere outside the mutex object whenever a shared mutex is initialized. In that case the slot must be identified during any subsequent access to the mutex, and that would invalidate the reboot scenario.

However, since a file containing a process shared mutex can be created via one map, and then later accessed via another map at a different address (even inside the same process, or thread), we can safely assume that the address of the mutex object is not stored (at least not for any critical purpose) while the mutex is not locked by anybody.

Furthermore, even if iOS behaves like this, the current implementation of SharedGroup::do_open() would still work, as the mutex object is still initialized before being used.

bdash commented 8 years ago

A little digging around in Apple's pthread implementation suggests that the alignment of info in the following block from SharedGroup::do_open is the relevant factor:

// Write an initialized SharedInfo structure to the file, but with
// init_complete = 0.
SharedInfo info(durability, history_type);
m_file.write(reinterpret_cast<char*>(&info), sizeof info); // Throws

Information about the alignment of a mutex is stored within the pthread_mutex structure during pthread_mutex_init. If the alignment of info differs from the alignment of the SharedInfo structure in the mapped region, the alignment information in the pthread_mutex structures will not match the alignment of the addresses the mutexes are mapped at. Changing the declaration of info to alignas(64) SharedInfo gives the structure the same alignment as when it's remapped, and we no longer crash.
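The effect of the fix can be sketched like this (the struct and its fields are placeholders, not Realm's actual SharedInfo declaration; a POSIX system is assumed):

```cpp
#include <cassert>
#include <cstdint>
#include <pthread.h>

// Placeholder stand-in for Realm's SharedInfo; the fields shown are
// assumptions for illustration only.
struct SharedInfo {
    pthread_mutex_t writemutex;
    std::uint8_t init_complete;
};

// On 32-bit ARM a plain stack-local SharedInfo may sit on a 4-byte
// boundary, while the copy in the file is later mapped page-aligned.
// If pthread_mutex_init() records alignment-dependent state inside the
// mutex, that recorded state no longer matches after remapping and
// locking faults with EXC_ARM_DA_ALIGN. Forcing 64-bit alignment on
// the local object makes it agree with the mapped copy.
bool aligned_local_matches_mapping()
{
    alignas(8) SharedInfo info; // the suggestion above used alignas(64)
    return reinterpret_cast<std::uintptr_t>(&info) % 8 == 0;
}
```

Since mmap() returns page-aligned addresses, any alignment value that divides the page size keeps the stack-local and remapped copies consistent, which is why a smaller alignas value can serve the same purpose as alignas(64).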

finnschiermer commented 8 years ago

@bdash: Thx for digging into this, we'll make a fix.

kspangsege commented 8 years ago

@bdash, thanks a lot for pinpointing the problem. One thing, though: I assume you meant to align on a 64-bit boundary, in which case alignas(8) will suffice. Note that alignas() takes a "number of bytes" argument.

Also thanks a lot for the precise repro instructions.

@finnschiermer will implement a fix and release another version of core today.