realm / realm-java

Realm is a mobile database: a replacement for SQLite & ORMs
http://realm.io
Apache License 2.0
11.47k stars 1.75k forks source link

ANR in getInstance() #5386

Closed raja-baz closed 7 years ago

raja-baz commented 7 years ago

Expected Results

Get a realm instance :)

Actual Results

ANR. This comes from play store ANRs(on ~20k daily actives, we're getting this one about 1000 times a week, which is quite high):

"main" prio=5 tid=1 Native
  | group="main" sCount=1 dsCount=0 obj=0x758b4598 self=0xb48a4500
  | sysTid=7073 nice=0 cgrp=bg_non_interactive sched=0/0 handle=0xb6feed54
  | state=D schedstat=( 0 0 0 ) utm=34 stm=19 core=2 HZ=100
  | stack=0xbe4d0000-0xbe4d2000 stackSize=8MB
  | held mutexes=
  native: pc 0000000000041eb8  /system/lib/libc.so (msync+12)
  native: pc 00000000000ef1bd  /data/app/com.anghami-2/lib/arm/librealm-jni.so (???)
  native: pc 00000000000986d7  /data/app/com.anghami-2/lib/arm/librealm-jni.so (???)
  native: pc 00000000000f4189  /data/app/com.anghami-2/lib/arm/librealm-jni.so (???)
  native: pc 00000000000f4b93  /data/app/com.anghami-2/lib/arm/librealm-jni.so (???)
  native: pc 0000000000075901  /data/app/com.anghami-2/lib/arm/librealm-jni.so (???)
  native: pc 0000000000077d2d  /data/app/com.anghami-2/lib/arm/librealm-jni.so (???)
  native: pc 0000000000084f5d  /data/app/com.anghami-2/lib/arm/librealm-jni.so (???)
  native: pc 00000000000854cd  /data/app/com.anghami-2/lib/arm/librealm-jni.so (???)
  native: pc 00000000000760eb  /data/app/com.anghami-2/lib/arm/librealm-jni.so (???)
  native: pc 0000000000030513  /data/app/com.anghami-2/lib/arm/librealm-jni.so (Java_io_realm_internal_SharedRealm_nativeGetSharedRealm+42)
  native: pc 00000000009f0519  /data/app/com.anghami-2/oat/arm/base.odex (Java_io_realm_internal_SharedRealm_nativeGetSharedRealm__JLio_realm_internal_RealmNotifier_2+116)
  at io.realm.internal.SharedRealm.nativeGetSharedRealm (SharedRealm.java)
  at io.realm.internal.SharedRealm.<init> (SharedRealm.java:180)
  at io.realm.internal.SharedRealm.<init> (SharedRealm.java:180)
  at io.realm.internal.SharedRealm.getInstance (SharedRealm.java:227)
  at io.realm.internal.SharedRealm.getInstance (SharedRealm.java:217)
  at io.realm.RealmCache.doCreateRealmOrGetFromCache (RealmCache.java:317)
- locked <0x07c608e4> (a io.realm.ar)
  at io.realm.RealmCache.createRealmOrGetFromCache (RealmCache.java:281)
  at io.realm.Realm.getInstance (Realm.java:346)
  at com.anghami.data.local.RealmHelper._getInstance (RealmHelper.java:205)
  at com.anghami.data.local.RealmHelper.getInstance (RealmHelper.java:188)
- locked <0x0a5f714d> (a java.lang.Class<com.anghami.data.c.h>)
  at com.anghami.data.local.RealmHelper.getLongLivedCacheRealm (RealmHelper.java:78)
  at com.anghami.data.local.FollowedItems.<init> (FollowedItems.java:222)
  at com.anghami.data.local.FollowedItems.initialize (FollowedItems.java:184)
  at com.anghami.AnghamiApplication.onCreate (AnghamiApplication.java:64)
  at android.app.Instrumentation.callApplicationOnCreate (Instrumentation.java:1036)
  at android.app.ActivityThread.handleBindApplication (ActivityThread.java:6317)
  at android.app.ActivityThread.access$1800 (ActivityThread.java:221)
  at android.app.ActivityThread$H.handleMessage (ActivityThread.java:1860)
  at android.os.Handler.dispatchMessage (Handler.java:102)
  at android.os.Looper.loop (Looper.java:158)
  at android.app.ActivityThread.main (ActivityThread.java:7225)
  at java.lang.reflect.Method.invoke! (Native method)
  at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run (ZygoteInit.java:1230)
  at com.android.internal.os.ZygoteInit.main (ZygoteInit.java:1120)

Steps & Code to Reproduce

Can't reproduce easily as this comes from play store reports.

Version of Realm and tooling

Realm version(s): 3.7.2

Realm sync feature enabled: no

Which Android version and device: All android versions we support(5 through 8) have reports for this issue and many devices

raja-baz commented 7 years ago

We also have other ANRs that happen in getInstance(). Some of them have different traces in native. Will include another one here which we also get a lot of(maybe related, maybe not, I can't tell):

This one we get ~900 of per week. The stack trace on our end of the app varies, but once we hit io.realm.internal it's always the same. Also, the realm being queried here is encrypted

"main" prio=5 tid=1 Native
  | group="main" sCount=1 dsCount=0 obj=0x7532a578 self=0xb7b576f8
  | sysTid=20429 nice=0 cgrp=default sched=0/0 handle=0xb6f9dde4
  | state=S schedstat=( 665686472 47291765 1057 ) utm=30 stm=36 core=3 HZ=100
  | stack=0xbe219000-0xbe21b000 stackSize=8MB
  | held mutexes=
  native: pc 00000000000169f8  /system/lib/libc.so (syscall+28)
  native: pc 00000000000f5ecd  /system/lib/libart.so (_ZN3art17ConditionVariable4WaitEPNS_6ThreadE+80)
  native: pc 00000000002867ab  /system/lib/libart.so (_ZN3art3JNI16NewWeakGlobalRefEP7_JNIEnvP8_jobject+602)
  native: pc 000000000005f9d3  /data/app/com.anghami-2/lib/arm/librealm-jni.so (???)
  native: pc 0000000000030609  /data/app/com.anghami-2/lib/arm/librealm-jni.so (Java_io_realm_internal_SharedRealm_nativeGetSharedRealm+288)
  native: pc 00000000009d0cb1  /data/app/com.anghami-2/oat/arm/base.odex (Java_io_realm_internal_SharedRealm_nativeGetSharedRealm__JLio_realm_internal_RealmNotifier_2+116)
  at io.realm.internal.SharedRealm.nativeGetSharedRealm (SharedRealm.java)
  at io.realm.internal.SharedRealm.<init> (SharedRealm.java:180)
  at io.realm.internal.SharedRealm.<init> (SharedRealm.java:180)
  at io.realm.internal.SharedRealm.getInstance (SharedRealm.java:227)
  at io.realm.internal.SharedRealm.getInstance (SharedRealm.java:217)
  at io.realm.RealmCache.doCreateRealmOrGetFromCache (RealmCache.java:317)
- locked <0x05a1c407> (a io.realm.ar)
  at io.realm.RealmCache.createRealmOrGetFromCache (RealmCache.java:281)
  at io.realm.Realm.getInstance (Realm.java:346)
Zhuinden commented 7 years ago

What is

at com.anghami.AnghamiApplication.onCreate (AnghamiApplication.java:64)

and

  at com.anghami.data.local.FollowedItems.<init> (FollowedItems.java:222)
  at com.anghami.data.local.FollowedItems.initialize (FollowedItems.java:184)

and

  at com.anghami.data.local.RealmHelper._getInstance (RealmHelper.java:205)
  at com.anghami.data.local.RealmHelper.getInstance (RealmHelper.java:188)
  at com.anghami.data.local.RealmHelper.getLongLivedCacheRealm (RealmHelper.java:78)

?

raja-baz commented 7 years ago

AnghamiApplication:

FollowedItems.initialize();

That's a singleton. initialize calls the constructor which gets an instance of realm to use for later:

FollowedItems(line 222)

mCacheRealm = RealmHelper.getLongLivedCacheRealm(listener);

The code in RealmHelper looks like this:

     74   public static RealmWrapper getLongLivedCacheRealm(@Nonnull RealmWrapper.Listener listener) {
     75     if (Looper.myLooper() == null) {
     76       throw new IllegalStateException("Long lived realms are only valid on threads with loopers");
     77     }
     78     return new RealmWrapper(getInstance(RealmConfig.sCacheRealmConfig), listener);
     79   }

Which calls through to getInstance with a specific configuration:

This is the relevant part of the top-level getInstance():

    186   private synchronized static Realm getInstance(RealmConfiguration configuration) {
    187     try {
    188       return _getInstance(configuration);
    189     } catch (RuntimeException e) {

And _getInstance

    201   private static Realm _getInstance(RealmConfiguration configuration) {
    202     if (configuration == null) {
    203       return Realm.getDefaultInstance();
    204     }
    205     return Realm.getInstance(configuration);
    206   }

Basically, we're initialising a singleton which holds on to a long-lived Realm instance(which it will be adding listeners to). There's some wrapping calls between that and Realm.getInstance() but really nothing special. The thread hangs in getInstance() on the Realm side.

Zhuinden commented 7 years ago

I'm actually not sure what's up, but I don't think the getInstance call needs to be synchronized. Realm is thread-local anyways and handles synchronization internally in the RealmCache.

I do not know if this affects anything, though.

raja-baz commented 7 years ago

I know it doesn't need to be, we do that for other reasons I've posted about elsewhere in issues. Basically we've had crashes due to corrupt realm files and the way we're handling that until we can figure out the corruption issue is by restoring from backups we make periodically (you can see the try/catch there) for that to work, all instances need to be closed. So we synchronize getInstance and send out a message to consumers to close their instances before proceeding with the restore.

This doesn't come into play here though for 2 reasons:

  1. We've had this ANR on previous versions which didn't have the recovery code or the synchronization
  2. This is during app startup and no other instances are open anyway, this is the first one

On Thu, Oct 5, 2017 at 3:15 PM Gabor Varadi notifications@github.com wrote:

I'm actually not sure what's up, but I don't think the getInstance call needs to be synchronized. Realm is thread-local anyways and handles synchronization internally.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/realm/realm-java/issues/5386#issuecomment-334446733, or mute the thread https://github.com/notifications/unsubscribe-auth/ABNP7ke1R5Tsean1qlVqY7TgGP4nG7nIks5spMhfgaJpZM4Pu6ac .

nhachicha commented 7 years ago

It's hard to tell without seeing the entire code, but judging from the ANR stack it might be a deadlock situation. You can see that the main thread is holding a lock

...
locked <0x07c608e4> (a io.realm.ar)
...
locked <0x0a5f714d> 

Another thread may be waiting to obtain this lock (via an intrinsic/synchronize lock) you might see it in the trace of another thread by

waiting to lock ...

is this the entire ANR trace? have you tried to profile the code to see if the ANR is due to a long init or a deadlock?

raja-baz commented 7 years ago

@nhachicha :

  1. Long init: I seriously doubt it, This is just a call to Realm.getInstance() which should be really fast(and definitely not the several seconds it would take to trigger an ANR). Also, the trace inside the library(on the native end) is the exact same one for all reports. If it were long init you would expect it to land differently depending on the specific timing of when it got killed after taking too long. But just to be thorough, what would cause the init to take a long time? Lots of models(we don't have that many...)?

  2. Deadlock from the non-native side of things:

locked <0x07c608e4> (a io.realm.ar)

This is this(RealmCache:284):

    private synchronized <E extends BaseRealm> E doCreateRealmOrGetFromCache(RealmConfiguration configuration,
            Class<E> realmClass) {

And the other one:

locked <0x0a5f714d>

Is this in our own code(RealmHelper:186):

  private synchronized static Realm getInstance(RealmConfiguration configuration) {

The only way those particular locks would deadlock is if the following scenario occurred:

  1. Thread A acquires some lock in native code then calls either RealmHelper.getInstance() or RealmCache.doCreateOrGetFromCache on the java end.
  2. The main thread is now waiting on that lock in native code.

I find that highly unlikely. Also, as I mentioned above, we can discard the RealmHelper.getInstance() lock as a candidate because we have this same ANR from before we introduced that one.

In any case, I've attached the 2 full ANR reports which includes the states of all the threads. You can have a look at that if that helps.

anr_1.txt anr_2.txt

raja-baz commented 7 years ago

Another thing to notice here. The final call in the native stack trace is to /system/lib/libc.so (msync+12) Which looks like something to do mmap()? Maybe deadlocking with the garbage collector or something related?

Would it be possible to symbolise the native calls in librealmjni.so so we know that that code's doing?

nhachicha commented 7 years ago

call stack with symbols:

realm::MakeLinkVector::consume(unsigned int)
realm::util::msync(void*, unsigned int)
realm::SlabAlloc::attach_file(std::string const&, realm::SlabAlloc::Config&)
realm::SharedGroup::do_open(std::string const&, bool, bool, realm::SharedGroupOptions)
realm::SharedGroup::compact()
realm::Realm::Realm(realm::Realm::Config, std::shared_ptr<realm::_impl::RealmCoordinator>)
std::__shared_ptr<realm::Realm::make_shared_realm(realm::Realm::Config, std::shared_ptr<realm::_impl::RealmCoordinator>)::make_shared_enabler, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<realm::Realm::make_shared_realm(realm::Realm::Config, std::shared_ptr<realm::_impl::RealmCoordinator>)::make_shared_enabler>, realm::Realm::Config, std::shared_ptr<realm::_impl::RealmCoordinator> >(std::_Sp_make_shared_tag, std::allocator<realm::Realm::make_shared_realm(realm::Realm::Config, std::shared_ptr<realm::_impl::RealmCoordinator>)::make_shared_enabler> const&, realm::Realm::Config&&, std::shared_ptr<realm::_impl::RealmCoordinator>&&)
realm::_impl::RealmCoordinator::get_realm(realm::Realm::Config)
realm::Realm::get_shared_realm(realm::Realm::Config)
Java_io_realm_internal_SharedRealm_nativeGetSharedRealm

are you calling compact ?

Zhuinden commented 7 years ago

Maybe compactOnLaunch() parameter in RealmConfiguration?

raja-baz commented 7 years ago

We have 3 realms we use. Only 1 of them has compactOnLaunch(), I'll have a look at the stacktrace from a larger sample of reports and see if all of them are hitting that Realm

raja-baz commented 7 years ago

@Zhuinden @nhachicha Though one thing right off the bat, the second stack trace(in the first comment) is from a different group of ANR reports and that one is on a realm that doesn't have compactOnLaunch() though it does have encryption turned on

raja-baz commented 7 years ago

Can we get symbols for that as well?

raja-baz commented 7 years ago

So a lot of the ANRs do seem to be the specific realm with compactOnLaunch(). I'm going to remove that for now to hopefully reduce the number of ANRs. As for the others, that same group of ANRs seems to include cases that happen when committing a transaction:

"main" prio=5 tid=1 Native
  | group="main" sCount=1 dsCount=0 obj=0x740a82a0 self=0xb4c76500
  | sysTid=4318 nice=0 cgrp=bg_non_interactive sched=0/0 handle=0xb6f32b34
  | state=D schedstat=( 594170122 262049630 907 ) utm=47 stm=12 core=3 HZ=100
  | stack=0xbe6b5000-0xbe6b7000 stackSize=8MB
  | held mutexes=
  native: pc 00000000000417ec  /system/lib/libc.so (msync+12)
  native: pc 00000000000ef1bd  /data/app/com.anghami-1/lib/arm/librealm-jni.so (???)
  native: pc 00000000000f4ef1  /data/app/com.anghami-1/lib/arm/librealm-jni.so (???)
  native: pc 00000000000f5d75  /data/app/com.anghami-1/lib/arm/librealm-jni.so (???)
  native: pc 00000000000f396f  /data/app/com.anghami-1/lib/arm/librealm-jni.so (???)
  native: pc 00000000000f3b35  /data/app/com.anghami-1/lib/arm/librealm-jni.so (???)
  native: pc 00000000000f4c65  /data/app/com.anghami-1/lib/arm/librealm-jni.so (???)
  native: pc 00000000000827cb  /data/app/com.anghami-1/lib/arm/librealm-jni.so (???)
  native: pc 00000000000762b1  /data/app/com.anghami-1/lib/arm/librealm-jni.so (???)
  native: pc 000000000002f22f  /data/app/com.anghami-1/lib/arm/librealm-jni.so (Java_io_realm_internal_SharedRealm_nativeCommitTransaction+22)
  native: pc 000000000061b87d  /data/app/com.anghami-1/oat/arm/base.odex (Java_io_realm_internal_SharedRealm_nativeCommitTransaction__J+80)
  at io.realm.internal.SharedRealm.nativeCommitTransaction (SharedRealm.java)
  at io.realm.internal.SharedRealm.commitTransaction (SharedRealm.java:237)
  at io.realm.BaseRealm.commitTransaction (BaseRealm.java:400)
  at io.realm.Realm.commitTransaction (Realm.java:132)
  at io.realm.Realm.executeTransaction (Realm.java:1350)
  at com.anghami.data.local.RealmHelper.executeTransaction (RealmHelper.java:102)
  at com.anghami.data.local.RealmHelper.executeTransaction (RealmHelper.java:87)
  at com.anghami.player.cache.CacheItem.init (CacheItem.java:54)
  at com.anghami.player.cache.CacheItem.<init> (CacheItem.java:49)
  at com.anghami.player.cache.OdinCache.get (OdinCache.java:50)
  at com.anghami.player.core.SimpleAnghamiPlayer.buildCacheDataSourceFactory (SimpleAnghamiPlayer.java:530)
  at com.anghami.player.core.SimpleAnghamiPlayer.buildMediaSource (SimpleAnghamiPlayer.java:500)
  at com.anghami.player.core.SimpleAnghamiPlayer.prepare (SimpleAnghamiPlayer.java:331)
  at com.anghami.player.core.OdinPlayer.refreshPlayers (OdinPlayer.java:319)
  at com.anghami.player.core.OdinPlayer.<init> (OdinPlayer.java:184)
  at com.anghami.player.core.OdinPlayer.getSharedInstance (OdinPlayer.java:200)
  at com.anghami.AnghamiApplication.onCreate (AnghamiApplication.java:97)
  at android.app.Instrumentation.callApplicationOnCreate (Instrumentation.java:1014)
  at android.app.ActivityThread.handleBindApplication (ActivityThread.java:4806)
  at android.app.ActivityThread.access$1600 (ActivityThread.java:154)
  at android.app.ActivityThread$H.handleMessage (ActivityThread.java:1452)
  at android.os.Handler.dispatchMessage (Handler.java:102)
  at android.os.Looper.loop (Looper.java:234)
  at android.app.ActivityThread.main (ActivityThread.java:5526)
  at java.lang.reflect.Method.invoke! (Native method)
  at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run (ZygoteInit.java:726)
  at com.android.internal.os.ZygoteInit.main (ZygoteInit.java:616)

That trace seems to end up at the same place in the end, but is coming from a very different stack. I've attached the fully report as well. Note that the realm in question here does not have either compacting or encryption

anr_3.txt

nhachicha commented 7 years ago

here are the symbols for the stack trace without compact or encryption (anr_3)

  realm::Compare<realm::Equal, realm::StringData, realm::Subexpr, realm::Subexpr>::find_first(unsigned int, unsigned int) const
  realm::util::msync(void*, unsigned int)
  realm::GroupWriter::get_free_space()
  realm::GroupWriter::commit(unsigned int)
  realm::SharedGroup::low_level_commit(unsigned long long)
  realm::SharedGroup::do_commit()
  realm::SharedGroup::commit_and_continue_as_read()
  realm::_impl::RealmCoordinator::commit_write(realm::Realm&)
  realm::Realm::commit_transaction()
  Java_io_realm_internal_SharedRealm_nativeCommitTransaction

Looks like the ANR happens when trying to memory map (msync)...

According to the stack, you're committing a transaction on the main thread? is this commit still related to getInstance? you can work around this by doing the expensive initialization on a worker thread using getInstanceAsync

Let us know how it did behave without the compactOnLaunch

nhachicha commented 7 years ago

This may also be related to https://github.com/realm/realm-java/issues/5391 since the OP is reporting delays when using getDefaultInstance on his model

cmelchior commented 7 years ago

Closing due to no response.

Zhuinden commented 7 years ago

Maybe migration?

hmedat commented 6 years ago

Hello Guys, I'm still faceing this issue,

at io.realm.internal.OsSharedRealm.nativeGetSharedRealm (OsSharedRealm.java) at io.realm.internal.OsSharedRealm.<init> (OsSharedRealm.java:172) at io.realm.internal.OsSharedRealm.<init> (OsSharedRealm.java:172) at io.realm.internal.OsSharedRealm.getInstance (OsSharedRealm.java:219) at io.realm.BaseRealm.<init> (BaseRealm.java:124) at io.realm.BaseRealm.<init> (BaseRealm.java:124) at io.realm.BaseRealm.<init> (BaseRealm.java:93) at io.realm.BaseRealm.<init> (BaseRealm.java:93) at io.realm.BaseRealm.<init> (BaseRealm.java:93) at io.realm.Realm.<init> (Realm.java:153) at io.realm.Realm.createInstance (Realm.java:424) at io.realm.RealmCache.doCreateRealmOrGetFromCache (RealmCache.java:342) at io.realm.RealmCache.createRealmOrGetFromCache (RealmCache.java:282) at io.realm.Realm.getInstance (Realm.java:353) at com.opensooq.OpenSooq.util.RealmUtil.getRealm (RealmUtil.java:28) at com.opensooq.OpenSooq.config.dataSource.CountryLocalDataSource.getRealm (CountryLocalDataSource.java:60) at com.opensooq.OpenSooq.config.dataSource.CountryLocalDataSource.getMyCountry (CountryLocalDataSource.java:164) at com.opensooq.OpenSooq.util.CrashlyticsUtil.setupUserInfo (CrashlyticsUtil.java:38) at com.opensooq.OpenSooq.App.setupCrashlytics (App.java:173) at com.opensooq.OpenSooq.App.onCreate (App.java:95) at android.app.Instrumentation.callApplicationOnCreate (Instrumentation.java:1017) at android.app.ActivityThread.handleBindApplication (ActivityThread.java:4556) at android.app.ActivityThread.access$1400 (ActivityThread.java:150) at android.app.ActivityThread$H.handleMessage (ActivityThread.java:1340) at android.os.Handler.dispatchMessage (Handler.java:99) at android.os.Looper.loop (Looper.java:137) at android.app.ActivityThread.main (ActivityThread.java:5283) at java.lang.reflect.Method.invokeNative (Method.java) at java.lang.reflect.Method.invoke (Method.java:511) at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run (ZygoteInit.java:1102) at com.android.internal.os.ZygoteInit.main (ZygoteInit.java:869) at dalvik.system.NativeStart.main (NativeStart.java))

This report from Google play

screen shot 2018-04-23 at 4 02 30 pm screen shot 2018-04-23 at 4 02 37 pm

Version of Realm and tooling

Realm version(s): 4.2.0 Realm sync feature enabled: no Android Studio version: 3.0.1

StackStrace

screen shot 2018-04-23 at 4 10 37 pm screen shot 2018-04-23 at 4 17 10 pm screen shot 2018-04-23 at 4 13 00 pm screen shot 2018-04-23 at 4 14 03 pm screen shot 2018-04-23 at 4 14 46 pm screen shot 2018-04-23 at 4 15 02 pm

@cmelchior @beeender @nhachicha Tell me guys if you need any more info

nhachicha commented 6 years ago

regarding https://user-images.githubusercontent.com/4534097/39128803-67490d96-4711-11e8-9860-fb41bbcac665.png

I don't think this is a good pattern, since you return country which is a managed object then close it's Realm .... this might work since the Realm is not really closed until the refcount drops to 0, but you shouldn't rely on this side effect ...

Zhuinden commented 6 years ago

@nhachicha he retrieves RealmCountry and then creates Country.

This should work for non-Sync Realms.

hmedat commented 6 years ago

Why ANR is happening?

Zhuinden commented 6 years ago

Probably you have a migration that has transform() on many items