realm / realm-kotlin

Kotlin Multiplatform and Android SDK for the Realm Mobile Database: Build Better Apps Faster.
Apache License 2.0

Increasing memory usage on writes #1618

Open cmelchior opened 6 months ago

cmelchior commented 6 months ago

It looks like we might have a memory leak somewhere involving writes. A simple loop like this running on Android shows a small, continuous increase in memory usage:

viewModelScope.launch {
   while(true) {
       realm.write {
           copyToRealm(Sample().apply { stringField = Utils.createRandomString(1024*1024) })
       }
   }
}

It can take several minutes (if not more) for it to show, so I am still not 100% sure there is a leak, but we have a customer who is running into crashes after several hours with a loop that writes 25 times per second.
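
For anyone trying to reproduce this outside the SDK repository: Sample and Utils.createRandomString come from the test sources. A minimal stand-in (illustrative only, field names inferred from the snippet above) could look like this:

import io.realm.kotlin.types.RealmObject

// Illustrative stand-ins for the test helpers used in the loop above; the real
// Sample class and Utils live in the SDK's test sources.
class Sample : RealmObject {
    var stringField: String = ""
}

object Utils {
    // Builds a random lowercase ASCII string of the requested length.
    fun createRandomString(length: Int): String =
        buildString(length) { repeat(length) { append(('a'..'z').random()) } }
}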

sync-by-unito[bot] commented 6 months ago

➤ rorbech commented:

This is the public variant of HELP-53315, to investigate a potential memory leak on writes ... though we cannot currently replicate it.

cmelchior commented 5 months ago

This seems to replicate it on an Android Pixel 5 (API 33) emulator:

  GlobalScope.launch(Dispatchers.Default) {
    while(true) {
      val p: Float = Random.nextFloat()
      val now = Instant.now()

      viewModelScope.launch(Dispatchers.IO) {
        kotlin.runCatching {
          realm.write {
            copyToRealm(Pressure().apply {
              timestamp = now.toEpochMilli()
              hPa = p
            })
          }
        }.onFailure { Log.e("TAG", it.stackTraceToString(), it) }
      }

      val frequency =
        if (lastTime != now)
          (1000.0 / (between(lastTime, now).toMillis())).roundToInt()
        else
          0

      lastTime = now

      _text.postValue(
        "${
          startTime.atZone(ZoneId.systemDefault()).toLocalDateTime()
            .truncatedTo(ChronoUnit.SECONDS)
        }\n" +
                "${between(startTime, now).toHoursMinutesSecondsShort()}\n" +
                "${"%.4f".format(p)}\n" +
                "$frequency Hz")
    }
  }
class Pressure : RealmObject
{
  @PrimaryKey
  var _id: ObjectId = ObjectId()
  var outingId: ObjectId? = null
  @Index
  var timestamp: Long = 0L
  var hPa: Float = 0.0F
  var hPa0: Float? = null
  var hPa0Observed: Boolean? = null
  var slope: Int? = null
  var accuracy: Int? = null

  val meters get() = hPaToMeters(hPa, hPa0)
  val feet get() = metersToFeet(meters)

  val timeString get() = timestamp.toLocalTimeString()
  val dateString get() = timestamp.toLocalDateString()

  var time: Instant
    get() = Instant.ofEpochMilli(timestamp)
    set(value)
    {
      timestamp = value.toEpochMilli()
    }
}

At least it shows an increase in the "Other" memory region that does not come down again.

cmelchior commented 5 months ago

After more testing, it seems to be related to the number of active versions:

[image]

Modifying the code to

  GlobalScope.launch(Dispatchers.Default) {
    while(true) {
      val p: Float = Random.nextFloat()
      val now = Instant.now()
      viewModelScope.launch(Dispatchers.IO) {
          realm.write<Unit> {
            copyToRealm(Pressure().apply {
              timestamp = now.toEpochMilli()
              hPa = p
            })
          }
      }
      _text.postValue("Version: ${realm.getNumberOfActiveVersions()}")
    }
  }

shows that the number of active versions keeps increasing even though nothing should be holding on to them. My best guess is that our internal GC isn't fast enough to keep up with the massive number of references being created.
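
As a side note for anyone debugging similar symptoms: capping the number of active versions in the configuration turns this kind of pile-up into a fast failure instead of silent file growth. A minimal sketch, under the assumption that a maxNumberOfActiveVersions() builder option is available in the Kotlin SDK as it is in Realm Java:

import io.realm.kotlin.Realm
import io.realm.kotlin.RealmConfiguration

// Assumption: maxNumberOfActiveVersions() exists on the Kotlin builder, mirroring
// Realm Java. With a cap in place, a write throws once too many versions are pinned,
// which surfaces the problem much earlier than an eventual OOM or a huge file.
val config = RealmConfiguration.Builder(schema = setOf(Pressure::class))
    .maxNumberOfActiveVersions(100) // fail fast if more than 100 versions are alive
    .build()
val realm = Realm.open(config)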

cmelchior commented 5 months ago

Surprisingly, the above run did eventually seem to catch up:

[image]

But it then crashed because it went OOM:

java.lang.OutOfMemoryError: Failed to allocate a 24 byte allocation with 1744152 free bytes and 1703KB until OOM, target footprint 201326592, growth limit 201326592; failed due to fragmentation (largest possible contiguous allocation 0 bytes). Number of 256KB sized free regions are: 0
    at com.oliverclimbs.realmtest.ui.home.HomeViewModel$1$1.invokeSuspend(HomeViewModel.kt:67)
    at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
    at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
    at kotlinx.coroutines.internal.LimitedDispatcher$Worker.run(LimitedDispatcher.kt:115)
    at kotlinx.coroutines.scheduling.TaskImpl.run(Tasks.kt:100)
    at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:584)
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:793)
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:697)
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:684)
    Suppressed: kotlinx.coroutines.internal.DiagnosticCoroutineContextException: [StandaloneCoroutine{Cancelling}@85dc4cc, Dispatchers.IO]

cmelchior commented 5 months ago

After looking more into this, it looks like this behavior can be explained by our RealmFinalizer thread either not keeping up or pointers simply not being GC'ed in time.
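
For context, the JVM side of this works roughly like the pattern below: each native pointer handed to Kotlin is wrapped in a phantom reference registered on a shared ReferenceQueue, and a daemon thread drains that queue and frees the native memory once the JVM GC clears the wrapper. This is a simplified sketch with a counter added, not the SDK's actual RealmFinalizer code:

import java.lang.ref.PhantomReference
import java.lang.ref.ReferenceQueue
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicLong

// Simplified illustration of a phantom-reference finalizer; all names are made up.
class PointerRef(
    referent: Any,
    queue: ReferenceQueue<Any>,
    val nativePtr: Long
) : PhantomReference<Any>(referent, queue)

object FinalizerSketch {
    // Pointers tracked but not yet freed -- the counter that keeps growing in the test.
    val pending = AtomicLong(0)

    private val queue = ReferenceQueue<Any>()
    // The reference objects themselves must stay strongly reachable until processed.
    private val alive = ConcurrentHashMap.newKeySet<PointerRef>()

    private val daemon = Thread {
        while (true) {
            val ref = queue.remove() as PointerRef // blocks until the GC enqueues a cleared wrapper
            alive.remove(ref)
            // ...free the native memory behind ref.nativePtr here...
            pending.decrementAndGet()
        }
    }.apply { isDaemon = true; name = "FinalizerSketch"; start() }

    fun track(owner: Any, nativePtr: Long) {
        pending.incrementAndGet()
        alive.add(PointerRef(owner, queue, nativePtr))
    }
}

If allocations outpace both the JVM GC and this drain loop, the pending count (and with it the set of pinned Realm versions) only goes up.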

By adding an atomic counter to the finalizer, I was able to show that the reference queue keeps growing with the following code:

    GlobalScope.launch(Dispatchers.Default) {
      while(true) {
        val p: Float = Random.nextFloat()
        val now = Instant.now()
        viewModelScope.launch(Dispatchers.IO) {
            realm.write<Unit> {
              copyToRealm(Pressure().apply {
                timestamp = now.toEpochMilli()
                hPa = p
              })
            }
        }
        withContext(Dispatchers.Main) {
          _text.postValue("Version: ${realm.getNumberOfActiveVersions()}")
        }
      }
    }

I could see incremental bursts of things being GC'ed, but the overall trend was that the queue kept growing and growing. Just pausing the writes didn't help either. My guess is that the memory allocator didn't consider the many thousands of NativePointers important enough to GC.

The result of this would be that either 1) we run out of disk space because the Realm file keeps growing (because of unclaimed versions), or 2) we go OOM because we exhaust the native memory space.

Only by stopping the writes and then manually calling the GC was I able to empty the queue.
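
The manual procedure looks roughly like this (a diagnostic sketch, not a fix; System.gc() is only a hint to the JVM):

import io.realm.kotlin.Realm
import kotlinx.coroutines.delay

// Pause all writes first, then ask the JVM to collect so cleared NativePointer
// wrappers reach the finalizer queue, and check how many versions are still pinned.
suspend fun logVersionsAfterGc(realm: Realm) {
    System.gc()
    delay(1_000) // give the finalizer daemon time to drain its queue
    println("Active versions after GC: ${realm.getNumberOfActiveVersions()}")
}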

This is not ideal in "fast write"-scenarios like listening to sensor updates.

I tried to modify our GC thread to have max priority. This seemed to help a little bit, but the queue of pointers was still growing.

So right now I guess that for these scenarios, we need some sort of "allocation-free" insert, or at least an insert that automatically cleans up as soon as the write is completed.
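
Until something like that exists, one interim mitigation (a sketch, assuming the write rate rather than the payload size is the problem) is to buffer incoming samples and commit them in batches, so that one transaction, and therefore one new version and one set of native pointers, covers many samples:

import io.realm.kotlin.Realm
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.channels.Channel
import kotlinx.coroutines.launch

// Buffer sensor samples in a channel and write them in batches. The channel name,
// batch size and the Pressure model are taken from the repro above; tune as needed.
val samples = Channel<Pressure>(capacity = Channel.UNLIMITED)

fun CoroutineScope.startBatchedWriter(realm: Realm) = launch(Dispatchers.IO) {
    val batch = mutableListOf<Pressure>()
    for (first in samples) {                // suspends until at least one sample arrives
        batch += first
        while (batch.size < 100) {          // drain whatever else is already queued
            batch += samples.tryReceive().getOrNull() ?: break
        }
        realm.write {                       // one transaction / one new version per batch
            batch.forEach { copyToRealm(it) }
        }
        batch.clear()
    }
}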

In Realm Java we have a bulk insert method called insert() that does this, and it is tracked for Kotlin here: https://github.com/realm/realm-kotlin/issues/959. I would guess this kind of method would also fix the problem described in this issue.

cmelchior commented 5 months ago

A solution to this problem is most likely something like: https://github.com/realm/realm-kotlin/issues/959

OluwoleOyetoke commented 1 month ago

Any chance fixing this issue will be prioritized soon? @cmelchior, did you figure out any workaround in the meantime?