Off heap memory usage when building realm dbs increases steadily

christianblust commented 1 year ago

How frequently does the bug occur?

Always

Description

Hi all,

I am not sure if this problem is a bug or if I am using the library wrong in some way. However, here is my use case. I create multiple (currently around 15-20) realm databases in a job in my Spring-Boot 3 backend:

1. Stream CSV blobs from Azure Storage (8 MB per file, 40 files per realm db).
2. Map the CSV files to RealmObjects.
3. Write the data to realm.
4. Close the realm, call Realm.compactRealm(config) to reduce numberOfVersions to 1, and delete the lock files.
5. Zip the realm file and persist it.
6. Apps can download the prefilled realm dbs when data updates occur.

Essentially I have two "problems".

First, I noticed that my heap usage increases with each iteration until the Java process is finally killed (exit code 137, which I suppose is an OOM kill). I can "mitigate" this a bit by calling System.gc(), which of course is not ideal.

Second, even with the System.gc() workaround, I notice that the RSS of the Java process on a Linux Azure App Service grows way beyond my specified max heap. So I guess there is some issue with off-heap memory? I do not have any expertise with this, but maybe someone has an idea.

Every 1.0s: ps aux --sort=-%mem | head -n 11                                                                                                      abc: Thu May 11 10:04:18 2023

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root        48 70.8 83.3 9181524 6789168 ?     SNl  08:10  80:24 java -Xms1g -Xmx4g -jar /home/site/wwwroot/b/app.jar --server.port=80

I am not sure how I can really debug this on my end. I did enable the NativeMemoryTracking and generated a report which I will append.
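One way to follow the gap between heap and RSS over time is to log both from inside the process after each iteration. The following is a minimal sketch, assuming a Linux host where /proc/self/status is readable. The helper names (parseVmRssKb, currentRssKb, memorySnapshot) are hypothetical and not part of Realm or Spring.

```kotlin
import java.io.File

/** Extracts the VmRSS value (in KB) from the text of /proc/<pid>/status; null if the line is absent. */
fun parseVmRssKb(statusText: String): Long? =
    statusText.lineSequence()
        .firstOrNull { it.startsWith("VmRSS:") }
        ?.removePrefix("VmRSS:")
        ?.trim()
        ?.removeSuffix("kB")
        ?.trim()
        ?.toLongOrNull()

/** Resident set size of the current process on Linux; null where /proc is unavailable. */
fun currentRssKb(): Long? {
    val status = File("/proc/self/status")
    return if (status.canRead()) parseVmRssKb(status.readText()) else null
}

/** One log line comparing the committed Java heap with the process RSS. */
fun memorySnapshot(label: String): String {
    val heapKb = Runtime.getRuntime().totalMemory() / 1024
    val rssKb = currentRssKb() ?: return "$label heapCommittedKb=$heapKb rssKb=n/a"
    return "$label heapCommittedKb=$heapKb rssKb=$rssKb gapKb=${rssKb - heapKb}"
}
```

Logging memorySnapshot(prefix) after each database build would show whether the gap grows per realm or stays flat.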

Code:

azureStorage.getPrefixes().forEach { prefix ->
    val blobs = azureStorage.listBlobsForPrefix(prefix)
    val realmConfig = realmService.createRealmConfig(prefix)
    val realm = realmService.createRealm(realmConfig)
    realmService.writeStaticDataToRealm(realm, staticData1, staticData2, staticData3)
    blobs.forEach { blobPath ->
        val csvLines = convertCSVBytesToList(
            azureStorage.downloadBlobToInputStream(blobPath),
            CsvProperties.Separators.semicolon
        ).chunked(CSV_CHUNK_SIZE)
        csvLines.forEach { csvChunk ->
            val mappedData = factory.mapCsvToBusinessObject(csvChunk)
            realmService.saveToRealm(realm, factory.createRealmObjects(mappedData), UpdatePolicy.ALL)
        }
    }
    realmService.saveRealm(realm, realmConfig)
}

RealmService:

    fun saveToRealm(realm: Realm, realmObjects: List<RealmObject>, updatePolicy: UpdatePolicy = UpdatePolicy.ERROR) {
        realm.writeBlocking { realmObjects.forEach { this.copyToRealm(it, updatePolicy) } }
        //System.gc()
    }

Calling this with lots of RealmObjects and UpdatePolicy.ALL is probably not very good for performance: it performs many "unnecessary" updates when inserting RealmObjects that already exist. But I do not think there is currently a better way to do this with Kotlin? I saw that other SDKs have different update policies for such use cases.

    fun saveRealm(realm: Realm, realmConfig: RealmConfiguration) {
        realm.close()
        Realm.compactRealm(realmConfig)

        // zip realmfile and save it to gridfs
        // deletes lock file & call Realm.close(config)
        cleanUpFiles(realmConfig)
        //System.gc()
    }

As I said, I am not sure if this is a bug or if my implementation is faulty. Any tips in the right direction would be greatly appreciated!

Stacktrace & log output

Native Memory Tracking:

(Omitting categories weighting less than 1KB)

Total: reserved=6996920KB, committed=2671224KB
       malloc: 114608KB #709425
       mmap:   reserved=6882312KB, committed=2556616KB

-                 Java Heap (reserved=5242880KB, committed=2097152KB)
                            (mmap: reserved=5242880KB, committed=2097152KB)

-                     Class (reserved=2166KB, committed=2166KB)
                            (classes #22459)
                            (  instance classes #21124, array classes #1335)
                            (malloc=2166KB #49758)
                            (  Metadata:   )
                            (    reserved=98304KB, committed=89728KB)
                            (    used=88988KB)
                            (    waste=740KB =0.83%)
                            (  Class space:)
                            (    reserved=1048576KB, committed=14464KB)
                            (    used=13991KB)
                            (    waste=473KB =3.27%)

-                    Thread (reserved=215631KB, committed=215631KB)
                            (thread #130)
                            (stack: reserved=215288KB, committed=215288KB)
                            (malloc=192KB #779)
                            (arena=151KB #258)

-                      Code (reserved=52120KB, committed=31288KB)
                            (malloc=2584KB #15151)
                            (mmap: reserved=49536KB, committed=28704KB)

-                        GC (reserved=261624KB, committed=145176KB)
                            (malloc=33960KB #26950)
                            (mmap: reserved=227664KB, committed=111216KB)

-                  Compiler (reserved=375KB, committed=375KB)
                            (malloc=245KB #1297)
                            (arena=131KB #3)

-                  Internal (reserved=760KB, committed=760KB)
                            (malloc=728KB #5118)
                            (mmap: reserved=32KB, committed=32KB)

-                     Other (reserved=20732KB, committed=20732KB)
                            (malloc=20732KB #55)

-                    Symbol (reserved=25854KB, committed=25854KB)
                            (malloc=22554KB #594758)
                            (arena=3300KB #1)

-    Native Memory Tracking (reserved=11357KB, committed=11357KB)
                            (malloc=272KB #3916)
                            (tracking overhead=11085KB)

-               Arena Chunk (reserved=458KB, committed=458KB)
                            (malloc=458KB)

-                   Tracing (reserved=14570KB, committed=14570KB)
                            (malloc=14538KB #202)
                            (arena=32KB #1)

-                    Module (reserved=517KB, committed=517KB)
                            (malloc=517KB #3377)

-                 Safepoint (reserved=32KB, committed=32KB)
                            (mmap: reserved=32KB, committed=32KB)

-           Synchronization (reserved=393KB, committed=393KB)
                            (malloc=393KB #6796)

-            Serviceability (reserved=3KB, committed=3KB)
                            (malloc=3KB #16)

-                 Metaspace (reserved=98754KB, committed=90178KB)
                            (malloc=450KB #285)
                            (mmap: reserved=98304KB, committed=89728KB)

-      String Deduplication (reserved=1KB, committed=1KB)
                            (malloc=1KB #8)

-           Object Monitors (reserved=117KB, committed=117KB)
                            (malloc=117KB #574)

-                   Unknown (reserved=1048576KB, committed=14464KB)
                            (mmap: reserved=1048576KB, committed=14464KB)
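For orientation, the NMT total can be set against the RSS from the ps output earlier in this report. NMT only accounts for memory that the JVM itself allocates, so allocations made by native libraries loaded through JNI do not appear in it. The following is a small worked calculation, assuming both figures are in KB and were captured at roughly the same time (the report does not confirm the latter). The function names are hypothetical.

```kotlin
/** RSS that is not explained by the memory NMT reports as committed, in KB. */
fun untrackedKb(rssKb: Long, nmtCommittedKb: Long): Long = rssKb - nmtCommittedKb

/** Converts KB to GiB for readability. */
fun kbToGib(kb: Long): Double = kb / (1024.0 * 1024.0)

// With the figures from this report:
//   untrackedKb(6789168, 2671224) is 4117944 KB, i.e. roughly 3.9 GiB outside NMT's view.
```

Under those assumptions, most of the RSS would sit outside anything the JVM tracks, which is consistent with the suspicion of off-heap growth but does not by itself identify the allocator.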

Can you reproduce the bug?

Always

Reproduction Steps

No response

Version

1.8

What Atlas App Services are you using?

Local Database only

Are you using encryption?

No

Platform OS and version(s)

Linux, openjdk 17

Build environment

Spring Boot 3.0.4 Kotlin 1.8.0 Realm-Kotlin 1.8.0

clementetb commented 1 year ago

Thanks for the report. We are investigating what might be the cause. While we do, would you mind running a test where you generate the data rather than reading it from a CSV?

We would like to validate that the issue is in Realm and not in the CSV processing.

christianblust commented 1 year ago

Sorry for the late response, I am currently on vacation (until the 23rd of May). I had no time to generate comparable data; however, I did run some tests. A script records the RSS size of the Java app every 10, along with a VisualVM session.

Baseline: Download data, map data, but without any realm interaction: RealmBuild_baseline_nosavingtorealm.zip

With the baseline test you can see that RSS does not exceed the current heap size by much (200-300 MB). Activity Monitor on Mac shows a difference of just roughly 30 MB between "Memory" and "Real Memory" of the Java process.

Download data, map data, write data to realm, but no zipping or persisting of the realm file: RealmBuild_no_static_data_no_zip.zip

With realm interaction (saving mapped objects) the RSS exceeds current heap size by around 1gb

Download data, map data, write data to realm, zip the realm file, persist the realm file: RealmBuild_with_static_with_zip.zip

The full run also shows higher memory usage, again with a large gap between RSS and heap size.

christianblust commented 1 year ago

Hello again, I was wondering if you could provide your insights on this particular topic? We're quite interested in utilizing realm-kotlin for our project, but at this time, we have reservations about deploying it in a production environment. Although we're not under any pressing time constraints, it would be helpful to know if there's a possibility of addressing these concerns on your end. If that's not feasible, we might need to reconsider our design strategy. Naturally, I don't expect a definitive answer at this point, just seeking your thoughts.

cmelchior commented 1 year ago

Hi @christianblust

We are still investigating this, but it is a bit tricky to reproduce.

E.g. here is the memory dump of writing 10 realms, each containing 500 objects of 1 MB each.

    @Test
    fun memoryLeak() = runBlocking {
        println("Start")
        repeat(10) { noRealms ->
            val configuration =
                RealmConfiguration.Builder(schema = setOf(Sample::class))
                    .name("realm${noRealms}.realm")
                    .directory(tmpDir)
                    .build()
            val realm = Realm.open(configuration)
            var iteration = 0
            repeat(100) {
                realm.write {
                    repeat(5) {
                        copyToRealm(Sample())
                    }
                }
                delay(10.milliseconds)
                println("Iteration ${++iteration}")
            }
            realm.close()
            Realm.compactRealm(configuration)
        }
    }

As you can see, the memory usage is relatively stable.

However, that memory dump is from Android, and we know from testing that the GC on Android is much more aggressive than on the JVM. In general, what can be expensive is doing many small writes, since that can lead to version pinning (which is normally what blows up the DB size), but judging from your code, you are only doing one write?

The reason GC is important is that we currently depend on it to release certain native resources, so there is a chance that what you are seeing is just a consequence of slower GCs. realm.close() will release the "main" resources, but all objects allocated must still be GC'ed.
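The dependency on GC described here can be illustrated with a generic JVM sketch. It uses java.lang.ref.Cleaner as a stand-in; this is not a description of how Realm Kotlin is implemented internally, and NativeHandle and liveNativeHandles are hypothetical names. The point is only that a resource tied to object reachability is freed either by an explicit call or whenever the collector eventually processes the unreachable wrapper.

```kotlin
import java.lang.ref.Cleaner
import java.util.concurrent.atomic.AtomicInteger

// Stand-in for a count of live native allocations; hypothetical, not part of Realm.
val liveNativeHandles = AtomicInteger(0)

val handleCleaner: Cleaner = Cleaner.create()

/** Wrapper whose "native" resource is released on close() or, failing that, some time after GC. */
class NativeHandle : AutoCloseable {
    private val cleanable: Cleaner.Cleanable

    init {
        liveNativeHandles.incrementAndGet()
        // The cleanup action must not capture `this`; otherwise the wrapper never becomes unreachable.
        cleanable = handleCleaner.register(this) { liveNativeHandles.decrementAndGet() }
    }

    /** Deterministic release; the action runs at most once even if GC later finds the wrapper. */
    override fun close() = cleanable.clean()
}
```

With a large max heap the collector may run rarely, so wrappers that are never closed explicitly can keep their native memory alive long after they are unreachable. That would match the observation that System.gc() reduces the growth.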

For the case of creating objects, we are looking into adding an insert() method similar to Realm Java. This should be much better suited for this kind of "bulk-insert" workload; however, it isn't clear if that is actually the problem you are running into.

Also, a note about your code: if you are doing all inserts in a single write, using compactRealm is not going to help you much. Instead, it will actually increase your memory usage, since we do the compaction in a copy of the data before overwriting the original file. These are the "spikes" you see in my graph.
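Whether compaction pays off for a given build can be checked by comparing the file size before and after. The following is a minimal sketch with hypothetical names (CompactionResult, measureCompaction); the compaction call is passed in as a lambda so the helper itself does not depend on Realm.

```kotlin
import java.io.File

/** Outcome of one compaction attempt, with file sizes in bytes. */
data class CompactionResult(val compacted: Boolean, val beforeBytes: Long, val afterBytes: Long) {
    val savedBytes: Long get() = beforeBytes - afterBytes
}

/** Runs the given compaction step and records the size of realmFile before and after it. */
fun measureCompaction(realmFile: File, compact: () -> Boolean): CompactionResult {
    val before = realmFile.length()
    val ok = compact()
    return CompactionResult(ok, before, realmFile.length())
}
```

Assuming the configuration exposes the file location as realmConfig.path and that Realm.compactRealm returns a Boolean, a call could look like measureCompaction(File(realmConfig.path)) { Realm.compactRealm(realmConfig) }. If savedBytes stays near zero across builds, skipping the compaction step would avoid the temporary copy.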

Also, just to rule out an accidental memory leak in one of our data structures, is there a chance you can share your model class(es) so we have some idea what your object graph and properties look like?

christianblust commented 1 year ago

Hello @rorbech, @cmelchior,

I appreciate your detailed response!

The saveToRealm function is invoked at least n times for n csv files. A single database is built using ~40 csv files. To reduce load, I've implemented chunking, so the function is likely called even more times.
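To put a number on "even more times": each chunk results in one writeBlocking call, so the count of write transactions per database follows from the line counts and the chunk size. The following is a small sketch. The function name is hypothetical, and since the thread does not state the value of CSV_CHUNK_SIZE or the line counts, any numbers plugged in are assumptions.

```kotlin
/** Number of write transactions when each file's lines are split into chunks of chunkSize. */
fun writeTransactions(linesPerFile: List<Int>, chunkSize: Int): Int {
    require(chunkSize > 0) { "chunkSize must be positive" }
    // Ceiling division per file: a partially filled last chunk still costs one transaction.
    return linesPerFile.sumOf { lines -> (lines + chunkSize - 1) / chunkSize }
}
```

For instance, 40 files of 10,000 lines each with a chunk size of 1,000 would give 400 transactions per database, which is far from the single write assumed earlier in the thread.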

I'm open to sharing the model classes. Could we possibly communicate this through a private channel? Currently, I'm utilizing 27 classes that comprise various data structures, including RealmInstant, RealmList, ByteArray, etc.

I would like to point out that I have two classes that look like this:

class QuantityRealmObject() : RealmObject {
    @PrimaryKey
    var id = ""
    var value: Double = 0.0
    var rawUnit: String = ""
    constructor(
        value: Double,
        rawUnit: String,
    ) : this() {
        this.id = "$value$rawUnit"
        this.value = value
        this.rawUnit = rawUnit
    }

    override fun equals(other: Any?): Boolean {
        if (this === other) return true
        if (javaClass != other?.javaClass) return false

        other as QuantityRealmObject

        if (id != other.id) return false
        if (value != other.value) return false
        if (rawUnit != other.rawUnit) return false

        return true
    }

    override fun hashCode(): Int {
        var result = id.hashCode()
        result = 31 * result + value.hashCode()
        result = 31 * result + rawUnit.hashCode()
        return result
    }
}

and

class ConsumptionRealmObject() : RealmObject {
    @PrimaryKey
    var id: String = ""
    var min: QuantityRealmObject? = null
    var max: QuantityRealmObject? = null

    constructor(
        min: QuantityRealmObject?,
        max: QuantityRealmObject?
    ) : this() {
        this.id = "${min?.value}${min?.rawUnit}_${max?.value}${max?.rawUnit}"
        this.min = min
        this.max = max
    }

    override fun equals(other: Any?): Boolean {
        if (this === other) return true
        if (javaClass != other?.javaClass) return false

        other as ConsumptionRealmObject

        if (id != other.id) return false
        if (min != other.min) return false
        if (max != other.max) return false

        return true
    }

    override fun hashCode(): Int {
        var result = id.hashCode()
        result = 31 * result + (min?.hashCode() ?: 0)
        result = 31 * result + (max?.hashCode() ?: 0)
        return result
    }
}

These classes are widely used across many model classes. While parsing CSV data, it's possible that several Quantity or Consumption RealmObjects share the same PrimaryKey, due to the limited number of distinct objects. This, however, depends on the data provided in the CSV files. Using UpdatePolicy.ERROR, as expected, triggers an IntegrityException. As a workaround, I've been using UpdatePolicy.ALL, which feels a bit odd, but it seems to work.
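One possible way to cut down on redundant upserts is to drop objects whose primary key was already written earlier in the same build. The following is a minimal sketch with hypothetical names: Quantity is a plain stand-in for QuantityRealmObject so the example runs without Realm, and firstOccurrences is not a Realm API. It assumes that two objects with the same key carry identical content, which holds here because the id is derived from value and rawUnit.

```kotlin
// Plain stand-in for QuantityRealmObject; the id is built the same way as in the model class.
data class Quantity(val value: Double, val rawUnit: String) {
    val id: String get() = "$value$rawUnit"
}

/** Keeps only items whose key has not been seen before and records the new keys in seenKeys. */
fun <T> firstOccurrences(items: List<T>, seenKeys: MutableSet<String>, key: (T) -> String): List<T> =
    items.filter { seenKeys.add(key(it)) }
```

Passing the same seenKeys set for every chunk of one database build would filter duplicates across chunks. This only applies to objects saved at the top level; objects reached through references such as min and max would still be copied along with their parent, so UpdatePolicy.ALL would likely remain necessary there. The set of keys also costs heap for the duration of the build.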

Another observation I made using Visual VM is that each database build spawns a "running" thread called "core-notifier". For instance, after a run that built 15 databases, Visual VM shows 15 running (green) core-notifier threads. I'm unsure if this is relevant in any way.
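The same observation can be captured without VisualVM by counting live threads by name from inside the process. The following is a minimal sketch; the function name is hypothetical, and the "core-notifier" fragment is taken from the thread name reported above.

```kotlin
/** Counts live JVM threads whose name contains the given fragment. */
fun liveThreadsNamed(fragment: String): Int =
    Thread.getAllStackTraces().keys.count { it.isAlive && it.name.contains(fragment) }
```

Logging liveThreadsNamed("core-notifier") after each database build would show whether the count rises by one per realm, as seen in VisualVM.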

rorbech commented 1 year ago

@christianblust Thanks for the feedback. You can share your model classes to realm-help@mongodb.com.

The core-notifier shouldn't hold on to any of the resources in question, so shouldn't be the root of this. But I guess we never know before we can replicate your observations. We are currently not able to close these threads when closing the realm, but we have an issue adding the required hooks in core (https://github.com/realm/realm-core/issues/6429).

christianblust commented 1 year ago

Alright! I will get back with my colleagues and send you the files starting next week.

Would it be possible to check if this behaves differently with realm-java? I can't really get it working using the official documentation with my project setup, which is a Kotlin, Spring Boot, Gradle (Kotlin syntax) backend.

christianblust commented 1 year ago

I just sent the model classes to the mentioned mail. As stated in the mail:

I tested using EmbeddedRealmObjects instead of the current approach with @PrimaryKey and UpdatePolicy.ALL. Naturally, this results in a larger realm db file. However, I don’t have a “leak” using this approach.

Hope this helps!

christianblust commented 1 year ago

Hi @rorbech ,

I've noticed recent PRs addressing JVM leaks and was wondering if these could potentially resolve the issues we've been encountering. Do you have some information on their relevance and possible release timeline? Best, Chris

rorbech commented 1 year ago

Hi @christianblust. Yes, we have identified some leaks and have merged fixes for them. The fixes are available in the 1.11.2-SNAPSHOT (https://github.com/realm/realm-kotlin#using-snapshots), so it would be nice to get some feedback on whether you see a difference. There is no exact date for the final release, but it should be within a week or so.

christianblust commented 1 year ago

Would love to test! However, I cannot get the Gradle configuration to work. I have a simple Spring Boot backend service and no Android project. I don't have a "global" and a "module" build.gradle file, just one. I also must set the plugin version in the plugins block. I always get the error "plugin version not found". I also tried setting the snapshot repo as a plugin repo in settings.gradle. But if the final release is just a week away, I will definitely try it and give some feedback.
