MongoDB slow for large databases

eimrek commented 1 year ago

I spent some time investigating the slowness of the li-ion-conductors optimade database.

Here's the info endpoint: https://dev-optimade.materialscloud.org/archive/li-ion-conductors/v1/info

Accessing the /structures endpoint takes over 2 minutes.

I turned on performance profiling (db.setProfilingLevel(1)) and here's the part of the log for accessing the /structures endpoint (only the slow commands should show up here, meaning everything else was fast, i think):

/structures profiling (click to show)

``` > db.system.profile.find().pretty() { "op" : "command", "ns" : "li-ion-conductors.structures", "command" : { "aggregate" : "structures", "pipeline" : [ { "$match" : { } }, { "$group" : { "_id" : 1, "n" : { "$sum" : 1 } } } ], "cursor" : { }, "lsid" : { "id" : UUID("f822cd93-f176-4405-8c3a-062a5c7e79d8") }, "$db" : "li-ion-conductors" }, "keysExamined" : 0, "docsExamined" : 4396695, "cursorExhausted" : true, "numYield" : 7910, "nreturned" : 1, "locks" : { "FeatureCompatibilityVersion" : { "acquireCount" : { "r" : NumberLong(7913) } }, "ReplicationStateTransition" : { "acquireCount" : { "w" : NumberLong(7913) } }, "Global" : { "acquireCount" : { "r" : NumberLong(7913) } }, "Database" : { "acquireCount" : { "r" : NumberLong(7912) } }, "Collection" : { "acquireCount" : { "r" : NumberLong(7912) } }, "Mutex" : { "acquireCount" : { "r" : NumberLong(2) } } }, "flowControl" : { }, "storage" : { "data" : { "bytesRead" : NumberLong("17760670501"), "timeReadingMicros" : NumberLong(135074290) } }, "responseLength" : 141, "protocol" : "op_msg", "millis" : 139328, "planSummary" : "COLLSCAN", "ts" : ISODate("2023-08-22T14:51:08.005Z"), "client" : "172.18.0.1", "allUsers" : [ ], "user" : "" } ```

Some key points:

Basically, what seems to be the slow part, is just the counting of the total number of structures. ($match everything in the collection; group by $_id and then just sum the number).
it takes 139328 milliseconds to run
almost all of the time (135074 milliseconds) is spent on reading 17 GB of data from disk.
This command has to do a full scan of the documents (COLLSCAN). I don't think this command, in it's current state, could be sped up by using indexes.

I am wondering if this functionality could be implemented in a more efficient way. For example, db.structures.count() runs instantly.

Just for additional information, initially also accessing a single structure was as slow (2+ min), e.g. via

https://dev-optimade.materialscloud.org/archive/li-ion-conductors/v1/structures/5b5b4b01-5b7e-48ad-8e17-8077f9b0b5d2

But after I added the id index with db.structures.createIndex({ id: 1 }), it's fast now.

Pinging @ml-evs @unkcpz @superstar54 for comments/ideas regarding the "counting" speedup. i suspect this is probably something that should be adapted in https://github.com/Materials-Consortia/optimade-python-tools?

ml-evs commented 1 year ago

Yeah, the counting is a pain but I don't see an easy way around it. Each query needs to know the total number of structures it returns in the response, according to the spec. The only speed-up we could do would be to implement some session-based cursor so that we don't have to repeat queries with a new skip every request (and thus don't have to repeat the count).

As the data_returned field is only a "SHOULD" in the specification, we could add a config option in optimade-python-tools that disables it (I can make a PR for this and we can test it).

ml-evs commented 1 year ago

You can try installing from https://github.com/Materials-Consortia/optimade-python-tools/pull/1757 and using the CONFIG.elide_data_returned = True setting to disable the count

ml-evs commented 1 year ago

* Basically, what seems to be the slow part, is just the counting of the total number of structures. (`$match` everything in the collection; group by `$_id` and then just sum the number).

* This command has to do a full scan of the documents (`COLLSCAN`). I don't think this command, in it's current state, could be sped up by using indexes.

I am wondering if this functionality could be implemented in a more efficient way. For example, db.structures.count() runs instantly.

Reading this again, do you just mean that count() without a filter is very fast (and thus just listing all structures without a filter should be too), or do you mean that the count step itself is fast? It might instead be that the COLLSCAN comes from a sort on an unindexed field

eimrek commented 1 year ago

Thanks for the comments and the modification https://github.com/Materials-Consortia/optimade-python-tools/pull/1757. I'll test it when i have a moment.

Regarding your last question, just to explain it a bit more:

if I run

> db.structures.aggregate( [
...    { "$match": {} },
...    { "$group": { "_id": 1,"n": { $sum: 1 } } }
... ] )

{ "_id" : 1, "n" : 4396695 }
>

You see that it just outputs the count. this takes 2+ minutes.

If i do

> db.structures.count()
4396695

I get the same result immediately.

Of course, for more complicated queries, this probably doesn't hold.

Regarding if this is the only slow part, i'm not sure. Nothing else was in the profiling log. I suspect that if count() was used, then the whole /structures endpoint would be fast, but I'm not 100% sure. I guess we'll find out when I test https://github.com/Materials-Consortia/optimade-python-tools/pull/1757.

eimrek commented 1 year ago

Hi @ml-evs, the changes in https://github.com/Materials-Consortia/optimade-python-tools/pull/1757 indeed made the API fast, see here:

https://dev-optimade.materialscloud.org/archive/li-ion-conductors/v1/structures

However, there seems to be a small issue, the "links": ""next" data seems to be missing, and when doing

https://dev-optimade.materialscloud.org/archive/li-ion-conductors/v1/structures?page_offset=25

It shows the "links": ""next" as the first value, but to the current page, instead of the next one.

eimrek commented 12 months ago

Fixed by https://github.com/Materials-Consortia/optimade-python-tools/pull/1757

materialscloud-org / optimade-maker

MongoDB slow for large databases #24