Closed eimrek closed 12 months ago
Yeah, the counting is a pain but I don't see an easy way around it. Each query needs to know the total number of structures it returns in the response, according to the spec. The only speed-up we could do would be to implement some session-based cursor so that we don't have to repeat queries with a new skip every request (and thus don't have to repeat the count).
As the data_returned
field is only a "SHOULD" in the specification, we could add a config option in optimade-python-tools that disables it (I can make a PR for this and we can test it).
You can try installing from https://github.com/Materials-Consortia/optimade-python-tools/pull/1757 and using the CONFIG.elide_data_returned = True
setting to disable the count
* Basically, what seems to be the slow part, is just the counting of the total number of structures. (`$match` everything in the collection; group by `$_id` and then just sum the number). * This command has to do a full scan of the documents (`COLLSCAN`). I don't think this command, in it's current state, could be sped up by using indexes.
I am wondering if this functionality could be implemented in a more efficient way. For example,
db.structures.count()
runs instantly.
Reading this again, do you just mean that count()
without a filter is very fast (and thus just listing all structures without a filter should be too), or do you mean that the count step itself is fast? It might instead be that the COLLSCAN
comes from a sort on an unindexed field
Thanks for the comments and the modification https://github.com/Materials-Consortia/optimade-python-tools/pull/1757. I'll test it when i have a moment.
Regarding your last question, just to explain it a bit more:
if I run
> db.structures.aggregate( [
... { "$match": {} },
... { "$group": { "_id": 1,"n": { $sum: 1 } } }
... ] )
{ "_id" : 1, "n" : 4396695 }
>
You see that it just outputs the count. this takes 2+ minutes.
If i do
> db.structures.count()
4396695
I get the same result immediately.
Of course, for more complicated queries, this probably doesn't hold.
Regarding if this is the only slow part, i'm not sure. Nothing else was in the profiling log. I suspect that if count()
was used, then the whole /structures
endpoint would be fast, but I'm not 100% sure. I guess we'll find out when I test https://github.com/Materials-Consortia/optimade-python-tools/pull/1757.
Hi @ml-evs, the changes in https://github.com/Materials-Consortia/optimade-python-tools/pull/1757 indeed made the API fast, see here:
https://dev-optimade.materialscloud.org/archive/li-ion-conductors/v1/structures
However, there seems to be a small issue, the "links": ""next"
data seems to be missing, and when doing
https://dev-optimade.materialscloud.org/archive/li-ion-conductors/v1/structures?page_offset=25
It shows the "links": ""next"
as the first value, but to the current page, instead of the next one.
I spent some time investigating the slowness of the
li-ion-conductors
optimade database.Here's the info endpoint: https://dev-optimade.materialscloud.org/archive/li-ion-conductors/v1/info
Accessing the
/structures
endpoint takes over 2 minutes.I turned on performance profiling (
db.setProfilingLevel(1)
) and here's the part of the log for accessing the/structures
endpoint (only the slow commands should show up here, meaning everything else was fast, i think):/structures profiling (click to show)
``` > db.system.profile.find().pretty() { "op" : "command", "ns" : "li-ion-conductors.structures", "command" : { "aggregate" : "structures", "pipeline" : [ { "$match" : { } }, { "$group" : { "_id" : 1, "n" : { "$sum" : 1 } } } ], "cursor" : { }, "lsid" : { "id" : UUID("f822cd93-f176-4405-8c3a-062a5c7e79d8") }, "$db" : "li-ion-conductors" }, "keysExamined" : 0, "docsExamined" : 4396695, "cursorExhausted" : true, "numYield" : 7910, "nreturned" : 1, "locks" : { "FeatureCompatibilityVersion" : { "acquireCount" : { "r" : NumberLong(7913) } }, "ReplicationStateTransition" : { "acquireCount" : { "w" : NumberLong(7913) } }, "Global" : { "acquireCount" : { "r" : NumberLong(7913) } }, "Database" : { "acquireCount" : { "r" : NumberLong(7912) } }, "Collection" : { "acquireCount" : { "r" : NumberLong(7912) } }, "Mutex" : { "acquireCount" : { "r" : NumberLong(2) } } }, "flowControl" : { }, "storage" : { "data" : { "bytesRead" : NumberLong("17760670501"), "timeReadingMicros" : NumberLong(135074290) } }, "responseLength" : 141, "protocol" : "op_msg", "millis" : 139328, "planSummary" : "COLLSCAN", "ts" : ISODate("2023-08-22T14:51:08.005Z"), "client" : "172.18.0.1", "allUsers" : [ ], "user" : "" } ```Some key points:
$match
everything in the collection; group by$_id
and then just sum the number).COLLSCAN
). I don't think this command, in it's current state, could be sped up by using indexes.I am wondering if this functionality could be implemented in a more efficient way. For example,
db.structures.count()
runs instantly.Just for additional information, initially also accessing a single structure was as slow (2+ min), e.g. via
https://dev-optimade.materialscloud.org/archive/li-ion-conductors/v1/structures/5b5b4b01-5b7e-48ad-8e17-8077f9b0b5d2
But after I added the
id
index withdb.structures.createIndex({ id: 1 })
, it's fast now.Pinging @ml-evs @unkcpz @superstar54 for comments/ideas regarding the "counting" speedup. i suspect this is probably something that should be adapted in https://github.com/Materials-Consortia/optimade-python-tools?