Could you summarize the GC issue here? Also, a question: some parts of the query reduction still occur in memory (`::first`, I think). Are those responsible for the large number of rows being returned? I.e., if we moved those parts of the query into SQL, could the number of rows be cut down?
So I don't think the issue is specifically with GC; rather, GC could be part of the solution. I also don't think moving those parts of the query into SQL would cut the number of rows down.
When doing JOINs across one-to-many relationships, the row counts of all the joined tables get multiplied together. Take a simple example where we have:
```
      project
      /     \
  plate     pool
```
If there is 1 project record, 3 plate records, and 4 pool records, Postgres will produce a joined result with 1 × 3 × 4 = 12 rows (a rough Sequel sketch follows the table):
| project | plate  | pool  |
|---------|--------|-------|
| proj1   | plate1 | pool1 |
| proj1   | plate1 | pool2 |
| proj1   | plate1 | pool3 |
| proj1   | plate1 | pool4 |
| proj1   | plate2 | pool1 |
| proj1   | plate2 | pool2 |
| proj1   | plate2 | pool3 |
| proj1   | plate2 | pool4 |
| proj1   | plate3 | pool1 |
| proj1   | plate3 | pool2 |
| proj1   | plate3 | pool3 |
| proj1   | plate3 | pool4 |
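In Sequel terms, the shape of the query is roughly the following. This is only a sketch: the table and column names are made up for illustration and are not Magma's actual schema.

```ruby
# One project joined to two independent one-to-many tables: the result is
# the Cartesian product, 1 x 3 x 4 = 12 rows, each materialized as a Hash.
rows = DB[:projects]
  .join(:plates, project_id: Sequel[:projects][:id])
  .join(:pools,  project_id: Sequel[:projects][:id])
  .select_all(:projects, :plates, :pools)
  .all

rows.length   # => 12
```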
Now imagine that we're doing this in a project with, say, 21 experiments, 39 documents, 3 cytof pools, 37 rna plates, and 12 sc pools. That's 21 × 39 × 3 × 37 × 12 = 1,090,908 rows with lots of duplicated data. I believe the way Sequel's postgres adapter handles this is that each row is allocated as a Hash in memory, and each column value as a String. So even though the text values are duplicates, each one is repeated many, many times in memory and adds up to a lot of total memory, as the memory profiling shows.
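As a quick way to see this concretely, something along these lines could be run against the generated dataset. The `memory_profiler` gem and the `query_dataset` name are my own stand-ins here, not anything in Magma today.

```ruby
require 'memory_profiler'

# Materialize the full join and report allocations: expect on the order of
# a million Hashes plus one String per column value, duplicates included.
report = MemoryProfiler.report do
  query_dataset.all
end
report.pretty_print
```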
Theoretically we might be able to tune the GC to clean out these rows after the query is done, but unless we switch to some sort of streaming data handling, our current architecture loads all the rows into memory and then serializes them into the query output format. Tuning GC just kicks the can down the road: we'd still see memory balloon while building the output, and only release it post-response via GC.
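For reference, a minimal sketch of what "clean up after the request" could look like, assuming a Rack-style middleware (the class name is hypothetical; `GC.compact` needs Ruby 2.7+). Note it only reclaims memory after the peak has already happened, which is the kicking-the-can problem.

```ruby
# Hypothetical Rack middleware: force a full GC (and heap compaction where
# available) after each response has been serialized.
class PostRequestGC
  def initialize(app)
    @app = app
  end

  def call(env)
    @app.call(env)
  ensure
    GC.start(full_mark: true, immediate_sweep: true)
    GC.compact if GC.respond_to?(:compact)
  end
end
```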
Perhaps some different approaches might be:
1) Stream the output, a la Metis `tail`. This would break how clients currently handle responses, but I think it would be better able to handle large cross-join type queries like this (see the sketch after this list).
2) Break up `ChildAttribute` and `TableAttribute` queries to join in Ruby instead of via SQL joins. This would reduce the Cartesian product effect, but probably introduces some other types of inefficiencies...
3) It might be possible to get the same effect via a set of subqueries (instead of JOINs), but I haven't thought through the details of what such a query would look like, whether it would work with how we do filtering, etc., or whether it would actually save us anything in terms of memory.
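A minimal sketch of what (1) could look like using the streaming support in `sequel_pg`, assuming the `pg_streaming` extension is available in our version; `query_dataset`, `response_stream`, and `serialize_row` are placeholder names, not existing Magma code.

```ruby
DB.extension(:pg_streaming)   # single-row mode, provided by sequel_pg

query_dataset.stream.each do |row|
  # Serialize and flush each row (or small batch) as it arrives, instead of
  # materializing ~1M Hashes before building the response body.
  response_stream << serialize_row(row)
end
```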
Great analysis. This lines up with some of my suspicions.
Stream processing is probably ideal no matter what. We should see how far we can push Sequel to give us the low-level Postgres adapter primitives that enable this kind of handling. It is theoretically possible, but the impact on our codebase could be large.
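One such primitive that Sequel's Postgres adapter already exposes is server-side cursors; a rough sketch, with `query_dataset` and `handle_row` as hypothetical stand-ins:

```ruby
# Fetch rows in batches via a server-side cursor, so only ~1000 row Hashes
# are live at any one time instead of the whole Cartesian product.
query_dataset.use_cursor(rows_per_fetch: 1000).each do |row|
  handle_row(row)
end
```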
Further optimizations will be further work; there's no silver bullet. We're building a custom, highly generic query engine on top of a toolkit designed for fixed-depth CRUD interactions, which itself sits on top of an already existing generic query engine. So we're likely going to have to fight each problem Postgres already solves, but do so while pulling it through the sieve of abstractions we've layered on top of it.
I've been trying to root-cause the OOM issue with specific Magma queries, and I think @corps is correct about what is happening. A good fix is currently beyond my brainstorming abilities -- I've played around with the GC, but I can't seem to make a dent in the memory explosion or trigger a cleanup. I have a very minor fix that addresses part of this, but I think a real, significant fix would require re-architecting everything query-related.
I'm actually able to reproduce the issue locally, where a query will consume >12GB of memory that is not released upon completion of the request (this happens against both `/retrieve` and `/query`). Profiling the issue shows basically huge numbers of allocated Strings. Because many of the relationships in the query are one-to-many (project to many collections), we're creating a Cartesian product through all of our joins. If you look at the Postgres `EXPLAIN (ANALYZE)` output, you see something like 604 million virtual rows created (pardon the aliases), and I think each of those requires memory to be allocated, because our query architecture gets back all the records, groups them, re-organizes them, etc.

This is the most relevant discussion similar to what I think is happening. We're already using `sequel_pg`, and I don't think the GC will clean up soon enough to prevent the memory issue. I've tried triggering the GC manually post-query, but it doesn't seem to release the memory either (though I'm probably doing it wrong).

Posting the data here to document progress. I'm probably going to shelve this for a while, since I think there's a more involved fix required to the query architecture that could potentially break many other things ... @corps or @graft, any suggestions on approaches?
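For anyone retracing this, the plan referenced above can be pulled through Sequel's Postgres adapter rather than psql; `query_dataset` is again a stand-in for the dataset Magma builds for the request.

```ruby
# Print the Postgres plan with execution stats for the generated query.
# The huge "rows" counts on the join nodes are the Cartesian product
# described above.
puts query_dataset.explain(analyze: true)
```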