mountetna / monoetna

mono-repository version of etna projects
GNU General Public License v2.0

Magma: OOM issue with specific queries #937

Closed coleshaw closed 1 year ago

coleshaw commented 2 years ago

I've been trying to root-cause the OOM issue with specific Magma queries, and I think @corps is correct in what is happening. A good fix is currently beyond my brainstorming abilities -- I've played around with the GC, but I can't seem to make a dent in the memory explosion or trigger a clean up. I have a very minor fix that addresses part of this, but I think a real, significant fix would require re-architecting everything query-related.

I'm actually able to reproduce the issue locally, where a query will consume >12GB of memory that is not released upon completion of the request (this happens against both /retrieve and /query). Profiling shows that the bulk of it is a huge number of allocated Strings, e.g.:

Total allocated: 4.57 GB (19773266 objects)
Total retained:  53.68 kB (488 objects)

allocated memory by gem
-----------------------------------
   4.16 GB  sequel-5.57.0
 410.21 MB  app/lib
   3.48 kB  puma-5.6.4
   3.40 kB  etna/lib

allocated memory by file
-----------------------------------
   4.15 GB  /bundle/ruby/2.7.0/gems/sequel-5.57.0/lib/sequel/adapters/postgres.rb
 410.01 MB  /app/lib/magma/query/predicate/model.rb
   7.76 MB  /bundle/ruby/2.7.0/gems/sequel-5.57.0/lib/sequel/dataset/actions.rb

[...]

Allocated String Report
-----------------------------------
   1520643  "UCSF Immunoprofiler"
   1520640  /bundle/ruby/2.7.0/gems/sequel-5.57.0/lib/sequel/adapters/postgres.rb:610
         2  /bundle/ruby/2.7.0/gems/sequel-5.57.0/lib/sequel/adapters/postgres.rb:780
         1  /app/lib/magma/query/predicate/model.rb:82

    760401  "ritpwlgosh"
    760354  /app/lib/magma/query/predicate/model.rb:262
        34  /app/lib/magma/query/predicate/column.rb:41
         7  /bundle/ruby/2.7.0/gems/sequel-5.57.0/lib/sequel/dataset/sql.rb:686
         3  /app/lib/magma/query/predicate.rb:155
         2  /bundle/ruby/2.7.0/gems/sequel-5.57.0/lib/sequel/adapters/utils/unmodified_identifiers.rb:24
         1  /bundle/ruby/2.7.0/gems/sequel-5.57.0/lib/sequel/core.rb:230

    760400  "ritpwlgosh_plate_name"
    760356  /app/lib/magma/query/predicate/model.rb:262
        34  /app/lib/magma/query/predicate/column.rb:41
         4  /bundle/ruby/2.7.0/gems/sequel-5.57.0/lib/sequel/adapters/utils/unmodified_identifiers.rb:24
         4  /bundle/ruby/2.7.0/gems/sequel-5.57.0/lib/sequel/dataset/sql.rb:686
         2  /bundle/ruby/2.7.0/gems/sequel-5.57.0/lib/sequel/adapters/postgres.rb:610
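A quick way to see how duplicate column values become distinct String objects (one per row) is with Ruby's built-in ObjectSpace. This is a minimal sketch of the effect, not the actual profiler run above; no Magma or Sequel involved, we just simulate a result set where every row materializes its own copy of the same text value, as the Postgres adapter does:

```ruby
# Count String allocations across a block using only the Ruby stdlib.
def string_allocations
  GC.disable
  before = ObjectSpace.count_objects[:T_STRING]
  yield
  after = ObjectSpace.count_objects[:T_STRING]
  GC.enable
  after - before
end

# Simulate 10,000 result rows, each holding its own copy of the same
# project name -- duplicate values, but distinct String objects.
allocated = string_allocations do
  10_000.times.map { { "sullsvsczp_name" => "UCSF Immunoprofiler".dup } }
end

puts allocated  # at least one new String per simulated row
```

This mirrors the shape of the report above: 1.5M copies of "UCSF Immunoprofiler" live on the heap simultaneously even though they are all the same text.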

Because many of the relationships in the query are one-to-many (project to many collections), we're creating a Cartesian product through all of our joins. If you look at the Postgres EXPLAIN (ANALYZE), you see something like 604 million estimated virtual rows (pardon the aliases). I think each of those rows requires its own memory allocation, because our query architecture fetches back all records, then groups and re-organizes them.

magma_development=# EXPLAIN (ANALYZE) SELECT "sullsvsczp"."name" AS "sullsvsczp_name", "sullsvsczp"."name" AS "sullsvsczp_name", "yhmhpykmap"."pool_name" AS "yhmhpykmap_pool_name", "yhmhpykmap"."pool_name" AS "yhmhpykmap_pool_name", "snzfcfyusr"."name" AS "snzfcfyusr_name", "snzfcfyusr"."name" AS "snzfcfyusr_name", "rijnmcfgcs"."name" AS "rijnmcfgcs_name", "rijnmcfgcs"."name" AS "rijnmcfgcs_name", "xtqbcgdjya"."plate_name" AS "xtqbcgdjya_plate_name", "xtqbcgdjya"."plate_name" AS "xtqbcgdjya_plate_name", "wzpotluqmm"."tube_name" AS "wzpotluqmm_tube_name", "wzpotluqmm"."tube_name" AS "wzpotluqmm_tube_name", "sullsvsczp"."faq" AS "sullsvsczp_faq" FROM "ipi"."projects" AS "sullsvsczp" FULL OUTER JOIN "ipi"."cytof_pools" AS "yhmhpykmap" ON ("sullsvsczp"."id" = "yhmhpykmap"."project_id") FULL OUTER JOIN "ipi"."documents" AS "snzfcfyusr" ON ("sullsvsczp"."id" = "snzfcfyusr"."project_id") FULL OUTER JOIN "ipi"."experiments" AS "rijnmcfgcs" ON ("sullsvsczp"."id" = "rijnmcfgcs"."project_id") FULL OUTER JOIN "ipi"."rna_seq_plates" AS "xtqbcgdjya" ON ("sullsvsczp"."id" = "xtqbcgdjya"."project_id") FULL OUTER JOIN "ipi"."sc_rna_seq_pools" AS "wzpotluqmm" ON ("sullsvsczp"."id" = "wzpotluqmm"."project_id") WHERE ("sullsvsczp"."name" IN ('UCSF Immunoprofiler')) ORDER BY "sullsvsczp"."name";
                                                                            QUERY PLAN                                                                             
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Right Join  (cost=22.69..46035.00 rows=604800000 width=366) (actual time=0.156..304.146 rows=760320 loops=1)
   Merge Cond: (yhmhpykmap.project_id = sullsvsczp.id)
   ->  Index Scan using ipi_cytof_pools_project_id_index on cytof_pools yhmhpykmap  (cost=0.14..51.15 rows=200 width=36) (actual time=0.016..0.033 rows=3 loops=1)
   ->  Materialize  (cost=22.55..8183.35 rows=3024000 width=171) (actual time=0.135..200.811 rows=760318 loops=1)
         ->  Merge Left Join  (cost=22.55..623.35 rows=3024000 width=171) (actual time=0.133..44.458 rows=253440 loops=1)
               Merge Cond: (sullsvsczp.id = snzfcfyusr.project_id)
               ->  Merge Left Join  (cost=16.72..29.53 rows=33600 width=139) (actual time=0.096..2.147 rows=8448 loops=1)
                     Merge Cond: (sullsvsczp.id = wzpotluqmm.project_id)
                     ->  Merge Left Join  (cost=4.81..5.86 rows=672 width=107) (actual time=0.077..0.322 rows=704 loops=1)
                           Merge Cond: (sullsvsczp.id = rijnmcfgcs.project_id)
                           ->  Merge Left Join  (cost=3.14..3.63 rows=32 width=75) (actual time=0.056..0.085 rows=32 loops=1)
                                 Merge Cond: (sullsvsczp.id = xtqbcgdjya.project_id)
                                 ->  Sort  (cost=1.02..1.03 rows=1 width=68) (actual time=0.020..0.021 rows=1 loops=1)
                                       Sort Key: sullsvsczp.id
                                       Sort Method: quicksort  Memory: 25kB
                                       ->  Seq Scan on projects sullsvsczp  (cost=0.00..1.01 rows=1 width=68) (actual time=0.010..0.011 rows=1 loops=1)
                                             Filter: (name = 'UCSF Immunoprofiler'::text)
                                 ->  Sort  (cost=2.12..2.20 rows=32 width=11) (actual time=0.033..0.042 rows=33 loops=1)
                                       Sort Key: xtqbcgdjya.project_id
                                       Sort Method: quicksort  Memory: 26kB
                                       ->  Seq Scan on rna_seq_plates xtqbcgdjya  (cost=0.00..1.32 rows=32 width=11) (actual time=0.009..0.018 rows=38 loops=1)
                           ->  Sort  (cost=1.67..1.72 rows=21 width=36) (actual time=0.021..0.091 rows=673 loops=1)
                                 Sort Key: rijnmcfgcs.project_id
                                 Sort Method: quicksort  Memory: 26kB
                                 ->  Seq Scan on experiments rijnmcfgcs  (cost=0.00..1.21 rows=21 width=36) (actual time=0.006..0.012 rows=22 loops=1)
                     ->  Sort  (cost=11.91..12.04 rows=50 width=36) (actual time=0.017..0.445 rows=8449 loops=1)
                           Sort Key: wzpotluqmm.project_id
                           Sort Method: quicksort  Memory: 26kB
                           ->  Seq Scan on sc_rna_seq_pools wzpotluqmm  (cost=0.00..10.50 rows=50 width=36) (actual time=0.005..0.010 rows=20 loops=1)
               ->  Sort  (cost=5.82..6.05 rows=90 width=36) (actual time=0.036..10.624 rows=253441 loops=1)
                     Sort Key: snzfcfyusr.project_id
                     Sort Method: quicksort  Memory: 28kB
                     ->  Seq Scan on documents snzfcfyusr  (cost=0.00..2.90 rows=90 width=36) (actual time=0.005..0.019 rows=45 loops=1)
 Planning time: 1.235 ms
 Execution time: 329.407 ms
(35 rows)
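As a sanity check, the planner's 604,800,000-row estimate is just the product of the per-table row estimates visible in the plan above (1 project, 32 rna_seq_plates, 21 experiments, 50 sc_rna_seq_pools, 90 documents, 200 cytof_pools):

```ruby
# Per-table row estimates taken from the EXPLAIN output above.
estimates = {
  projects:           1,
  rna_seq_plates:    32,
  experiments:       21,
  sc_rna_seq_pools:  50,
  documents:         90,
  cytof_pools:      200,
}

# Multiplying them reproduces the planner's rows=604800000 estimate.
puts estimates.values.reduce(:*)
```

So every additional one-to-many table joined in multiplies the (estimated and actual) row count rather than adding to it.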

This is the most relevant discussion similar to what I think is happening. We're already using sequel_pg, and I don't think the GC will clean up soon enough to prevent the memory issue. I've tried triggering the GC manually post-query, but it doesn't seem to release the memory either (though I'm probably doing it wrong).

Posting the data here to document progress. I'm probably going to shelve this for a while, since I think a more involved fix to the query architecture is required, which could potentially break many other things ... @corps or @graft , any suggestions on approaches?

graft commented 2 years ago

Could you summarize the GC issue here? Also, a question I have: some parts of the query reduction still occur in memory (like ::first, I think). Are these responsible for the large number of rows being returned? I.e., if we moved these parts of the query into SQL, could the number of rows be cut down?

coleshaw commented 2 years ago

So I don't think the issue is specifically with GC; rather, GC could be part of the solution. I do not think moving parts of the query into SQL would help cut the number of rows down.

When doing JOINs against one-to-many relationships, the number of rows gets multiplied across all such tables. Take a simple example, where we have:

    project
    /          \
plate     pool

If there is 1 project record, 3 plate records, and 4 pool records, Postgres will create a joined result with 3 × 4 = 12 total rows:

| project | plate  | pool  |
|---------|--------|-------|
| proj1   | plate1 | pool1 |
| proj1   | plate1 | pool2 |
| proj1   | plate1 | pool3 |
| proj1   | plate1 | pool4 |
| proj1   | plate2 | pool1 |
| proj1   | plate2 | pool2 |
| proj1   | plate2 | pool3 |
| proj1   | plate2 | pool4 |
| proj1   | plate3 | pool1 |
| proj1   | plate3 | pool2 |
| proj1   | plate3 | pool3 |
| proj1   | plate3 | pool4 |

Now imagine that we're doing this in a project with, say, 21 experiments, 39 documents, 3 cytof pools, 37 rna plates, and 12 sc pools. That's 1,090,908 rows with lots of duplicated data. I believe the way Sequel handles this in the postgres adapter is that each row is allocated as a Hash in memory, and each column value as a String. So even though they are duplicate text values, each String is repeated many, many times in memory and thus takes up a lot of total memory, as can be seen in the memory profiling.
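The fan-out described above can be sketched directly in Ruby (hypothetical names; Array#product plays the role of the SQL cross join):

```ruby
plates = %w[plate1 plate2 plate3]
pools  = %w[pool1 pool2 pool3 pool4]

# Every plate is paired with every pool, duplicating the project value
# (and plate/pool values) into each row: 3 * 4 = 12 rows.
rows = plates.product(pools).map { |plate, pool| ["proj1", plate, pool] }
puts rows.length  # => 12

# The same multiplication at project scale:
# 21 experiments * 39 documents * 3 cytof pools * 37 rna plates * 12 sc pools
puts [21, 39, 3, 37, 12].reduce(:*)  # => 1090908
```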

I think theoretically we could tune the GC to try and clean out these rows after the query is done (?), but unless we switch to some sort of streaming data handling, our current architecture loads all the rows into memory and then serializes them into the query output format. I think that just kicks the can down the road, and I think we might still have to deal with ballooning memory (to get the output) before releasing it post-response via GC?

Perhaps some different approaches might be:

1) Stream the output, a la Metis tail. This would break how clients currently handle responses, but I think it would be better able to handle large cross-join type queries like this.
2) Break up ChildAttribute and TableAttribute queries to join in Ruby instead of via SQL joins. This would reduce the Cartesian product effect, but probably introduces some other types of inefficiencies...
3) It might be possible to get the same effect via a set of subqueries (instead of JOINs), but I haven't thought through the details of what such a query might look like, whether it would work with how we do filtering, etc., or whether it would save us anything in terms of memory.
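To illustrate the streaming idea, here's a minimal sketch (all names hypothetical, no database involved): serialize each row as it arrives and write it straight to the response, rather than buffering the full result set. In real code the `rows` enumerable would come from something like Sequel's `Dataset#paged_each`, which fetches rows in batches via a cursor:

```ruby
require "json"
require "stringio"

# Hypothetical sketch: stream query output as newline-delimited JSON
# (a la Metis tail). Only one row is in flight at a time, so memory
# stays flat regardless of the result-set size.
def stream_rows(rows, io)
  rows.each do |row|               # real code: dataset.paged_each do |row|
    io.write(JSON.generate(row))
    io.write("\n")
  end
end

out = StringIO.new
stream_rows([{ name: "proj1" }, { name: "proj2" }], out)
print out.string
```

The trade-off is exactly the one noted in option 1: clients would need to consume a row-at-a-time format instead of a single JSON document.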

corps commented 2 years ago

Great analysis. This lines up with some of my suspicions.

Stream processing is probably ideal no matter what. We should see how far we can push Sequel to give us the low-level psql adapter primitives that enable this handling better. It is theoretically possible, but the impact on our codebase could be large.

Further optimizations will be further work, there's no silver bullet. We're building a custom highly generic query engine on top of a toolkit designed for fixed depth CRUD interactions built on top of an already existing generic query engine, so we're likely going to have to fight each problem that psql already solves, but do so while pulling it through the sieve of abstractions we've layered on top of it.