oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0
252 stars 40 forks source link

Should we add explicit index hints to queries? #5302

Open jgallagher opened 8 months ago

jgallagher commented 8 months ago

On #5299, we had a query that was using an inordinate amount of memory because it was FULL SCANing the pkey index of a table that was erroneously very large. #5301 fixes the "erroneously very large", but it's still surprising that the query in question was doing a FULL SCAN; when running omicron tests locally, the query required the creation of the inv_zpool_by_id_and_time index to pass our "no full scans allowed" check.

When running a portion of the query by hand under EXPLAIN ANALYZE, we can see that even once the table has been pruned down to its expected size of a few hundred rows, we're still full scanning inv_zpool@inv_zpool_pkey:

root@[fd00:1122:3344:109::3]:32221/omicron> explain analyze WITH "old_zpool_usage" AS (SELECT "dataset"."pool_id", sum("dataset"."size_used") AS size_used FROM "dataset" WHERE (("dataset"."size_used" IS NOT NULL) AND ("dataset"."time_deleted" IS NULL)) GROUP BY "dataset"."pool_id"), "candidate_zpools" AS (SELECT DISTINCT ON ("zpool"."sled_id")"old_zpool_usage"."pool_id" FROM ("old_zpool_usage" INNER JOIN ("zpool" INNER JOIN "sled" ON ("zpool"."sled_id" = "sled"."id")) ON ("zpool"."id" = "old_zpool_usage"."pool_id")) WHERE (((("old_zpool_usage"."size_used" + 10737418240) <= (SELECT total_size FROM omicron.public.inv_zpool WHERE inv_zpool.id = old_zpool_usage.pool_id ORDER BY inv_zpool.time_collected DESC LIMIT 1)) AND ("sled"."sled_policy" = 'in_service')) AND ("sled"."sled_state" = 'active')) ORDER BY "zpool"."sled_id", md5((CAST("zpool"."id" as BYTEA)))) select * from candidate_zpools;
                                                                info
-------------------------------------------------------------------------------------------------------------------------------------
  planning time: 4ms
  execution time: 264ms
  distribution: full
  vectorized: true
  rows read from KV: 1,716 (177 KiB)
  cumulative time spent in KV: 265ms
  maximum memory usage: 890 KiB
  network usage: 45 KiB (36 messages)

  • render
  │ nodes: n1
  │ actual row count: 11
  │ estimated row count: 11
  │
  └── • distinct
      │ nodes: n1, n4
      │ actual row count: 11
      │ estimated max memory allocated: 20 KiB
      │ estimated max sql temp disk usage: 0 B
      │ distinct on: sled_id
      │
      └── • sort
          │ nodes: n1, n4
          │ actual row count: 110
          │ estimated max memory allocated: 40 KiB
          │ estimated max sql temp disk usage: 0 B
          │ estimated row count: 52
          │ order: +column53
          │
          └── • render
              │ nodes: n1, n4
              │ actual row count: 110
              │ estimated max memory allocated: 130 KiB
              │ estimated max sql temp disk usage: 0 B
              │ estimated row count: 52
              │
              └── • hash join
                  │ nodes: n1, n4
                  │ actual row count: 110
                  │ estimated max memory allocated: 130 KiB
                  │ estimated max sql temp disk usage: 0 B
                  │ estimated row count: 52
                  │ equality: (sled_id) = (id)
                  │ right cols are key
                  │
                  ├── • hash join
                  │   │ nodes: n1, n4
                  │   │ actual row count: 110
                  │   │ estimated max memory allocated: 120 KiB
                  │   │ estimated max sql temp disk usage: 0 B
                  │   │ estimated row count: 34
                  │   │ equality: (id) = (pool_id)
                  │   │ left cols are key
                  │   │ right cols are key
                  │   │
                  │   ├── • scan
                  │   │     nodes: n1
                  │   │     actual row count: 132
                  │   │     KV time: 8ms
                  │   │     KV contention time: 0µs
                  │   │     KV rows read: 132
                  │   │     KV bytes read: 13 KiB
                  │   │     estimated max memory allocated: 30 KiB
                  │   │     estimated row count: 132 (100% of the table; stats collected 5 days ago)
                  │   │     table: zpool@zpool_pkey
                  │   │     spans: FULL SCAN
                  │   │
                  │   └── • filter
                  │       │ nodes: n1, n4
                  │       │ actual row count: 110
                  │       │ estimated row count: 34
                  │       │ filter: total_size >= (size_used + 10737418240)
                  │       │
                  │       └── • distinct
                  │           │ nodes: n1, n4
                  │           │ actual row count: 110
                  │           │ estimated max memory allocated: 20 KiB
                  │           │ estimated max sql temp disk usage: 0 B
                  │           │ distinct on: pool_id
                  │           │
                  │           └── • sort
                  │               │ nodes: n1, n4
                  │               │ actual row count: 1,210
                  │               │ estimated max memory allocated: 400 KiB
                  │               │ estimated max sql temp disk usage: 0 B
                  │               │ estimated row count: 1,123
                  │               │ order: -time_collected
                  │               │
                  │               └── • hash join (right outer)
                  │                   │ nodes: n1, n4
                  │                   │ actual row count: 1,210
                  │                   │ estimated max memory allocated: 230 KiB
                  │                   │ estimated max sql temp disk usage: 0 B
                  │                   │ estimated row count: 1,123
                  │                   │ equality: (id) = (pool_id)
                  │                   │ right cols are key
                  │                   │
                  │                   ├── • scan
                  │                   │     nodes: n1
                  │                   │     actual row count: 1,452
                  │                   │     KV time: 254ms
                  │                   │     KV contention time: 0µs
                  │                   │     KV rows read: 1,452
                  │                   │     KV bytes read: 147 KiB
                  │                   │     estimated max memory allocated: 230 KiB
                  │                   │     estimated row count: 1,452 (100% of the table; stats collected 6 seconds ago)
                  │                   │     table: inv_zpool@inv_zpool_pkey
                  │                   │     spans: FULL SCAN
                  │                   │
                  │                   └── • render
                  │                       │ nodes: n4
                  │                       │ actual row count: 110
                  │                       │ estimated max memory allocated: 140 KiB
                  │                       │ estimated max sql temp disk usage: 0 B
                  │                       │ estimated row count: 102
                  │                       │
                  │                       └── • group (hash)
                  │                           │ nodes: n4
                  │                           │ actual row count: 110
                  │                           │ estimated max memory allocated: 140 KiB
                  │                           │ estimated max sql temp disk usage: 0 B
                  │                           │ estimated row count: 102
                  │                           │ group by: pool_id
                  │                           │
                  │                           └── • filter
                  │                               │ nodes: n4
                  │                               │ actual row count: 110
                  │                               │ estimated row count: 110
                  │                               │ filter: (size_used IS NOT NULL) AND (time_deleted IS NULL)
                  │                               │
                  │                               └── • scan
                  │                                     nodes: n4
                  │                                     actual row count: 121
                  │                                     KV time: 1ms
                  │                                     KV contention time: 0µs
                  │                                     KV rows read: 121
                  │                                     KV bytes read: 14 KiB
                  │                                     estimated max memory allocated: 30 KiB
                  │                                     estimated row count: 121 (100% of the table; stats collected 5 minutes ago)
                  │                                     table: dataset@dataset_pkey
                  │                                     spans: FULL SCAN
                  │
                  └── • filter
                      │ nodes: n1
                      │ actual row count: 11
                      │ estimated row count: 11
                      │ filter: (sled_policy = 'in_service') AND (sled_state = 'active')
                      │
                      └── • scan
                            nodes: n1
                            actual row count: 11
                            KV time: 1ms
                            KV contention time: 0µs
                            KV rows read: 11
                            KV bytes read: 2.0 KiB
                            estimated max memory allocated: 20 KiB
                            estimated row count: 11 (100% of the table; stats collected 14 hours ago)
                            table: sled@sled_pkey
                            spans: FULL SCAN
(159 rows)

Time: 271ms total (execution 269ms / network 2ms)

@smklein pointed out that CRDB supports index hints. If we modify the above query, changing FROM omicron.public.inv_zpool to FROM omicron.public.inv_zpool@{NO_FULL_SCAN}, the full scan of inv_zpool is gone, and the max memory used by the query is reduced significantly:

root@[fd00:1122:3344:109::3]:32221/omicron> explain analyze WITH "old_zpool_usage" AS (SELECT "dataset"."pool_id", sum("dataset"."size_used") AS size_used FROM "dataset" WHERE (("dataset"."size_used" IS NOT NULL) AND ("dataset"."time_deleted" IS NULL)) GROUP BY "dataset"."pool_id"), "candidate_zpools" AS (SELECT DISTINCT ON ("zpool"."sled_id")"old_zpool_usage"."pool_id" FROM ("old_zpool_usage" INNER JOIN ("zpool" INNER JOIN "sled" ON ("zpool"."sled_id" = "sled"."id")) ON ("zpool"."id" = "old_zpool_usage"."pool_id")) WHERE (((("old_zpool_usage"."size_used" + 10737418240) <= (SELECT total_size FROM omicron.public.inv_zpool@{NO_FULL_SCAN} WHERE inv_zpool.id = old_zpool_usage.pool_id ORDER BY inv_zpool.time_collected DESC LIMIT 1)) AND ("sled"."sled_policy" = 'in_service')) AND ("sled"."sled_state" = 'active')) ORDER BY "zpool"."sled_id", md5((CAST("zpool"."id" as BYTEA)))) select * from candidate_zpools;
                                                                   info
------------------------------------------------------------------------------------------------------------------------------------------
  planning time: 4ms
  execution time: 381ms
  distribution: full
  vectorized: true
  rows read from KV: 2,684 (245 KiB)
  cumulative time spent in KV: 364ms
  maximum memory usage: 360 KiB
  network usage: 12 KiB (24 messages)

  • render
  │ nodes: n1
  │ actual row count: 11
  │ estimated row count: 11
  │
  └── • distinct
      │ nodes: n1, n4
      │ actual row count: 11
      │ estimated max memory allocated: 20 KiB
      │ estimated max sql temp disk usage: 0 B
      │ distinct on: sled_id
      │
      └── • sort
          │ nodes: n1, n4
          │ actual row count: 110
          │ estimated max memory allocated: 50 KiB
          │ estimated max sql temp disk usage: 0 B
          │ estimated row count: 52
          │ order: +column53
          │
          └── • render
              │ nodes: n1, n4
              │ actual row count: 110
              │ estimated max memory allocated: 130 KiB
              │ estimated max sql temp disk usage: 0 B
              │ estimated row count: 52
              │
              └── • hash join
                  │ nodes: n1, n4
                  │ actual row count: 110
                  │ estimated max memory allocated: 130 KiB
                  │ estimated max sql temp disk usage: 0 B
                  │ estimated row count: 52
                  │ equality: (sled_id) = (id)
                  │ right cols are key
                  │
                  ├── • hash join
                  │   │ nodes: n1, n4
                  │   │ actual row count: 110
                  │   │ estimated max memory allocated: 120 KiB
                  │   │ estimated max sql temp disk usage: 0 B
                  │   │ estimated row count: 34
                  │   │ equality: (id) = (pool_id)
                  │   │ left cols are key
                  │   │ right cols are key
                  │   │
                  │   ├── • scan
                  │   │     nodes: n1
                  │   │     actual row count: 132
                  │   │     KV time: 2ms
                  │   │     KV contention time: 0µs
                  │   │     KV rows read: 132
                  │   │     KV bytes read: 13 KiB
                  │   │     estimated max memory allocated: 30 KiB
                  │   │     estimated row count: 132 (100% of the table; stats collected 5 days ago)
                  │   │     table: zpool@zpool_pkey
                  │   │     spans: FULL SCAN
                  │   │
                  │   └── • filter
                  │       │ nodes: n4
                  │       │ actual row count: 110
                  │       │ estimated row count: 34
                  │       │ filter: total_size >= (size_used + 10737418240)
                  │       │
                  │       └── • distinct
                  │           │ nodes: n4
                  │           │ actual row count: 110
                  │           │ estimated max memory allocated: 10 KiB
                  │           │ estimated max sql temp disk usage: 0 B
                  │           │ distinct on: pool_id
                  │           │
                  │           └── • sort
                  │               │ nodes: n4
                  │               │ actual row count: 1,210
                  │               │ estimated max memory allocated: 380 KiB
                  │               │ estimated max sql temp disk usage: 0 B
                  │               │ estimated row count: 1,123
                  │               │ order: -time_collected
                  │               │
                  │               └── • lookup join (left outer)
                  │                   │ nodes: n4
                  │                   │ actual row count: 1,210
                  │                   │ KV time: 159ms
                  │                   │ KV contention time: 0µs
                  │                   │ KV rows read: 1,210
                  │                   │ KV bytes read: 102 KiB
                  │                   │ estimated max memory allocated: 760 KiB
                  │                   │ table: inv_zpool@inv_zpool_pkey
                  │                   │ equality: (inv_collection_id, sled_id, id) = (inv_collection_id,sled_id,id)
                  │                   │ equality cols are key
                  │                   │
                  │                   └── • lookup join (left outer)
                  │                       │ nodes: n4
                  │                       │ actual row count: 1,210
                  │                       │ KV time: 199ms
                  │                       │ KV contention time: 0µs
                  │                       │ KV rows read: 1,210
                  │                       │ KV bytes read: 114 KiB
                  │                       │ estimated max memory allocated: 60 KiB
                  │                       │ estimated row count: 1,123
                  │                       │ table: inv_zpool@inv_zpool_by_id_and_time
                  │                       │ equality: (pool_id) = (id)
                  │                       │
                  │                       └── • render
                  │                           │ nodes: n4
                  │                           │ actual row count: 110
                  │                           │ estimated max memory allocated: 140 KiB
                  │                           │ estimated max sql temp disk usage: 0 B
                  │                           │ estimated row count: 102
                  │                           │
                  │                           └── • group (hash)
                  │                               │ nodes: n4
                  │                               │ actual row count: 110
                  │                               │ estimated max memory allocated: 140 KiB
                  │                               │ estimated max sql temp disk usage: 0 B
                  │                               │ estimated row count: 102
                  │                               │ group by: pool_id
                  │                               │
                  │                               └── • filter
                  │                                   │ nodes: n4
                  │                                   │ actual row count: 110
                  │                                   │ estimated row count: 110
                  │                                   │ filter: (size_used IS NOT NULL) AND (time_deleted IS NULL)
                  │                                   │
                  │                                   └── • scan
                  │                                         nodes: n4
                  │                                         actual row count: 121
                  │                                         KV time: 3ms
                  │                                         KV contention time: 0µs
                  │                                         KV rows read: 121
                  │                                         KV bytes read: 14 KiB
                  │                                         estimated max memory allocated: 30 KiB
                  │                                         estimated row count: 121 (100% of the table; stats collected 15 seconds ago)
                  │                                         table: dataset@dataset_pkey
                  │                                         spans: FULL SCAN
                  │
                  └── • filter
                      │ nodes: n1
                      │ actual row count: 11
                      │ estimated row count: 11
                      │ filter: (sled_policy = 'in_service') AND (sled_state = 'active')
                      │
                      └── • scan
                            nodes: n1
                            actual row count: 11
                            KV time: 1ms
                            KV contention time: 0µs
                            KV rows read: 11
                            KV bytes read: 2.0 KiB
                            estimated max memory allocated: 20 KiB
                            estimated row count: 11 (100% of the table; stats collected 14 hours ago)
                            table: sled@sled_pkey
                            spans: FULL SCAN
(162 rows)

Time: 388ms total (execution 386ms / network 2ms)

There are at least a couple possible issues here:

jgallagher commented 8 months ago

Formatted version of the query:

WITH "old_zpool_usage" AS
(
         SELECT   "dataset"."pool_id", Sum("dataset"."size_used") AS size_used
         FROM     "dataset"
         WHERE    (("dataset"."size_used" IS NOT NULL)
                   AND
                   ("dataset"."time_deleted" IS NULL))
         GROUP BY "dataset"."pool_id") 
), "candidate_zpools" AS
(
                SELECT DISTINCT ON ("zpool"."sled_id") "old_zpool_usage"."pool_id"
                FROM            ("old_zpool_usage"
                INNER JOIN      ("zpool" INNER JOIN "sled" ON ("zpool"."sled_id" = "sled"."id")) ON ( "zpool"."id" = "old_zpool_usage"."pool_id"))
                WHERE           (((("old_zpool_usage"."size_used" + 10737418240) <=
                                    (SELECT   total_size
                                     FROM     omicron.PUBLIC.inv_zpool
                                     WHERE    inv_zpool.id = old_zpool_usage.pool_id
                                     ORDER BY inv_zpool.time_collected DESC limit 1))
                                AND ("sled"."sled_policy" = 'in_service'))
                                AND ("sled"."sled_state" = 'active'))
                ORDER BY        "zpool"."sled_id", Md5((Cast("zpool"."id" AS BYTEA))))
SELECT *
FROM   candidate_zpools 
jgallagher commented 8 months ago

@smklein pointed out that my use of SELECT * FROM candidate_zpools might be inducing a full scan. The full CTE from the dtrace log is:

WITH "old_regions" AS (SELECT "region"."id", "region"."time_created", "region"."time_modified", "region"."dataset_id", "region"."volume_id", "region"."block_size", "region"."blocks_per_extent", "region"."extent_count" FROM "region" WHERE ("region"."volume_id" = $1)), "old_zpool_usage" AS (SELECT "dataset"."pool_id", sum("dataset"."size_used") AS size_used FROM "dataset" WHERE (("dataset"."size_used" IS NOT NULL) AND ("dataset"."time_deleted" IS NULL)) GROUP BY "dataset"."pool_id"), "candidate_zpools" AS (SELECT DISTINCT ON ("zpool"."sled_id")"old_zpool_usage"."pool_id" FROM ("old_zpool_usage" INNER JOIN ("zpool" INNER JOIN "sled" ON ("zpool"."sled_id" = "sled"."id")) ON ("zpool"."id" = "old_zpool_usage"."pool_id")) WHERE (((("old_zpool_usage"."size_used" + 10737418240) <= (SELECT total_size FROM omicron.public.inv_zpool WHERE
              inv_zpool.id = old_zpool_usage.pool_id
              ORDER BY inv_zpool.time_collected DESC LIMIT 1)) AND ("sled"."sled_policy" = $2)) AND ("sled"."sled_state" = $3)) ORDER BY "zpool"."sled_id", md5((CAST("zpool"."id" as BYTEA) || $4))), "candidate_datasets" AS (SELECT DISTINCT ON ("dataset"."pool_id")"dataset"."id", "dataset"."pool_id" FROM ("dataset" INNER JOIN "candidate_zpools" ON ("dataset"."pool_id" = "candidate_zpools"."pool_id")) WHERE ((("dataset"."time_deleted" IS NULL) AND ("dataset"."size_used" IS NOT NULL)) AND ("dataset"."kind" = $5)) ORDER BY "dataset"."pool_id", md5((CAST("dataset"."id" as BYTEA) || $6))), "shuffled_candidate_datasets" AS (SELECT "candidate_datasets"."id", "candidate_datasets"."pool_id" FROM "candidate_datasets" ORDER BY md5((CAST("candidate_datasets"."id" as BYTEA) || $7)) LIMIT $8), "candidate_regions" AS (SELECT gen_random_uuid() AS id, now() AS time_created, now() AS time_modified, "shuffled_candidate_datasets"."id" AS dataset_id, $9 AS volume_id, $10 AS block_size, $11 AS blocks_per_extent, $12 AS extent_count FROM "shuffled_candidate_datasets"), "proposed_dataset_changes" AS (SELECT "candidate_regions"."dataset_id" AS id, "dataset"."pool_id" AS pool_id, (("candidate_regions"."block_size" * "candidate_regions"."blocks_per_extent") * "candidate_regions"."extent_count") AS size_used_delta FROM ("candidate_regions" INNER JOIN "dataset" ON ("dataset"."id" = "candidate_regions"."dataset_id"))), "do_insert" AS (SELECT (((((SELECT COUNT(*) FROM "old_regions" LIMIT $13) < $14) AND CAST(IF(((SELECT COUNT(*) FROM "candidate_zpools" LIMIT $15) >= $16), 'TRUE', 'Not enough space') AS BOOL)) AND CAST(IF(((SELECT COUNT(*) FROM "candidate_regions" LIMIT $17) >= $18), 'TRUE', 'Not enough datasets') AS BOOL)) AND CAST(IF(((SELECT COUNT(DISTINCT "dataset"."pool_id") FROM ("candidate_regions" INNER JOIN "dataset" ON ("candidate_regions"."dataset_id" = "dataset"."id")) LIMIT $19) >= $20), 'TRUE', 'Not enough unique zpools selected') AS BOOL)) AS insert), "inserted_regions" AS (INSERT INTO "region" ("id", "time_created", "time_modified", "dataset_id", "volume_id", "block_size", "blocks_per_extent", "extent_count") SELECT "candidate_regions"."id", "candidate_regions"."time_created", "candidate_regions"."time_modified", "candidate_regions"."dataset_id", "candidate_regions"."volume_id", "candidate_regions"."block_size", "candidate_regions"."blocks_per_extent", "candidate_regions"."extent_count" FROM "candidate_regions" WHERE (SELECT "do_insert"."insert" FROM "do_insert" LIMIT $21) RETURNING "region"."id", "region"."time_created", "region"."time_modified", "region"."dataset_id", "region"."volume_id", "region"."block_size", "region"."blocks_per_extent", "region"."extent_count"), "updated_datasets" AS (UPDATE "dataset" SET "size_used" = ("dataset"."size_used" + (SELECT "proposed_dataset_changes"."size_used_delta" FROM "proposed_dataset_changes" WHERE ("proposed_dataset_changes"."id" = "dataset"."id") LIMIT $22)) WHERE (("dataset"."id" = ANY(SELECT "proposed_dataset_changes"."id" FROM "proposed_dataset_changes")) AND (SELECT "do_insert"."insert" FROM "do_insert" LIMIT $23)) RETURNING "dataset"."id", "dataset"."time_created", "dataset"."time_modified", "dataset"."time_deleted", "dataset"."rcgen", "dataset"."pool_id", "dataset"."ip", "dataset"."port", "dataset"."kind", "dataset"."size_used") (SELECT "dataset"."id", "dataset"."time_created", "dataset"."time_modified", "dataset"."time_deleted", "dataset"."rcgen", "dataset"."pool_id", "dataset"."ip", "dataset"."port", "dataset"."kind", "dataset"."size_used", "old_regions"."id", "old_regions"."time_created", "old_regions"."time_modified", "old_regions"."dataset_id", "old_regions"."volume_id", "old_regions"."block_size", "old_regions"."blocks_per_extent", "old_regions"."extent_count" FROM ("old_regions" INNER JOIN "dataset" ON ("old_regions"."dataset_id" = "dataset"."id"))) UNION (SELECT "updated_datasets"."id", "updated_datasets"."time_created", "updated_datasets"."time_modified", "updated_datasets"."time_deleted", "updated_datasets"."rcgen", "updated_datasets"."pool_id", "updated_datasets"."ip", "updated_datasets"."port", "updated_datasets"."kind", "updated_datasets"."size_used", "inserted_regions"."id", "inserted_regions"."time_created", "inserted_regions"."time_modified", "inserted_regions"."dataset_id", "inserted_regions"."volume_id", "inserted_regions"."block_size", "inserted_regions"."blocks_per_extent", "inserted_regions"."extent_count" FROM ("inserted_regions" INNER JOIN "updated_datasets" ON ("inserted_regions"."dataset_id" = "updated_datasets"."id"))) -- binds: [5774f622-35c0-4122-82c6-009f90872553, InService, Active, [33, 170, 66, 197, 122, 209, 190, 23, 0, 0, 0, 0, 0, 0, 0, 0], Crucible, [33, 170, 66, 197, 122, 209, 190, 23, 0, 0, 0, 0, 0, 0, 0, 0], [33, 170, 66, 197, 122, 209, 190, 23, 0, 0, 0, 0, 0, 0, 0, 0], 3, 5774f622-35c0-4122-82c6-009f90872553, 512, 131072, 160, 1, 3, 1, 3, 1, 3, 1, 3, 1, 1, 1]
jgallagher commented 8 months ago

We poked at this more and it might be a cockroach issue? At least, the reproducer we came up with that shows similar behavior does not do a full scan in Postgres: https://github.com/cockroachdb/cockroach/issues/120866