jgallagher opened 8 months ago
Formatted version of the query:
WITH "old_zpool_usage" AS
(
SELECT "dataset"."pool_id", Sum("dataset"."size_used") AS size_used
FROM "dataset"
WHERE (("dataset"."size_used" IS NOT NULL)
AND
("dataset"."time_deleted" IS NULL))
GROUP BY "dataset"."pool_id")
), "candidate_zpools" AS
(
SELECT DISTINCT ON ("zpool"."sled_id") "old_zpool_usage"."pool_id"
FROM ("old_zpool_usage"
INNER JOIN ("zpool" INNER JOIN "sled" ON ("zpool"."sled_id" = "sled"."id")) ON ( "zpool"."id" = "old_zpool_usage"."pool_id"))
WHERE (((("old_zpool_usage"."size_used" + 10737418240) <=
(SELECT total_size
FROM omicron.PUBLIC.inv_zpool
WHERE inv_zpool.id = old_zpool_usage.pool_id
ORDER BY inv_zpool.time_collected DESC limit 1))
AND ("sled"."sled_policy" = 'in_service'))
AND ("sled"."sled_state" = 'active'))
ORDER BY "zpool"."sled_id", Md5((Cast("zpool"."id" AS BYTEA))))
SELECT *
FROM candidate_zpools
@smklein pointed out that my use of SELECT * FROM candidate_zpools
might be inducing a full scan. The full CTE from the dtrace log is:
WITH "old_regions" AS (SELECT "region"."id", "region"."time_created", "region"."time_modified", "region"."dataset_id", "region"."volume_id", "region"."block_size", "region"."blocks_per_extent", "region"."extent_count" FROM "region" WHERE ("region"."volume_id" = $1)), "old_zpool_usage" AS (SELECT "dataset"."pool_id", sum("dataset"."size_used") AS size_used FROM "dataset" WHERE (("dataset"."size_used" IS NOT NULL) AND ("dataset"."time_deleted" IS NULL)) GROUP BY "dataset"."pool_id"), "candidate_zpools" AS (SELECT DISTINCT ON ("zpool"."sled_id")"old_zpool_usage"."pool_id" FROM ("old_zpool_usage" INNER JOIN ("zpool" INNER JOIN "sled" ON ("zpool"."sled_id" = "sled"."id")) ON ("zpool"."id" = "old_zpool_usage"."pool_id")) WHERE (((("old_zpool_usage"."size_used" + 10737418240) <= (SELECT total_size FROM omicron.public.inv_zpool WHERE
inv_zpool.id = old_zpool_usage.pool_id
ORDER BY inv_zpool.time_collected DESC LIMIT 1)) AND ("sled"."sled_policy" = $2)) AND ("sled"."sled_state" = $3)) ORDER BY "zpool"."sled_id", md5((CAST("zpool"."id" as BYTEA) || $4))), "candidate_datasets" AS (SELECT DISTINCT ON ("dataset"."pool_id")"dataset"."id", "dataset"."pool_id" FROM ("dataset" INNER JOIN "candidate_zpools" ON ("dataset"."pool_id" = "candidate_zpools"."pool_id")) WHERE ((("dataset"."time_deleted" IS NULL) AND ("dataset"."size_used" IS NOT NULL)) AND ("dataset"."kind" = $5)) ORDER BY "dataset"."pool_id", md5((CAST("dataset"."id" as BYTEA) || $6))), "shuffled_candidate_datasets" AS (SELECT "candidate_datasets"."id", "candidate_datasets"."pool_id" FROM "candidate_datasets" ORDER BY md5((CAST("candidate_datasets"."id" as BYTEA) || $7)) LIMIT $8), "candidate_regions" AS (SELECT gen_random_uuid() AS id, now() AS time_created, now() AS time_modified, "shuffled_candidate_datasets"."id" AS dataset_id, $9 AS volume_id, $10 AS block_size, $11 AS blocks_per_extent, $12 AS extent_count FROM "shuffled_candidate_datasets"), "proposed_dataset_changes" AS (SELECT "candidate_regions"."dataset_id" AS id, "dataset"."pool_id" AS pool_id, (("candidate_regions"."block_size" * "candidate_regions"."blocks_per_extent") * "candidate_regions"."extent_count") AS size_used_delta FROM ("candidate_regions" INNER JOIN "dataset" ON ("dataset"."id" = "candidate_regions"."dataset_id"))), "do_insert" AS (SELECT (((((SELECT COUNT(*) FROM "old_regions" LIMIT $13) < $14) AND CAST(IF(((SELECT COUNT(*) FROM "candidate_zpools" LIMIT $15) >= $16), 'TRUE', 'Not enough space') AS BOOL)) AND CAST(IF(((SELECT COUNT(*) FROM "candidate_regions" LIMIT $17) >= $18), 'TRUE', 'Not enough datasets') AS BOOL)) AND CAST(IF(((SELECT COUNT(DISTINCT "dataset"."pool_id") FROM ("candidate_regions" INNER JOIN "dataset" ON ("candidate_regions"."dataset_id" = "dataset"."id")) LIMIT $19) >= $20), 'TRUE', 'Not enough unique zpools selected') AS BOOL)) AS insert), "inserted_regions" AS (INSERT INTO "region" ("id", "time_created", "time_modified", "dataset_id", "volume_id", "block_size", "blocks_per_extent", "extent_count") SELECT "candidate_regions"."id", "candidate_regions"."time_created", "candidate_regions"."time_modified", "candidate_regions"."dataset_id", "candidate_regions"."volume_id", "candidate_regions"."block_size", "candidate_regions"."blocks_per_extent", "candidate_regions"."extent_count" FROM "candidate_regions" WHERE (SELECT "do_insert"."insert" FROM "do_insert" LIMIT $21) RETURNING "region"."id", "region"."time_created", "region"."time_modified", "region"."dataset_id", "region"."volume_id", "region"."block_size", "region"."blocks_per_extent", "region"."extent_count"), "updated_datasets" AS (UPDATE "dataset" SET "size_used" = ("dataset"."size_used" + (SELECT "proposed_dataset_changes"."size_used_delta" FROM "proposed_dataset_changes" WHERE ("proposed_dataset_changes"."id" = "dataset"."id") LIMIT $22)) WHERE (("dataset"."id" = ANY(SELECT "proposed_dataset_changes"."id" FROM "proposed_dataset_changes")) AND (SELECT "do_insert"."insert" FROM "do_insert" LIMIT $23)) RETURNING "dataset"."id", "dataset"."time_created", "dataset"."time_modified", "dataset"."time_deleted", "dataset"."rcgen", "dataset"."pool_id", "dataset"."ip", "dataset"."port", "dataset"."kind", "dataset"."size_used") (SELECT "dataset"."id", "dataset"."time_created", "dataset"."time_modified", "dataset"."time_deleted", "dataset"."rcgen", "dataset"."pool_id", "dataset"."ip", "dataset"."port", "dataset"."kind", "dataset"."size_used", "old_regions"."id", 
"old_regions"."time_created", "old_regions"."time_modified", "old_regions"."dataset_id", "old_regions"."volume_id", "old_regions"."block_size", "old_regions"."blocks_per_extent", "old_regions"."extent_count" FROM ("old_regions" INNER JOIN "dataset" ON ("old_regions"."dataset_id" = "dataset"."id"))) UNION (SELECT "updated_datasets"."id", "updated_datasets"."time_created", "updated_datasets"."time_modified", "updated_datasets"."time_deleted", "updated_datasets"."rcgen", "updated_datasets"."pool_id", "updated_datasets"."ip", "updated_datasets"."port", "updated_datasets"."kind", "updated_datasets"."size_used", "inserted_regions"."id", "inserted_regions"."time_created", "inserted_regions"."time_modified", "inserted_regions"."dataset_id", "inserted_regions"."volume_id", "inserted_regions"."block_size", "inserted_regions"."blocks_per_extent", "inserted_regions"."extent_count" FROM ("inserted_regions" INNER JOIN "updated_datasets" ON ("inserted_regions"."dataset_id" = "updated_datasets"."id"))) -- binds: [5774f622-35c0-4122-82c6-009f90872553, InService, Active, [33, 170, 66, 197, 122, 209, 190, 23, 0, 0, 0, 0, 0, 0, 0, 0], Crucible, [33, 170, 66, 197, 122, 209, 190, 23, 0, 0, 0, 0, 0, 0, 0, 0], [33, 170, 66, 197, 122, 209, 190, 23, 0, 0, 0, 0, 0, 0, 0, 0], 3, 5774f622-35c0-4122-82c6-009f90872553, 512, 131072, 160, 1, 3, 1, 3, 1, 3, 1, 3, 1, 1, 1]
We poked at this more and it might be a cockroach issue? At least, the reproducer we came up with that shows similar behavior does not do a full scan in Postgres: https://github.com/cockroachdb/cockroach/issues/120866
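For context, a distilled sketch of the shape involved, using hypothetical table and column names (`pool_usage`, `inv`); the actual reproducer is in the linked cockroach issue:

```sql
-- Hedged sketch (hypothetical schema), not the exact reproducer from the linked
-- issue. It mirrors the dataset/inv_zpool interaction above: the primary key of
-- `inv` does not lead with `id`, so the correlated "latest row for this id"
-- lookup cannot use a PK prefix.
CREATE TABLE pool_usage (
    pool_id   UUID PRIMARY KEY,
    size_used INT8 NOT NULL
);
CREATE TABLE inv (
    collection_id  UUID        NOT NULL,
    id             UUID        NOT NULL,
    time_collected TIMESTAMPTZ NOT NULL,
    total_size     INT8        NOT NULL,
    PRIMARY KEY (collection_id, id)
);
-- Analogue of inv_zpool_by_id_and_time:
CREATE INDEX inv_by_id_and_time ON inv (id, time_collected DESC);

-- The pattern in question: a correlated ORDER BY ... LIMIT 1 lookup. Without
-- the index above (or a scan hint), CRDB may plan the inner query as a full
-- scan of inv's primary index, whereas Postgres reportedly does not.
SELECT u.pool_id
FROM pool_usage AS u
WHERE (u.size_used + 10737418240) <= (
    SELECT i.total_size
    FROM inv AS i
    WHERE i.id = u.pool_id
    ORDER BY i.time_collected DESC
    LIMIT 1
);
```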
On #5299, we had a query that was using an inordinate amount of memory because it was `FULL SCAN`ing the pkey index of a table that was erroneously very large. #5301 fixes the "erroneously very large", but it's still surprising that the query in question was doing a `FULL SCAN`; when running omicron tests locally, the query required the creation of the `inv_zpool_by_id_and_time` index to pass our "no full scans allowed" check.
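For reference, a rough sketch of what that index presumably looks like (the authoritative DDL lives in the omicron schema; the exact column list is an assumption):

```sql
-- Hedged sketch of the index's likely shape, not the schema's exact DDL: key the
-- lookup by zpool id with the newest collection first, so the correlated
-- ORDER BY time_collected DESC LIMIT 1 can stop after one row.
CREATE INDEX IF NOT EXISTS inv_zpool_by_id_and_time
    ON omicron.public.inv_zpool (id, time_collected DESC);
```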
When running a portion of the query by hand under `EXPLAIN ANALYZE`, we can see that even once the table has been pruned down to its expected size of a few hundred rows, we're still full scanning `inv_zpool@inv_zpool_pkey`:
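A minimal stand-in for that by-hand run (not the exact statement; the UUID is a placeholder), isolating the correlated lookup that reads `inv_zpool`:

```sql
-- Illustrative only: run the per-pool "latest inventory row" lookup under
-- EXPLAIN ANALYZE and look for "FULL SCAN" over inv_zpool@inv_zpool_pkey in the
-- reported plan. The UUID below is a placeholder, not real data.
EXPLAIN ANALYZE
SELECT total_size
FROM omicron.public.inv_zpool
WHERE id = '00000000-0000-0000-0000-000000000000'
ORDER BY time_collected DESC
LIMIT 1;
```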
@smklein pointed out that CRDB supports index hints. If we modify the above query, changing `FROM omicron.public.inv_zpool` to `FROM omicron.public.inv_zpool@{NO_FULL_SCAN}`, the full scan of `inv_zpool` is gone, and the max memory used by the query is reduced significantly:
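Applied to the correlated lookup inside the query, the change described above looks like this (only the table reference changes; the rest of the query is untouched):

```sql
-- The correlated lookup with CRDB's scan hint applied, per the change above.
(SELECT total_size
 FROM omicron.public.inv_zpool@{NO_FULL_SCAN}
 WHERE inv_zpool.id = old_zpool_usage.pool_id
 ORDER BY inv_zpool.time_collected DESC
 LIMIT 1)
```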
There are at least a couple possible issues here:

- In `nexus-db-queries` unit tests, not having the index trips over the "no full scans allowed" check. If we run the query (with the index present but no `NO_FULL_SCAN` hint) against an empty database, CRDB uses the index instead of a full scan even though the hint isn't present. Maybe the necessity of the hint depends on the size of the table?
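One way to poke at that last question, sketched against the hypothetical `pool_usage`/`inv` tables from the earlier sketch rather than the real schema: compare the plan for the lookup on an empty table with the plan after seeding some rows.

```sql
-- Hedged sketch using the hypothetical `inv` table and its (id, time_collected DESC)
-- index from the earlier sketch: does the plan flip from the secondary index to a
-- full scan once the table has rows?
EXPLAIN SELECT total_size FROM inv
    WHERE id = gen_random_uuid()
    ORDER BY time_collected DESC LIMIT 1;

-- Seed some rows, then re-check the plan.
INSERT INTO inv (collection_id, id, time_collected, total_size)
    SELECT gen_random_uuid(), gen_random_uuid(), now(), 0
    FROM generate_series(1, 1000);

EXPLAIN SELECT total_size FROM inv
    WHERE id = gen_random_uuid()
    ORDER BY time_collected DESC LIMIT 1;
```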