oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0
220 stars 33 forks source link

crdb memory budget exceeded error when creating a large number of instances concurrently #5904

Open askfongjojo opened 2 weeks ago

askfongjojo commented 2 weeks ago

While creating 70 instances concurrently using the same Terraform plan I used many times before without any problem, I'm now hitting a number of 500 errors in every run consistently. The errors in nexus log all look like the one below:

02:12:24.115Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (dropshot_external): request completed
    error_message_external = Internal Server Error
    error_message_internal = saga ACTION error at node "output": unexpected database error: root: memory budget exceeded: 7854088 bytes requested, 127439909 currently allocated, 134217728 bytes in budget
    file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/0cd0e82/dropshot/src/server.rs:866
    latency_us = 1918719
    local_addr = 172.30.2.5:443
    method = POST
    remote_addr = 172.20.17.42:54427
    req_id = 9c1272f5-2184-4f5b-b7c9-5b6b43adad1c
    response_code = 500
    uri = https://oxide.sys.rack2.eng.oxide.computer/v1/instances?project=fe0da422-5c48-4b52-8010-f2fc401f090f

Using this D script against the nexus process running in sled 8, I was able to capture the database error in question:

conn_id: f2bdf557-5a71-4a0f-9a98-83a50a11f1cd, latency: 2329 us, query: 'INSERT INTO "saga_node_event" ("saga_id", "node_id", "event_type", "data", "event_time", "creator") VALUES ($1, $2, $3, $4, $5, $6) -- binds: [SagaId(9225d6c7-6369-40fa-b9cf-e3b698434a66), SagaNodeId(5), "failed", Object {"ActionFailed": Object {"source_error": Object {"InternalError": Object {"internal_message": String("unexpected database error: root: memory budget exceeded: 7854088 bytes requested, 133920542 currently allocated, 134217728 bytes in budget")}}}}, 2024-06-15T04:17:54.044368305Z, 65a11c18-7f59-41ac-b9e7-680627f996e7]'

It is however an insert statement and didn't appear to be the culprit. I had a look at 3 different occurrences of the 500 errors and tried to pinpoint the more expensive queries running around the same time. Here is one of those (latency = 346542 us):

WITH found_row AS MATERIALIZED (
  SELECT 
    "vpc_subnet"."id", 
    "vpc_subnet"."name", 
    "vpc_subnet"."description", 
    "vpc_subnet"."time_created", 
    "vpc_subnet"."time_modified", 
    "vpc_subnet"."time_deleted", 
    "vpc_subnet"."vpc_id", 
    "vpc_subnet"."rcgen", 
    "vpc_subnet"."ipv4_block", 
    "vpc_subnet"."ipv6_block" 
  FROM 
    "vpc_subnet" 
  WHERE 
    (
      ("vpc_subnet"."id" = $1) 
      AND (
        "vpc_subnet"."time_deleted" IS NULL
      )
    ) FOR 
  UPDATE 
    ), 
dummy AS MATERIALIZED (
  SELECT 
    IF(
      EXISTS(
        SELECT 
          "id" 
        FROM 
          found_row
      ), 
      TRUE, 
      CAST(1 / 0 AS BOOL)
    )
), 
updated_row AS MATERIALIZED (
  UPDATE 
    "vpc_subnet" 
  SET 
    "rcgen" = "rcgen" + 1 
  WHERE 
    "id" IN (
      SELECT 
        "id" 
      FROM 
        found_row
    ) RETURNING 1
), 
inserted_row AS (
  INSERT INTO "network_interface" (
    "id", "name", "description", "time_created", 
    "time_modified", "time_deleted", 
    "kind", "parent_id", "vpc_id", "subnet_id", 
    "mac", "ip", "slot", "is_primary"
  ) WITH validated_interface(
    "vpc_id", "subnet_id", "parent_id", 
    "slot", "is_primary"
  ) AS (
    SELECT 
      CAST(
        IF(
          COALESCE(
            (
              SELECT 
                "vpc_id" 
              FROM 
                "network_interface" 
              WHERE 
                "time_deleted" IS NULL 
                AND "parent_id" = $2 
                AND "kind" = $3 
              LIMIT 
                1
            ), $4
          ) = $5, 
          $6, 
          'multiple-vpcs'
        ) AS UUID
      ) AS "vpc_id", 
      CAST(
        IF(
          EXISTS(
            SELECT 
              "subnet_id" 
            FROM 
              "network_interface" 
            WHERE 
              "id" != $7 
              AND "parent_id" = $8 
              AND "kind" = $9 
              AND "time_deleted" IS NULL 
              AND "subnet_id" = $10
          ), 
          'non-unique-subnets', 
          $11
        ) AS UUID
      ) AS "subnet_id", 
      (
        CAST(
          CASE COALESCE(
            (
              SELECT 
                CASE WHEN "active_propolis_id" IS NULL THEN "state" ELSE $12 END 
              FROM 
                "instance" 
              WHERE 
                "id" = $13 
                AND "time_deleted" IS NULL
            ), 
            $14
          ) WHEN $15 THEN $16 WHEN $17 THEN $18 WHEN $19 THEN $20 ELSE $21 END AS UUID
        )
      ), 
      (
        SELECT 
          COALESCE(
            (
              SELECT 
                $22 + "shift" AS "slot" 
              FROM 
                (
                  SELECT 
                    generate_series(0, $23) AS "index", 
                    generate_series(0, $24) AS "shift" 
                  UNION ALL 
                  SELECT 
                    generate_series($25, $26) AS "index", 
                    generate_series($27, -1) AS "shift"
                ) 
                LEFT OUTER JOIN "network_interface" ON (
                  "parent_id", "slot", "time_deleted" IS NULL
                ) = ($28, $29 + "shift", TRUE) 
              WHERE 
                "slot" IS NULL 
              ORDER BY 
                "index" 
              LIMIT 
                1
            ), 0
          )
      ), 
      (
        SELECT 
          NOT EXISTS(
            SELECT 
              1 
            FROM 
              "network_interface" 
            WHERE 
              "parent_id" = $30 
              AND "kind" = $31 
              AND "time_deleted" IS NULL 
            LIMIT 
              1
          )
      )
  ) 
  SELECT 
    $32 AS "id", 
    $33 AS "name", 
    $34 AS "description", 
    $35 AS "time_created", 
    $36 AS "time_modified", 
    $37 AS "time_deleted", 
    $38 AS "kind", 
    $39 AS "parent_id", 
    (
      SELECT 
        "vpc_id" 
      FROM 
        validated_interface
    ), 
    (
      SELECT 
        "subnet_id" 
      FROM 
        validated_interface
    ), 
    (
      SELECT 
        $40 + "shift" AS "mac" 
      FROM 
        (
          SELECT 
            generate_series(0, $41) AS "index", 
            generate_series(0, $42) AS "shift" 
          UNION ALL 
          SELECT 
            generate_series($43, $44) AS "index", 
            generate_series($45, -1) AS "shift"
        ) 
        LEFT OUTER JOIN "network_interface" ON (
          "vpc_id", "mac", "time_deleted" IS NULL
        ) = ($46, $47 + "shift", TRUE) 
      WHERE 
        "mac" IS NULL 
      ORDER BY 
        "index" 
      LIMIT 
        1
    ) AS "mac", 
    (
      SELECT 
        $48 + "shift" AS "ip" 
      FROM 
        (
          SELECT 
            generate_series(0, $49) AS "index", 
            generate_series(0, $50) AS "shift" 
          UNION ALL 
          SELECT 
            generate_series($51, $52) AS "index", 
            generate_series($53, -1) AS "shift"
        ) 
        LEFT OUTER JOIN "network_interface" ON (
          "subnet_id", "ip", "time_deleted" IS NULL
        ) = ($54, $55 + "shift", TRUE) 
      WHERE 
        "ip" IS NULL 
      ORDER BY 
        "index" 
      LIMIT 
        1
    ) AS "ip", 
    (
      SELECT 
        "slot" 
      FROM 
        validated_interface
    ), 
    (
      SELECT 
        "is_primary" 
      FROM 
        validated_interface
    ) RETURNING "network_interface"."id", 
    "network_interface"."name", 
    "network_interface"."description", 
    "network_interface"."time_created", 
    "network_interface"."time_modified", 
    "network_interface"."time_deleted", 
    "network_interface"."kind", 
    "network_interface"."parent_id", 
    "network_interface"."vpc_id", 
    "network_interface"."subnet_id", 
    "network_interface"."mac", 
    "network_interface"."ip", 
    "network_interface"."slot", 
    "network_interface"."is_primary"
) 
SELECT 
  * 
FROM 
  inserted_row -- binds: [7612ed87-eff5-433f-b781-cf4cbeb75def, aa47bc6e-e1b9-4b3a-baa0-83f6e4b06c2b, Instance, 91a91cab-2fbd-4c1e-a91f-4f2bcae5705b, 91a91cab-2fbd-4c1e-a91f-4f2bcae5705b, "91a91cab-2fbd-4c1e-a91f-4f2bcae5705b", f0c3f7b4-3190-43d8-b695-3ac56b995077, aa47bc6e-e1b9-4b3a-baa0-83f6e4b06c2b, Instance, 7612ed87-eff5-433f-b781-cf4cbeb75def, "7612ed87-eff5-433f-b781-cf4cbeb75def", Vmm, aa47bc6e-e1b9-4b3a-baa0-83f6e4b06c2b, Destroyed, NoVmm, "aa47bc6e-e1b9-4b3a-baa0-83f6e4b06c2b", Creating, "aa47bc6e-e1b9-4b3a-baa0-83f6e4b06c2b", Destroyed, "no-instance", "bad-state", 0, 8, 8, 9, 8, 0, aa47bc6e-e1b9-4b3a-baa0-83f6e4b06c2b, 0, aa47bc6e-e1b9-4b3a-baa0-83f6e4b06c2b, Instance, f0c3f7b4-3190-43d8-b695-3ac56b995077, Name(Name("net0")), "net0", 2024-06-15T04:17:53.697381777Z, 2024-06-15T04:17:53.697381777Z, None, Instance, aa47bc6e-e1b9-4b3a-baa0-83f6e4b06c2b, MacAddr(MacAddr(MacAddr6([168, 64, 37, 248, 102, 112]))), 432527, 432527, 432528, 983039, -550512, 91a91cab-2fbd-4c1e-a91f-4f2bcae5705b, MacAddr(MacAddr(MacAddr6([168, 64, 37, 248, 102, 112]))), V4(Ipv4Network { addr: 192.168.128.5, prefix: 32 }), 249, 249, 250, 249, 0, 7612ed87-eff5-433f-b781-cf4cbeb75def, V4(Ipv4Network { addr: 192.168.128.5, prefix: 32 })]

(a grep of this query pattern shows the latency was as high as 1523657 us during the window when I was running terraform apply and tracing the database queries).

The complete nexus log and db query tracing output are located in catacomb:/staff/dogfood/omicron-5904.

askfongjojo commented 2 weeks ago

The omicron commit deployed in the environment I used (rack2) was 90bc09ed0838dacb7a299b0ffbfd07feb4608ba7.

The last time I ran the same terraform test was on commit d79a51d57bdf324947275841ac849f2b37edff3a and didn't hit this issue.

jgallagher commented 2 weeks ago

I suspect this subquery is the (or part of the) problem:

    (
      SELECT 
        $40 + "shift" AS "mac" 
      FROM 
        (
          SELECT 
            generate_series(0, $41) AS "index", 
            generate_series(0, $42) AS "shift" 
          UNION ALL 
          SELECT 
            generate_series($43, $44) AS "index", 
            generate_series($45, -1) AS "shift"
        ) 
        LEFT OUTER JOIN "network_interface" ON (
          "vpc_id", "mac", "time_deleted" IS NULL
        ) = ($46, $47 + "shift", TRUE) 
      WHERE 
        "mac" IS NULL 
      ORDER BY 
        "index" 
      LIMIT 
        1
    ) AS "mac", 

After substituting bind parameters, this becomes:

      SELECT 
        184993468409456 + "shift" AS "mac" 
      FROM 
        (
          SELECT 
            generate_series(0, 432527) AS "index", 
            generate_series(0, 432527) AS "shift" 
          UNION ALL 
          SELECT 
            generate_series(432528, 983039) AS "index", 
            generate_series(-550512, -1) AS "shift"
        ) 
        LEFT OUTER JOIN "network_interface" ON (
          "vpc_id", "mac", "time_deleted" IS NULL
        ) = ('91a91cab-2fbd-4c1e-a91f-4f2bcae5705b', 184993468409456 + "shift", TRUE) 
      WHERE 
        "mac" IS NULL 
      ORDER BY 
        "index" 
      LIMIT 
        1

which has some pretty large generate_series() calls. Running just that query in isolation through EXPLAIN ANALYZE emits:

                                                        info
--------------------------------------------------------------------------------------------------------------------
  planning time: 2ms
  execution time: 431ms
  distribution: full
  vectorized: true
  rows read from KV: 117 (8.6 KiB)
  cumulative time spent in KV: 849µs
  maximum memory usage: 25 MiB
  network usage: 11 MiB (660 messages)

  • render
  │ nodes: n2
  │ actual row count: 1
  │ estimated row count: 1
  │
  └── • top-k
      │ nodes: n2
      │ actual row count: 1
      │ estimated row count: 1
      │ order: +"index"
      │ k: 1
      │
      └── • filter
          │ nodes: n1, n2
          │ actual row count: 982,923
          │ estimated row count: 1
          │ filter: mac IS NULL
          │
          └── • hash join (right outer)
              │ nodes: n1, n2
              │ actual row count: 983,040
              │ estimated max memory allocated: 50 MiB
              │ estimated max sql temp disk usage: 0 B
              │ estimated row count: 20
              │ equality: (mac) = (column23)
              │
              ├── • render
              │   │ nodes: n1
              │   │ actual row count: 117
              │   │ KV time: 849µs
              │   │ KV contention time: 0µs
              │   │ KV rows read: 117
              │   │ KV bytes read: 8.6 KiB
              │   │ estimated max memory allocated: 20 KiB
              │   │ estimated row count: 45
              │   │
              │   └── • scan
              │         nodes: n1
              │         actual row count: 117
              │         KV time: 849µs
              │         KV contention time: 0µs
              │         KV rows read: 117
              │         KV bytes read: 8.6 KiB
              │         estimated max memory allocated: 20 KiB
              │         estimated row count: 45 (0.34% of the table; stats collected 3 days ago)
              │         table: network_interface@network_interface_vpc_id_mac_key (partial index)
              │         spans: [/'91a91cab-2fbd-4c1e-a91f-4f2bcae5705b' - /'91a91cab-2fbd-4c1e-a91f-4f2bcae5705b']
              │
              └── • render
                  │ nodes: n2
                  │ actual row count: 983,040
                  │ estimated row count: 20
                  │
                  └── • union all
                      │ nodes: n2
                      │ actual row count: 983,040
                      │ estimated row count: 20
                      │
                      ├── • project set
                      │   │ nodes: n2
                      │   │ actual row count: 432,528
                      │   │ estimated row count: 10
                      │   │
                      │   └── • emptyrow
                      │         nodes: n2
                      │         actual row count: 1
                      │
                      └── • project set
                          │ nodes: n2
                          │ actual row count: 550,512
                          │ estimated row count: 10
                          │
                          └── • emptyrow
                                nodes: n2
                                actual row count: 1
(84 rows)

Time: 435ms total (execution 434ms / network 1ms)

Note the actual memory usage was 25 MiB and the estimated max memory was 50 MiB. Our total query memory budget is only ~128 MiB IIRC, so a few of these running concurrently would max us out. The 983039 value is the distance between MacAddr::MIN_GUEST_ADDR and MacAddr::MAX_GUEST_ADDR.

askfongjojo commented 2 weeks ago

@jgallagher @bnaecker - Thank you for digging into this. As the logic for locating available mac address hasn't changed lately (or for a long time), it is possible that because we have a lot of more background processes running and makeing database queries, the expensive subquery has become more prone to the memory budget limit.

I wonder if we need to re-evaluate the memory budget allocated besides tuning this particular subquery.

cc @smklein

bnaecker commented 2 weeks ago

I've been mulling on this for a bit. The next-item style queries all work the same way: join the existing entries with all possible entries, and take the first row where the existing entry is NULL. That is, select the first possible entry that's not already in the table.

This is fine, but obviously expensive because we form the full set of all possible values. For things like MACs this expensive, but for IPv6 addresses it will be literally impossible. In theory it should be possible to do this with a self-join instead, between the current set of values and the "next one", e.g., mac and mac + 1. That would at least limit the query to the size of the allocated entries, rather than these existing generate_series() queries, which consume space proportional to the entire possible range. It's strictly less memory, at least.