askfongjojo opened 2 weeks ago
The omicron commit deployed in the environment I used (rack2) was 90bc09ed0838dacb7a299b0ffbfd07feb4608ba7. The last time I ran the same terraform test was on commit d79a51d57bdf324947275841ac849f2b37edff3a and didn't hit this issue.
I suspect this subquery is the (or part of the) problem:
(
SELECT
$40 + "shift" AS "mac"
FROM
(
SELECT
generate_series(0, $41) AS "index",
generate_series(0, $42) AS "shift"
UNION ALL
SELECT
generate_series($43, $44) AS "index",
generate_series($45, -1) AS "shift"
)
LEFT OUTER JOIN "network_interface" ON (
"vpc_id", "mac", "time_deleted" IS NULL
) = ($46, $47 + "shift", TRUE)
WHERE
"mac" IS NULL
ORDER BY
"index"
LIMIT
1
) AS "mac",
After substituting bind parameters, this becomes:
SELECT
184993468409456 + "shift" AS "mac"
FROM
(
SELECT
generate_series(0, 432527) AS "index",
generate_series(0, 432527) AS "shift"
UNION ALL
SELECT
generate_series(432528, 983039) AS "index",
generate_series(-550512, -1) AS "shift"
)
LEFT OUTER JOIN "network_interface" ON (
"vpc_id", "mac", "time_deleted" IS NULL
) = ('91a91cab-2fbd-4c1e-a91f-4f2bcae5705b', 184993468409456 + "shift", TRUE)
WHERE
"mac" IS NULL
ORDER BY
"index"
LIMIT
1
which has some pretty large generate_series() calls. Running just that query in isolation through EXPLAIN ANALYZE emits:
info
--------------------------------------------------------------------------------------------------------------------
planning time: 2ms
execution time: 431ms
distribution: full
vectorized: true
rows read from KV: 117 (8.6 KiB)
cumulative time spent in KV: 849µs
maximum memory usage: 25 MiB
network usage: 11 MiB (660 messages)
• render
│ nodes: n2
│ actual row count: 1
│ estimated row count: 1
│
└── • top-k
│ nodes: n2
│ actual row count: 1
│ estimated row count: 1
│ order: +"index"
│ k: 1
│
└── • filter
│ nodes: n1, n2
│ actual row count: 982,923
│ estimated row count: 1
│ filter: mac IS NULL
│
└── • hash join (right outer)
│ nodes: n1, n2
│ actual row count: 983,040
│ estimated max memory allocated: 50 MiB
│ estimated max sql temp disk usage: 0 B
│ estimated row count: 20
│ equality: (mac) = (column23)
│
├── • render
│ │ nodes: n1
│ │ actual row count: 117
│ │ KV time: 849µs
│ │ KV contention time: 0µs
│ │ KV rows read: 117
│ │ KV bytes read: 8.6 KiB
│ │ estimated max memory allocated: 20 KiB
│ │ estimated row count: 45
│ │
│ └── • scan
│ nodes: n1
│ actual row count: 117
│ KV time: 849µs
│ KV contention time: 0µs
│ KV rows read: 117
│ KV bytes read: 8.6 KiB
│ estimated max memory allocated: 20 KiB
│ estimated row count: 45 (0.34% of the table; stats collected 3 days ago)
│ table: network_interface@network_interface_vpc_id_mac_key (partial index)
│ spans: [/'91a91cab-2fbd-4c1e-a91f-4f2bcae5705b' - /'91a91cab-2fbd-4c1e-a91f-4f2bcae5705b']
│
└── • render
│ nodes: n2
│ actual row count: 983,040
│ estimated row count: 20
│
└── • union all
│ nodes: n2
│ actual row count: 983,040
│ estimated row count: 20
│
├── • project set
│ │ nodes: n2
│ │ actual row count: 432,528
│ │ estimated row count: 10
│ │
│ └── • emptyrow
│ nodes: n2
│ actual row count: 1
│
└── • project set
│ nodes: n2
│ actual row count: 550,512
│ estimated row count: 10
│
└── • emptyrow
nodes: n2
actual row count: 1
(84 rows)
Time: 435ms total (execution 434ms / network 1ms)
Note the actual memory usage was 25 MiB and the estimated max memory was 50 MiB. Our total query memory budget is only ~128 MiB IIRC, so a few of these running concurrently would max us out. The 983039 value is the distance between MacAddr::MIN_GUEST_ADDR and MacAddr::MAX_GUEST_ADDR.
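The constants themselves aren't shown here, but the relationship can be reconstructed from the bind parameters in the substituted query above. A minimal Python sketch (treating 184993468409456 as the randomly chosen starting MAC, per the query):

```python
# Numbers taken directly from the bind parameters in the query above.
base = 184993468409456   # $40/$47: starting MAC chosen for the search
pos_shift = 432527       # $41/$42: shifts 0..pos_shift cover base..MAX_GUEST_ADDR
neg_shift = -550512      # $45: shifts neg_shift..-1 wrap around to MIN_GUEST_ADDR..base-1

max_guest = base + pos_shift
min_guest = base + neg_shift

print(max_guest - min_guest)           # 983039: the MIN..MAX distance
print((pos_shift + 1) + (-neg_shift))  # 983040: rows emitted by the two generate_series calls
```

So the UNION ALL materializes one row per address in the entire guest MAC range, regardless of how many addresses are actually allocated.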
@jgallagher @bnaecker - Thank you for digging into this. The logic for locating an available MAC address hasn't changed recently (or for a long time), so it's possible that, because we now have many more background processes running and making database queries, this expensive subquery has become more likely to hit the memory budget limit.
I wonder if we need to re-evaluate the allocated memory budget, besides tuning this particular subquery.
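As a rough sanity check on that concern, using the figures from the EXPLAIN ANALYZE above (25 MiB actual, 50 MiB estimated per query) against the assumed ~128 MiB total budget:

```python
# Back-of-the-envelope concurrency math; the 128 MiB budget is the
# approximate figure quoted above, not a measured value.
budget_mib = 128
actual_per_query = 25     # actual max memory from EXPLAIN ANALYZE
estimated_per_query = 50  # planner's estimated max memory

print(budget_mib // actual_per_query)     # 5 concurrent queries at actual usage
print(budget_mib // estimated_per_query)  # 2 concurrent queries at the estimate
```

Either way, a handful of concurrent instance creations is enough to exhaust the budget.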
cc @smklein
I've been mulling on this for a bit. The next-item style queries all work the same way: join the existing entries with all possible entries, and take the first row where the existing entry is NULL. That is, select the first possible entry that's not already in the table.
This is fine, but obviously expensive, because we form the full set of all possible values. For things like MACs this is expensive; for IPv6 addresses it will be literally impossible. In theory it should be possible to do this with a self-join instead, between the current set of values and the "next one", e.g., mac and mac + 1. That would at least limit the query to the size of the allocated entries, rather than these existing generate_series() queries, which consume space proportional to the entire possible range. It's strictly less memory, at least.
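A sketch of that gap-scan idea in Python (not the actual SQL; next_free and its range arguments are made up for illustration): walk the allocated values in order and stop at the first hole, so the cost scales with the number of allocated entries rather than the size of the range.

```python
def next_free(allocated, lo, hi):
    """Return the smallest value in [lo, hi] not present in `allocated`,
    found by scanning allocated values for the first gap instead of
    materializing the entire [lo, hi] range."""
    prev = lo - 1
    for v in sorted(allocated):
        if v > hi:
            break
        if v > prev + 1:
            break  # found a hole between prev and v
        prev = v
    nxt = prev + 1
    return nxt if nxt <= hi else None  # None: range fully allocated

print(next_free({10, 11, 13}, 10, 20))  # 12: the first hole in the allocated set
```

The SQL analogue would join each allocated mac against mac + 1 and pick the first successor that's missing, which touches only allocated rows.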
While creating 70 instances concurrently using the same Terraform plan I used many times before without any problem, I'm now hitting a number of 500 errors in every run consistently. The errors in nexus log all look like the one below:
Using this D script against the nexus process running in sled 8, I was able to capture the database error in question:
It is however an insert statement and didn't appear to be the culprit. I had a look at 3 different occurrences of the 500 errors and tried to pinpoint the more expensive queries running around the same time. Here is one of those (latency = 346542 us):
(A grep of this query pattern shows the latency was as high as 1523657 us during the window when I was running terraform apply and tracing the database queries.) The complete nexus log and db query tracing output are located in catacomb:/staff/dogfood/omicron-5904.