oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0

External IP allocation query will break with huge IP ranges #1371

Open bnaecker opened 2 years ago

bnaecker commented 2 years ago

The query for allocating external IPs from an IP Pool range currently generates the sequence of all IPs in that range. That's inefficient, but works for now. However, there's a latent bug here. The query builds the sequence of IP addresses by computing offsets, using generate_series(0, last_address - first_address), and adding those offsets to the first address. That subtraction will not fit in an i64 if the address range is huge. For example:

demo@127.0.0.1:26257/movr> select 'fd00::'::INET - 'fc00::'::INET;
ERROR: result out of range
SQLSTATE: 22003
demo@127.0.0.1:26257/movr>

That's not super likely, but it should be detected and handled correctly, by taking the maximum allowed offset from the first address. Something like:

demo@127.0.0.1:26257/movr> select if('fc00::'::INET + 9223372036854775807 > 'fd00::'::INET, 'fd00::'::INET, 'fc00::'::INET + 9223372036854775807) - 'fc00::'::INET;
       ?column?
-----------------------
  9223372036854775807
(1 row)

Time: 0ms total (execution 0ms / network 0ms)

demo@127.0.0.1:26257/movr>

That constant is i64::MAX, and can be supplied on the Rust side or in the database. The IF is needed because there's no maximum function for INET types.

bnaecker commented 2 years ago

This brings up a related and maybe larger question. The current query will actually not break, but will have horrible performance if the range is huge but doesn't overflow an i64. Because all subqueries are currently cached in their entirety in memory, CRDB will happily generate and store a sequence of i64::MAX - 1 addresses, assuming the system has that much memory. We need to also limit the size of ranges.

However, limiting the size of each range isn't enough on its own. The existing query searches over all ranges, of all pools, that are not reserved for another project. So even with a cap on each range, the size of the sequence generated by this query is still unbounded.

One way to do that would be to limit the size of each pool, and fail requests to add a new range if the resulting size would be larger than that limit. We'd also have to limit the number of pools, to have a hard cap on the size of that query.

Another option would be to rewrite the query. This has come up in a number of ways, but we would really like a way to write queries that return the next available item without generating the full sequence of available items. As an example, we'd keep track of the allocated IPs in rows that track the first and last free address; allocating the next IP is finding the row with the lowest free address and bumping that by 1. Freeing an address is harder though, since we'll have to handle coalescing ranges in a few different ways, depending on whether there are neighboring ranges.

bnaecker commented 2 years ago

I talked with @davepacheco about this. A short-term solution is to (1) limit the size of IP Pool ranges at this point, and (2) work in batches (both here and other similar queries). For the second part, we'd modify the query to accept a start and batch size, and then issue the same query with those as the first and second argument to the generate_series call which creates the sequence of IP addresses for each pool and range.

This changes the order in which the addresses will be allocated. Previously, it would be sequential over all pools and IP ranges. The addresses would now be striped across the ranges, though still sequential within each stripe. I don't think that's a problem, just a difference.

bnaecker commented 2 years ago

It occurs to me that this same problem will affect guest VPC-private addresses in the network_interface table as well. Those use a NextItem query, which internally does generate_series(0, n_addresses_in_subnet - 1). When we actually support IPv6 addresses for those interfaces, that query will never complete, given CRDB's implementation of materializing all subqueries. This needs to be fixed, e.g., by: