Nexus needs to be aware of ASIC limits

bnaecker commented 2 weeks ago

This came up in chat. On the colo rack, @augustuswm saw some failures to completely sync the NAT tables for VMs from Nexus to Dendrite. The ultimate cause appears to be that we're hitting the NAT table size limit in Dendrite. Today that limit is 1024 NAT entries, counted separately for IPv4 and IPv6 addresses.

From the Dendrite logs:

17:27:20.843Z ERRO dpd: failed to add nat entry 172.21.252.58/[16384-32767] -> fd00:1122:3344:10a::1/a8:40:25:f3:40:d3/15436040: TableFull("pipe.Ingress.nat_ingress.ingress_ipv4")
17:27:20.843Z ERRO dpd: unable to create nat entry
    error = TableFull("pipe.Ingress.nat_ingress.ingress_ipv4")

As @internet-diglett pointed out, this is in a Dendrite RPW (not Nexus), and since dpd is pulling the list of NAT entries periodically, there is no way for Nexus to really learn about this error via a response code or similar.

We need to track table size limits like this in Nexus, and refuse to provision resources that would exceed them (external IP addresses in this case). It's not entirely clear how to do that in a way that doesn't make upgrades more difficult. But in the short term, a set of constants in Nexus that mirror those in Dendrite seems reasonable, if brittle. Avoiding that kind of cross-consolidation dependency in the longer term will be very important.

bnaecker commented 1 week ago

In the control plane huddle on 5 Nov 2024, we decided to implement this in the short term by hard-coding the limit in Nexus and failing any request that would exceed it with a 400. I'm taking this for now, but it is easy, and anyone looking for a good first issue in Nexus is welcome to work-steal.
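
For anyone picking this up, here is a minimal sketch of what the hard-coded check could look like. It is not taken from Nexus: the names `MAX_NAT_ENTRIES_PER_FAMILY`, `AddressFamily`, `LimitExceeded`, and `check_nat_capacity` are all hypothetical, and the only value grounded in this thread is the 1024-entry limit per address family. How the in-use count is obtained, and how the error maps onto the real Nexus error types, is left open.

```rust
/// Hypothetical constant mirroring the Dendrite NAT table size mentioned in
/// this issue: 1024 entries, counted separately for IPv4 and IPv6.
pub const MAX_NAT_ENTRIES_PER_FAMILY: usize = 1024;

/// Which NAT table a request would consume an entry in.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AddressFamily {
    V4,
    V6,
}

/// Error describing a request that would overflow a NAT table.
/// (Assumed shape; the real Nexus error type will differ.)
#[derive(Debug, PartialEq, Eq)]
pub struct LimitExceeded {
    pub family: AddressFamily,
    pub limit: usize,
    pub requested_total: usize,
}

impl LimitExceeded {
    /// The huddle decision was to fail such requests with a 400.
    pub fn http_status(&self) -> u16 {
        400
    }
}

/// Returns the new total if `requested` more entries fit in the table for
/// `family`, given `in_use` entries already allocated; otherwise an error.
pub fn check_nat_capacity(
    family: AddressFamily,
    in_use: usize,
    requested: usize,
) -> Result<usize, LimitExceeded> {
    // Saturating add so a huge request cannot wrap around and pass the check.
    let total = in_use.saturating_add(requested);
    if total > MAX_NAT_ENTRIES_PER_FAMILY {
        Err(LimitExceeded {
            family,
            limit: MAX_NAT_ENTRIES_PER_FAMILY,
            requested_total: total,
        })
    } else {
        Ok(total)
    }
}
```

The check is deliberately a pure function of the current count, so it can be called wherever external IPs are allocated, before anything is written.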

bnaecker commented 2 days ago

While working on this, I hit #6394. Here's some context on the intersection of these two issues.

To effectively track ASIC table utilization, I think we need to know the switch we're considering as we account for entries on its tables. If that's true, I think it's probably blocked on having information about the actual switches in the database.

It's possible that we can partially solve this issue without resolving #6394, however. If the only ASIC table we want to track is the one storing NAT entries (e.g., for translating internal to external IPs for instances), then we don't strictly need the switch ID -- that's because we always add those entries to both switches, so the accounting for those is the same.
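
To illustrate why the switch ID isn't needed for the NAT case, here is a rough sketch of switch-agnostic accounting. Because the same entries go to both switches, one counter per address family stands in for both tables. Everything here is hypothetical (`NatUsage`, `try_reserve`, `release`, `NAT_TABLE_LIMIT`), and it assumes one NAT entry per reserved external address, which may not match how entries are actually carved up by port range.

```rust
use std::net::IpAddr;

/// Hypothetical per-family limit, mirroring the 1024-entry Dendrite tables.
pub const NAT_TABLE_LIMIT: usize = 1024;

/// A single set of counters covers both switches, since NAT entries are
/// assumed to be installed symmetrically on each.
#[derive(Debug, Default)]
pub struct NatUsage {
    v4: usize,
    v6: usize,
}

impl NatUsage {
    /// Select the counter for the address family of `ip`.
    fn slot(&mut self, ip: &IpAddr) -> &mut usize {
        match ip {
            IpAddr::V4(_) => &mut self.v4,
            IpAddr::V6(_) => &mut self.v6,
        }
    }

    /// Account for one more NAT entry for `external_ip`.
    /// Returns false, leaving the counters unchanged, if the table is full.
    pub fn try_reserve(&mut self, external_ip: IpAddr) -> bool {
        let slot = self.slot(&external_ip);
        if *slot >= NAT_TABLE_LIMIT {
            return false;
        }
        *slot += 1;
        true
    }

    /// Give back one entry, e.g. when an external IP is detached.
    pub fn release(&mut self, external_ip: IpAddr) {
        let slot = self.slot(&external_ip);
        *slot = slot.saturating_sub(1);
    }

    /// Current (IPv4, IPv6) entry counts.
    pub fn in_use(&self) -> (usize, usize) {
        (self.v4, self.v6)
    }
}
```

The routing and address tables would not fit this shape, since their contents can differ per switch and per switch port.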

However, if we want to track the routing and / or address tables specifically, we do need to know what switches are available. That is because things like addresses are assigned to a specific switch port on a specific switch, and need not be symmetric between them.

I'm tempted to reduce the scope of this issue to just the NAT tables, which would mean we're not blocked on #6394. We also can't really track the routing table utilization effectively at all, because nearly all the entries will be populated by the BGP daemon, completely out of band of Nexus. (That's its own issue, which we probably want to resolve too.) The address tables are somewhere in between -- tfportd populates entries automatically for every underlay address and the techport bootstrap addresses; uplinkd populates entries for the front-facing switch ports; and Omicron also populates entries via RSS and any Nexus-mediated switch port settings object (we rarely do the latter, it's mostly RSS).

If we reduce this to just NAT, we could also possibly piggyback on the existing table and code that does virtual resource accounting. That's also very attractive.

oxidecomputer / omicron

Nexus needs to be aware of ASIC limits #6939