Closed david-crespo closed 6 months ago
After taking a looking at this at @luqmana's suggestion as part of #5674, I'm realizing this is a bit weird, though it still might be worthwhile. When we say we want to follow the "specified pool", there is no general specified pool for the instance at create time: the pool is specified for the ephemeral IP. So is it right to say that the SNAT IP should always come from the same pool as the ephemeral IP (if there is one)? And if there is no ephemeral IP asked for (which you can do by leaving the external IPs array empty) we will use the default pool.
https://github.com/oxidecomputer/omicron/issues/4317 points out that if you have an ephemeral IP, OPTE will use that and you don't really need a SNAT IP. To me that points in the opposite direction of this issue: If there is an ephemeral IP specified, don't allocate a SNAT IP. Only if there is no ephemeral IP specified would you need the SNAT IP, in which case you would use the default pool. So SNAT IPs would always come from the default pool.
I guess then it comes down to do we want to allow operators a choice of pool for SNAT IPs?
I have been assuming that 99+% of the time there will only be one pool linked to a silo and it will be the default, so the question of picking pools is usually moot. If that is not true and we expect real use cases for multiple pools in a silo, getting a better picture of the reason for that will help a lot.
There was some question recently in chat about quotas for IPs and pools sorta being a way of doing that.
(cc @augustuswm who's been managing this on colo and might have some perspective)
So an immediate use case for this would be with the colo rack where we want users to have a choice between our public IPs and VPN IPs. The plan is to have at least two pools here linked to the same silo.
If we can use the selected ephemeral IP for SNAT that would be ideal as it would be a large reduction in our current IP usage. It would also keep traffic to those instances on the IPs we expect.
While this is our solution for today, I'm not sure if this is what we want in the future state to expose to end users. IP pools seem like a very coarse grained tool and in a silo we would want to provide some way to carve up those available IPs amongst projects, vpcs, or maybe subnets.
So let me give a bit of context to how this was meant to fit together a bit as some of what was intended doesn't really exist:
So to start with in the original take on RFD 21 we had a notion of the Internet Gateway. The Internet Gateway critically was meant to represent the thing that got you to the Internet like how AWS has the NAT gateway. Internet gateways semantically represent a few things:
That's the first bit. The second bit that's relevant is that the original intent was that if you had an ephemeral or floating IP you would have no SNAT assigned to you. As a speed up to avoid dealing with allocating this at instance start time, this was originally implemented as instance create time.
So from an original design perspective, no SNAT shouldn't come from the chosen pool, it should fall back to the IP pool on the Internet gateway which is something explicit, but probably would inherit the silo default at creation time. However, I want to acknowledge that we're not in that original design. Additionally, in early discussions my assumptions were that we'd end up with the following additional concepts that I think in part get to @augustuswm is hinting at: critically that projects would have their own allow or deny lists of the IP Pools, images, and resources that they can use and their own defaults.
What @augustuswm is wanting to do would work well in the original model where the Internet gateway's default IP pool is the VPN IP pool; however, instances could ask for an external address from the public IP pool.
I'm not sure how much this makes sense or not, but hopefully provides some context. I'm less certain what the current answer should be. I think the reason this feels odd is due to the lack of Internet Gateway.
@rmustacc - In the use case @augustuswm mentioned, if we don't change the current implementation, assuming that the public IP pool is the default pool and the user has chosen to provision a VM with its ephemeral IP allocated from the VPN IP pool - with the intention of it being to able to reach other VMs accessible via VPN, would it be able to do so if its SNAT IP is allocated from the default/public IP pool?
According to this https://github.com/oxidecomputer/omicron/issues/4317 it should use the ephemeral IP in that case anyway. We need to go and confirm that in the code.
If that’s true, I’m thinking the simpler and more future-oriented version of this change is to not bother allocating a SNAT IP when there is an ephemeral IP.
@rmustacc - In the use case @augustuswm mentioned, if we don't change the current implementation, assuming that the public IP pool is the default pool and the user has chosen to provision a VM with its ephemeral IP allocated from the VPN IP pool - with the intention of it being to able to reach other VMs accessible via VPN, would it be able to do so if its SNAT IP is allocated from the default/public IP pool?
The SNAT should never be used if any external IP is allocated and assigned to the instance whether ephemeral or floating.
If that’s true, I’m thinking the simpler and more future-oriented version of this change is to not bother allocating a SNAT IP when there is an ephemeral IP.
Yeah, I think that's fine as long as when that IP is removed we do the right thing and allocate the SNAT then. That was the original long term plan. Similarly when an instance has a floating or ephemeral IP assigned then the SNAT should be removed (of course this is all assuming that it can happen while the instance is up).
Ok, that's interesting. A bit more complicated, but very doable. I will update the issues accordingly and probably close my PR.
Ok, closing this because I think the right direction will be to instead
Right now the SNAT IP always comes from the default pool for the current silo, even when the user has specified they want an external IP from a different pool. We want to change the following logic to use the specified pool instead.
That change is simple, but there’s another wrinkle: what to do about existing instances where the SNAT IP and ephemeral IP come from different pools. This could lead to confusion later when an operator tries to delete an IP range with outstanding IPs. There’s a good chance there aren’t any such instances because nearly all instances are created with the default pool. But it might be worth checking somehow, just so we know. Longer term we will want (and have plans to build) ways of telling exactly where allocated IPs actually are.
https://github.com/oxidecomputer/omicron/blob/ce7c8d3c179b34fb1b1e19d91902fd7e59d05e62/nexus/src/app/sagas/instance_create.rs#L646-L655