k8snetworkplumbingwg / whereabouts

A CNI IPAM plugin that assigns IP addresses cluster-wide
Apache License 2.0

CRDv2 data model #51

Open crandles opened 4 years ago

crandles commented 4 years ago

The current data model for the IPPool CRD stores allocations for a given range in a single Kubernetes resource.

Known Issues

Under certain failure scenarios, IP allocations may become orphaned.

Open Questions

Is it important to support allocating the same IP in multiple ranges?

Idea

TBD: come up with a draft YAML spec/examples. Some rough ideas for now:

Moving to IP as the "base" resource type solves the problem of preventing duplicate IP allocations from occurring in overlapping ranges, but introduces a new one: how do we easily query Kubernetes to determine the next available IP? The IPPool type is a useful bucket that we would be removing.

Can we solve querying IPs from a given range using well-crafted labels? e.g. `kubectl get ip -l subnet_31=127.0.0.0` (?) We still need to determine the label scheme and consider IPv6.
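To make the label idea concrete, here is a rough sketch (not existing whereabouts code) of how per-prefix labels could be derived for an allocation; the `subnet_N` key format simply follows the example query above:

```go
package main

import (
	"fmt"
	"net"
)

// subnetLabels returns one label per prefix length, e.g. "subnet_31" -> "127.0.0.0",
// so a query like `kubectl get ip -l subnet_31=127.0.0.0` can select every IP CR in
// that subnet. For IPv4 this is up to 32 labels; the same scheme would need up to 128
// labels for IPv6, and IPv6 values would also need encoding since ':' is not a legal
// character in a Kubernetes label value.
func subnetLabels(ip net.IP) map[string]string {
	labels := map[string]string{}
	bits := 128
	if v4 := ip.To4(); v4 != nil {
		ip = v4
		bits = 32
	}
	for prefix := 1; prefix <= bits; prefix++ {
		network := ip.Mask(net.CIDRMask(prefix, bits))
		labels[fmt.Sprintf("subnet_%d", prefix)] = network.String()
	}
	return labels
}

func main() {
	fmt.Println(subnetLabels(net.ParseIP("127.0.0.1"))["subnet_31"]) // prints 127.0.0.0
}
```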

Given we have a single resource type associated with an allocation, we can leverage Kubernetes' built-in garbage collection to address the orphaned IP allocation problem.

By configuring the pod as the owner of the IP allocation resource, we can instruct Kubernetes to automatically delete the resource when the pod is deleted. We may still clean up our resources via the CNI DEL, but this would serve as a fall-through to prevent IP exhaustion (with no operator, cron, or other process necessary).
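A minimal sketch of the owner-reference approach, assuming a hypothetical per-IP `IPAllocation` CRD (the group/version, kind, and naming scheme below are illustrative, not the project's agreed design); building the CR as unstructured keeps the sketch free of generated types:

```go
package allocation

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// newIPAllocationFor builds a per-IP allocation CR owned by the pod, so the
// Kubernetes garbage collector deletes the CR when the pod is deleted, even if
// the CNI DEL never runs.
func newIPAllocationFor(pod *corev1.Pod, ip string) *unstructured.Unstructured {
	u := &unstructured.Unstructured{}
	u.SetAPIVersion("whereabouts.cni.cncf.io/v1alpha2") // hypothetical group/version
	u.SetKind("IPAllocation")                           // hypothetical kind
	u.SetNamespace(pod.Namespace)
	u.SetName(ip) // e.g. "192.168.2.10"; the real naming scheme is TBD
	u.SetOwnerReferences([]metav1.OwnerReference{{
		APIVersion: "v1",
		Kind:       "Pod",
		Name:       pod.Name,
		UID:        pod.UID,
	}})
	return u
}
```

With that owner reference in place, deleting the pod removes the allocation CR via garbage collection, with no operator, cron, or other process necessary, which is the fall-through behaviour described above.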

We should be able to create either a namespace-scoped or a cluster-scoped client: I think either might make sense, but it would not make sense to use both concurrently. This should be configured in the whereabouts IPAM config.
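If it helps the discussion, the scope choice could be as small as one field in the IPAM config; a sketch with a hypothetical field name (not an existing whereabouts option):

```go
package config

// IPAMConfig is a trimmed, illustrative subset of the whereabouts IPAM config.
type IPAMConfig struct {
	Type  string `json:"type"`  // "whereabouts"
	Range string `json:"range"` // e.g. "192.168.2.0/24"

	// AllocationScope selects whether allocation CRs are namespace-scoped
	// (created alongside the pod) or cluster-scoped. Hypothetical field;
	// the exact name and values are TBD.
	AllocationScope string `json:"allocationScope,omitempty"`
}
```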

Limitations

  1. Moving to such a model would not make it easy to support overlapping IP ranges; we may have to drop that use case. We could potentially keep multiple CRD versions around (v1alpha1, v1beta1, etc.) if this were an important use case.
  2. This does not apply well to the etcd datastore, as the approach relies on Kubernetes to perform the garbage collection.
dougbtv commented 4 years ago

+1 re: configuring the pod as the owner of the IP allocation; that's quite excellent, great suggestion.

Definitely up for this change / refactor. Thanks for the outline, looking forward to the CRD proposal.

Another consideration: the upgrade path (I'll think on this one, too).

dougbtv commented 4 years ago

Another quick thought on the labels: I have this idea about a "sticky IP address using MAC address", so if a workload comes back up with the same MAC address, it gets the same IP address.

This could be a separate store/CRD, but... we could label the IP address CR with a MAC address to query for it. We could also release the ownership when this is used so that the IP CR sticks around, too.
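A rough sketch of the lookup side of that idea, assuming a hypothetical per-IP `ipallocations` resource and label key; note that a raw MAC cannot be used as a label value because ':' is not allowed, so it is dash-encoded here:

```go
package sticky

import (
	"context"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

const macLabel = "whereabouts.cni.cncf.io/mac" // hypothetical label key

// findStickyIP lists any existing allocation CRs labelled with this MAC so the
// workload can be handed the same address again. Releasing the pod owner
// reference when this mode is enabled (not shown) is what lets the CR outlive
// the pod.
func findStickyIP(ctx context.Context, c dynamic.Interface, namespace, mac string) (*unstructured.UnstructuredList, error) {
	gvr := schema.GroupVersionResource{
		Group:    "whereabouts.cni.cncf.io",
		Version:  "v1alpha1",
		Resource: "ipallocations", // hypothetical plural of the per-IP CRD
	}
	return c.Resource(gvr).Namespace(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: macLabel + "=" + strings.ReplaceAll(mac, ":", "-"),
	})
}
```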

crandles commented 3 years ago

> Moving to IP as the "base" resource type solves the problem of preventing duplicate IP allocations from occurring in overlapping ranges, but introduces a new one: how do we easily query Kubernetes to determine the next available IP? The IPPool type is a useful bucket that we would be removing.

I am having second thoughts about this idea (using a per-IP resource and labels to query for subnet allocations).

IPv6 subnets are very large, and there are many possible subnets; generating 128 labels to enable query lookups seems like a poor design. Additionally, I worry about how well this scales.

Alternatively: I believe Kubernetes resources are limited to roughly 1 MB in size (based on the etcd limit); the current pool implementation does not factor this in (should it?).
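For a rough sense of where that limit bites (the per-entry size here is purely an assumed figure, not a measurement of the current IPPool serialization):

```go
package main

import "fmt"

func main() {
	const bytesPerAllocation = 64   // assumed average size of one serialized allocation entry
	const objectSizeLimit = 1 << 20 // ~1 MB, per the etcd-based limit mentioned above
	// Roughly how many allocations fit in a single pool object under these assumptions.
	fmt.Println(objectSizeLimit / bytesPerAllocation) // 16384, roughly a /18 worth of IPv4 addresses
}
```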

I could imagine a CRD model that involves IP blocks owned by an IP pool; each block contains n addresses, and blocks are allotted to the pool as needed. This could be a sort of middle ground between sharing IPs between pools and avoiding a per-IP resource.

We would need to do some testing to find the right block sizes. The pool data type would hold metadata pointing to the full blocks and to the current block with free IPs.

IPv6 would need additional testing for the maximum number of sub-blocks, etc. It may be impossible to support many subnet sizes within 1 MB.
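To make the block idea easier to discuss, here is one possible shape expressed as Go types; every name, field, and the API version are illustrative only, not something the project has settled on:

```go
package v1alpha2 // hypothetical API version for the new data model

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// IPPool holds only metadata about its blocks rather than every allocation.
type IPPool struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              IPPoolSpec `json:"spec"`
}

type IPPoolSpec struct {
	Range        string   `json:"range"`                  // e.g. "10.0.0.0/16"
	BlockSize    int      `json:"blockSize"`              // allocations per block; needs testing to size well
	FullBlocks   []string `json:"fullBlocks,omitempty"`   // names of IPBlock CRs with no free IPs
	CurrentBlock string   `json:"currentBlock,omitempty"` // the IPBlock currently handing out IPs
}

// IPBlock holds up to BlockSize allocations and is owned by its IPPool, so
// deleting the pool cleans up its blocks.
type IPBlock struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Allocations       map[string]string `json:"allocations,omitempty"` // IP -> pod/container reference
}
```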

> Is it important to support allocating the same IP in multiple ranges?

This still isn't clear. Should it be optimized for?

> Under certain failure scenarios, IP allocations may become orphaned.

I think we can still leverage Pod owner references + garbage collection, combined with an operator + finalizer to ensure the Pool/blocks are maintained.
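For the operator + finalizer piece, a sketch of what the reconcile-side handling could look like, assuming a controller-runtime based operator; the finalizer name, helper, and overall flow are hypothetical:

```go
package controller

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

const allocFinalizer = "whereabouts.cni.cncf.io/return-to-pool" // hypothetical finalizer

// reconcileAllocation would be called from the operator's reconcile loop with an
// already-fetched allocation CR (typed as client.Object to keep the sketch free
// of generated CRD types).
func reconcileAllocation(ctx context.Context, c client.Client, alloc client.Object) (ctrl.Result, error) {
	if alloc.GetDeletionTimestamp().IsZero() {
		// Not being deleted: make sure the finalizer is present so the
		// pool/block bookkeeping always runs before the CR disappears.
		if !controllerutil.ContainsFinalizer(alloc, allocFinalizer) {
			controllerutil.AddFinalizer(alloc, allocFinalizer)
			return ctrl.Result{}, c.Update(ctx, alloc)
		}
		return ctrl.Result{}, nil
	}
	// Being deleted, e.g. because the owning pod was garbage collected:
	// return the IP to its block/pool, then drop the finalizer.
	if controllerutil.ContainsFinalizer(alloc, allocFinalizer) {
		if err := returnIPToPool(ctx, c, alloc); err != nil {
			return ctrl.Result{}, err
		}
		controllerutil.RemoveFinalizer(alloc, allocFinalizer)
		return ctrl.Result{}, c.Update(ctx, alloc)
	}
	return ctrl.Result{}, nil
}

// returnIPToPool stands in for the real bookkeeping that would mark the address
// free again in its IPBlock/IPPool.
func returnIPToPool(ctx context.Context, c client.Client, alloc client.Object) error {
	return nil // placeholder
}
```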