tssurya opened this issue 4 months ago
/cc @joestringer
There are two things we are trying to clarify here: (1) how a policy behaves when its subject or peer is also a backend of a Service, and (2) how the Networks peer interacts with Service VIPs. For clusterIPs and externalIPs we know that policy is applied after the packet is DNATed, a.k.a. rewritten, so having a Service in front of the backends is effectively the same as not having one. The tricky part is loadBalancers, and that is what we are trying to address in this issue.
The text we have for case1:
// Note that presence of a Service object with this policy subject as its backend
// has no impact on the behavior of the policy applied to the peer
// trying to talk to the Service. It will work in the same way as if the
// Service didn't exist since policy is applied after ServiceVIP (clusterIP,
// externalIP, loadBalancerIngressIP) is rewritten to the backendIPs.
//
AND
//
// Note that presence of a Service object with this peer as its backend
// has no impact on the behavior of the policy applied to the subject
// trying to talk to the Service. It will work in the same way as if the
// Service didn't exist since policy is applied after ServiceVIP (clusterIP,
// externalIP, loadBalancerIngressIP) is rewritten to the backendIPs.
//
Text we have for case2:
Note that because policies are applied after Service VIPs (clusterIPs, externalIPs,
load balancer IPs) are rewritten to endpoint IPs, a Networks selector cannot match
such a VIP. For example, a Networks selector that denies traffic to the entire
service CIDR will not actually block any service traffic.
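To make case 2 concrete, here is a rough sketch of the kind of rule a user might expect to block service traffic but which, per the text above, would not. The names, priority, and the 10.96.0.0/12 service CIDR are made up, and the networks peer shape is the one proposed in #185:

```yaml
apiVersion: policy.networking.k8s.io/v1alpha1
kind: AdminNetworkPolicy
metadata:
  name: deny-service-cidr          # hypothetical example
spec:
  priority: 10
  subject:
    namespaces:
      matchLabels:
        team: restricted
  egress:
  - name: "deny-egress-to-service-cidr"
    action: Deny
    to:
    - networks:
      - 10.96.0.0/12               # assumed cluster service CIDR
      # By the time this rule is evaluated, the clusterIP/externalIP has
      # already been DNATed to a backend pod IP, so this CIDR never matches
      # and no service traffic is actually denied.
```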
In all these cases the wording is wrong for load balancers: on AWS, where the load balancer is exposed via a hostname and traffic leaves the cluster, what happens with the policy probably cannot be predicted at all, whereas on GCP/Azure, where we can hairpin the DNAT within the cluster without sending the traffic out, things will work differently.
I'd like to capture what was said here https://github.com/kubernetes-sigs/network-policy-api/pull/185#discussion_r1506114425 by @danwinship
Unfortunately, this isn't quite right; ClusterIPs and ExternalIPs are always rewritten before NP enforcement, but load balancer IPs sometimes aren't; in some cases, pod-to-LoadBalancer traffic must be sent to the load balancer rather than being processed locally. (This would happen either if the CloudProvider sets hostname rather than ip in the service's LoadBalancerStatus (like AWS), or if it sets ip but also uses the new ipMode: Proxy.)
And once a packet leaves the node and goes to an external load balancer, then all bets are off. When the packet comes back into the cluster, it's not even theoretically possible for the ANP implementation to reliably figure out its original source IP (since the LB may have transformed the connection in arbitrary ways).
(And of course, if the Service has endpoint IPs that are cluster-external, then a pod-to-LoadBalancer connection that goes to the LB would presumably then go directly to the cluster-external IPs, without coming back into the cluster at all, so there's no way the ANP implementation could enforce a Networks selector that blocked cluster-external IPs at that point.)
So I can think of four options:
FTR, note that in the cluster-ingress case (which is even worse), I feel like there's a chance the answer will end up being "delegate policy enforcement to the Gateway rather than trying to do it in ANP". I'm not sure if that has any implications for trying to decide what to do here.
(People may also want to consider how they expect ANPs to interact with other implementation-specific egress functionality (eg, EgressIPs), and whether that implies anything about the best choice for LB behavior.)
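To illustrate the distinction Dan draws above, these are roughly the three LoadBalancer status shapes in play (addresses and hostnames are made up; ipMode is the field added by KEP-1860):

```yaml
# AWS-style: only a hostname is published, so pod-to-LB traffic has to
# leave the cluster and go through the load balancer itself.
status:
  loadBalancer:
    ingress:
    - hostname: a1b2c3d4.elb.amazonaws.com

# ip with ipMode: VIP (the default): kube-proxy / the CNI may DNAT the LB IP
# to backend IPs inside the cluster, so policy sees the rewritten packet.
status:
  loadBalancer:
    ingress:
    - ip: 203.0.113.10
      ipMode: VIP

# ip with ipMode: Proxy: the packet must actually be delivered to the load
# balancer, so it leaves the cluster before any rewrite the ANP
# implementation could observe.
status:
  loadBalancer:
    ingress:
    - ip: 203.0.113.10
      ipMode: Proxy
```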
@joestringer and Nathan Sweet also shared how this works in the Cilium world via https://github.com/kubernetes-sigs/network-policy-api/pull/185#discussion_r1506883310 and https://github.com/kubernetes-sigs/network-policy-api/pull/185#discussion_r1506848868
One small note I'd make is that I think that for policy ingress towards a Pod this is simple: I would never expect a service IP to be the peer for the subject of an ingress policy. For the traffic to arrive at the subject of the policy, it must be routed there, and typically that routing decision must be made using the actual IP of the subject. So it's mainly the egress policy case that is ambiguous.
So I had the exact same thoughts at first, but really it's the flip case for ingress, where the peer pod tries to talk to the serviceVIP that has the subject pod as its backend, right? In that case, if we have a policy in place between the peer pod and the subject pod, then it won't really respect this serviceVIP....
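A rough sketch of that flip case (all names and labels are hypothetical): a client pod connects to a ClusterIP whose backend is the subject pod. Because the VIP is rewritten before delivery, the ingress rule below is evaluated against the client pod itself, so a pods-based peer still matches even though the client addressed the Service, and a rule written against the VIP would never match anything:

```yaml
apiVersion: policy.networking.k8s.io/v1alpha1
kind: AdminNetworkPolicy
metadata:
  name: ingress-via-service-vip    # hypothetical example
spec:
  priority: 20
  subject:
    pods:                          # the Service's backend pods
      namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: backend
      podSelector:
        matchLabels:
          app: web
  ingress:
  - name: "allow-from-clients"
    action: Allow
    from:
    - pods:                        # matched by the client's real pod IP/labels,
        namespaceSelector:         # regardless of whether it dialed the pod IP
          matchLabels:             # or the Service's clusterIP
            kubernetes.io/metadata.name: clients
        podSelector:
          matchLabels:
            role: client
```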
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
What happened: Meeting notes from 27th Feb 2024:
RE: policies and services: the existing NetworkPolicy documentation says:
Cluster ingress and egress mechanisms often require rewriting the source or destination IP of packets. In cases where this happens, it is not defined whether this happens before or after NetworkPolicy processing, and the behavior may be different for different combinations of network plugin, cloud provider, Service implementation, etc.
For egress, this means that connections from pods to Service IPs that get rewritten to cluster-external IPs may or may not be subject to ipBlock-based policies.
- We should certainly make pod-to-pod work even if it's through a Service (clusterIP).
- We should get rid of "undefined" and "unspecified" behaviour for ANP; we shouldn't have the same ambiguity as we have for NetPol.
- Svc rewrite happens before netpol application => a rule that matches service IPs has no effect, since it will never actually match anything!
- Use case: block users from connecting to service x => block the endpoints instead? (see the sketch after these notes)
- CONSENSUS: write it down: don't do it for service VIPs! => clarify which IPs: clusterIPs, externalIPs, loadBalancerVIPs.
- Let's also update the NPEP and ensure we call this out!
- Cilium's implementation ignores the CIDR block entirely for in-cluster traffic: pods cannot be selected by IP, only by labels.
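Following on from the "block users from connecting to service x" note above, a minimal sketch of that workaround (select the backing pods rather than the VIP; all names, labels, and the priority are hypothetical):

```yaml
apiVersion: policy.networking.k8s.io/v1alpha1
kind: AdminNetworkPolicy
metadata:
  name: deny-to-x-backends         # hypothetical example
spec:
  priority: 15
  subject:
    namespaces:
      matchLabels:
        team: restricted-clients
  egress:
  - name: "deny-to-service-x-backends"
    action: Deny
    to:
    - pods:                        # deny the endpoints of service x directly,
        namespaceSelector:         # since a CIDR rule on the clusterIP would
          matchLabels:             # never match post-DNAT traffic
            kubernetes.io/metadata.name: x
        podSelector:
          matchLabels:
            app: x
```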
What you expected to happen:
We are trying to clarify at least the egress bits in the API change here: https://github.com/kubernetes-sigs/network-policy-api/pull/185
See ideas from @danwinship 's comment here: https://github.com/kubernetes-sigs/network-policy-api/pull/185#discussion_r1506114425
We need to zero in on an agreement and get that done in a separate PR so that the original PR can move forward first.