k3s-io / k3s

Lightweight Kubernetes
https://k3s.io

servicelb with externalTrafficPolicy: Local forwards traffic from all nodes to a pod #9592

Closed: dansimko closed this issue 6 months ago

dansimko commented 6 months ago

Environmental Info:
K3s Version: v1.28.7-rc3+k3s1 (running the RC because v1.28.6 has a bug starting up without flannel)
Node(s) CPU architecture, OS, and Version: x86_64, AlmaLinux 9.3
Cluster Configuration: 1 server, 1 agent

Describe the bug: When exposing a service of type LoadBalancer with externalTrafficPolicy: Local, the svclb pods that capture the exposed ports are scheduled on all nodes in the cluster, and traffic arriving at any node is forwarded to a pod running on one of the nodes.

Steps To Reproduce:
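A minimal Service manifest matching this setup, reconstructed from the describe output below (selector, port, and traffic policy are taken from that output; the rest is a sketch):

apiVersion: v1
kind: Service
metadata:
  name: test-service-nginx
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local        # route external traffic only to node-local endpoints
  selector:
    app.kubernetes.io/instance: test-service
    app.kubernetes.io/name: nginx
  ports:
    - name: http
      port: 8080
      targetPort: http                # assumes the backing pod names its container port "http"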

Expected behavior: A load balancer attached to a service with externalTrafficPolicy: Local should route only node-local traffic to the pod.

Actual behavior: Traffic arriving at any node is routed to the pod, and requests received via other nodes show up in the pod as originating from client IP 10.42.0.0.

Additional context / logs:

The service shows the correct IP associated with it:

$ kubectl describe svc test-service-nginx 
Name:                     test-service-nginx
Namespace:                default
Labels:                   app.kubernetes.io/instance=test-service
                          app.kubernetes.io/name=nginx
Annotations:              <none>
Selector:                 app.kubernetes.io/instance=test-service,app.kubernetes.io/name=nginx
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       10.43.49.37
IPs:                      10.43.49.37
LoadBalancer Ingress:     <agent-node-public-ip>
Port:                     http  8080/TCP
TargetPort:               http/TCP
NodePort:                 http  31664/TCP
Endpoints:                10.42.1.47:8080
Session Affinity:         None
External Traffic Policy:  Local
HealthCheck NodePort:     32102
Events:
  Type    Reason                Age    From                Message
  ----    ------                ----   ----                -------
  Normal  EnsuringLoadBalancer  8m26s  service-controller  Ensuring load balancer
  Normal  AppliedDaemonSet      8m26s                      Applied LoadBalancer DaemonSet kube-system/svclb-test-service-nginx-3e5e41a6
  Normal  UpdatedLoadBalancer   8m9s                       Updated LoadBalancer with new IPs: [] -> [<agent-node-public-ip>]

servicelb pods get deployed to both nodes:

$ kubectl get pods -A -o wide
NAMESPACE       NAME                                            READY   STATUS    RESTARTS      AGE    IP           NODE                  NOMINATED NODE   READINESS GATES
...
kube-system     svclb-test-service-nginx-3e5e41a6-zvxg7         1/1     Running   0             5m6s   10.42.0.18   <master-node-host>    <none>           <none>
kube-system     svclb-test-service-nginx-3e5e41a6-z7cc9         1/1     Running   0             5m6s   10.42.1.46   <agent-node-host>     <none>           <none>
default         test-service-nginx-d497bb98f-vmd66              1/1     Running   0             5m7s   10.42.1.47   <agent-node-host>     <none>           <none>

Accessing the service via the node the pod is running on produces this log entry in the pod:

<actual_client_ip> - - [28/Feb/2024:18:44:49 +0000] "GET / HTTP/1.1" 200 409 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:123.0) Gecko/20100101 Firefox/123.0" "-"

If accessed through the other node:

10.42.0.0 - - [28/Feb/2024:18:45:02 +0000] "GET / HTTP/1.1" 200 409 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:123.0) Gecko/20100101 Firefox/123.0" "-"
brandond commented 6 months ago

> Normal UpdatedLoadBalancer 8m9s Updated LoadBalancer with new IPs: [] -> [<agent-node-public-ip>]

> Accessing the service at the node the pod is running on

Are you accessing the service via the ClusterIP, or the node IP listed in the loadbalancer status field? You should only access the LoadBalancer at the listed addresses. If you ignore the status and hit other nodes, you are not guaranteed to get a node with a pod on it.

If you are accessing the Service's ClusterIP directly from the nodes, then you are not accessing it "externally" and the externalTrafficPolicy is not respected. If you want to control whether or not you can access the service's ClusterIP from nodes that do not have a pod for the service, you would want to set internalTrafficPolicy: Local as well.

Ref: https://kubernetes.io/docs/concepts/services-networking/service-traffic-policy/
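In Service terms that is (a minimal sketch of the relevant spec fields, per the linked docs):

spec:
  externalTrafficPolicy: Local   # external traffic is only delivered to endpoints on the receiving node
  internalTrafficPolicy: Local   # in-cluster ClusterIP traffic is only delivered to node-local endpoints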

dansimko commented 6 months ago

I was accessing the service via the listed node IP. The service, however, can be accessed through the other node's IP, too (in which case the information about the origin IP is lost).

In this case I was deploying an SMTP server with 10.42.0.0/16 as a trusted network, which resulted in the service being exposed to the internet as an open SMTP relay when accessed through the other node's IP (which is not listed in the service).

brandond commented 6 months ago

I think you can probably get the behavior you want if you set both policies to local. Give that a try.
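For example, this could be patched onto the existing service (a sketch, using the service name from the earlier output):

$ kubectl patch svc test-service-nginx -p '{"spec":{"externalTrafficPolicy":"Local","internalTrafficPolicy":"Local"}}'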

dansimko commented 6 months ago

I really appreciate you getting back to me so quickly. Unfortunately, setting internalTrafficPolicy: Local did not seem to make a difference. Nevertheless, I would have expected that as soon as externalTrafficPolicy: Local is set, servicelb pods would be deployed only to the nodes where the pods the service targets actually run. With the current behavior, servicelb occupies the specified ports on all nodes indiscriminately (aside from the global label filter), and incoming traffic on every node is directed to the pod's hostIP, possibly introducing an additional hop between nodes, which should not happen according to the k8s docs.

dansimko commented 6 months ago

I have worked around this by applying the svccontroller.k3s.cattle.io/enablelb label to all nodes and svccontroller.k3s.cattle.io/lbpool=<poolname> to the nodes the pods are running on as well as to the corresponding service. It would be nice if this sort of behavior were better automated. Also, wouldn't it be more appropriate to use annotations instead of labels to attach this parameter to the service?
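Roughly as commands, the workaround looks like this (a sketch; the pool name mypool is made up, node names are the placeholders from above):

$ kubectl label node <master-node-host> <agent-node-host> svccontroller.k3s.cattle.io/enablelb=true
$ kubectl label node <agent-node-host> svccontroller.k3s.cattle.io/lbpool=mypool
$ kubectl label svc test-service-nginx svccontroller.k3s.cattle.io/lbpool=mypool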

brandond commented 6 months ago

No, you can't use annotations as selectors when querying the Kubernetes API. If we did that, we would have to manually filter the ones we want instead of just doing a get with a labelSelector.

brandond commented 6 months ago

I looked at trying to improve this a while back by putting affinities on the svclb pods that constrained them to only run on nodes with pods for the backing service, but it didn't really work. Because we use a DaemonSet, it ended up leaving a bunch of pending pods that didn't meet the scheduler requirements. The scheduler also won't evict pods if the affinity is no longer met, so you could end up with svclb pods left on a node that no longer had any backing pods.

Arc-2023 commented 5 months ago

I am using K3s's svclb as a load balancer and Traefik as an ingress controller. To achieve the effect where traffic from outside only goes through the svclb DaemonSet and Traefik pod on the current node, setting internalTrafficPolicy to Local is effective. However, I've noticed that adding externalTrafficPolicy: Local on top of this negates that effect, and SNAT still happens.

brandond commented 5 months ago

InternalTrafficPolicy doesn't impact ServiceLB at all, as traffic from within the cluster shouldn't be hitting the external LB anyway. It should just go directly to the ClusterIP service without passing through the svclb pods.

ExternalTrafficPolicy: Local should retain the source address, as it doesn't take the extra hop through kube-router that adds SNAT to support routing the traffic from one node to another. Can you show where you're seeing otherwise?

sambartik commented 1 month ago

Has the original problem been resolved? I am facing a very similar issue to the author's. When accessing the port via the external IP of a node that doesn't run the pod the load balancer targets, the packets are forwarded and the client's IP is lost in the process. Is there a way to block such access completely?

brandond commented 1 month ago

@sambartik can you confirm that you've correctly set the ExternalTrafficPolicy on your service, as described above? If you do that, you shouldn't be able to hit the service at all if it doesn't have a replica on that node...
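One way to double-check what is actually set on the service (a sketch; substitute your service name):

$ kubectl get svc <service-name> -o jsonpath='{.spec.externalTrafficPolicy}{" "}{.spec.internalTrafficPolicy}{"\n"}'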

sambartik commented 1 month ago

@brandond Yes, I did set the ExternalTrafficPolicy to Local and it did have an effect.

To paint a clearer picture, this is my test k3s setup with servicelb enabled:
Node 1, with external IP 1.1.1.1
Node 2, with external IP 2.2.2.2

Now, I have created the Deployment and LoadBalancer Service resources as the original author described. As a result, there is a single pod scheduled on only one of the nodes mentioned above. Let's say the nginx pod that listens on port 8080 was scheduled on Node 1.

1) Trying to access the service from outside the cluster via Node 1 IP: curl http://1.1.1.1:8080 works as expected and the nginx pod logged my real client IP.

2) However, the problem arises when we try to access the service via Node 2 IP: curl http://2.2.2.2:8080. The nginx pod no longer sees my real client IP, but rather an internal cluster IP. This poses a problem for me.

I do know that ExternalTrafficPolicy set to "Local" had an effect, because if it had been set to "Cluster", scenario 1) wouldn't have worked. Also, when it was set to "Local", the external IP shown in kubectl get svc was a list of the IPs of the nodes running the nginx pod (so in my example, the Node 2 external IP was not shown there).

Please let me know if you need any more information. I would like to see this issue resolved, as it also causes an open-relay problem for me.