k0sproject / k0s

k0s - The Zero Friction Kubernetes
https://docs.k0sproject.io

Kubernetes cronjob cannot access database service using network policy #2261

Closed jsalgado78 closed 1 year ago

jsalgado78 commented 1 year ago

Platform

# uname -srvmo; cat /etc/os-release || lsb_release -a
Linux 4.18.0-372.9.1.el8.x86_64 #1 SMP Fri Apr 15 22:12:19 EDT 2022 x86_64 GNU/Linux
NAME="Red Hat Enterprise Linux"
VERSION="8.6 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.6"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.6 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/red_hat_enterprise_linux/8/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.6
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.6"

Version

v1.24.6+k0s.0

Sysinfo

`k0s sysinfo`
Machine ID: "a8e7bc177f1ee16a0fc1d7752bce28c89a4d4b5c4fb98b1255fe5047c522ac2a" (from machine) (pass)
Total memory: 15.6 GiB (pass)
Disk space available for /var/lib/k0s: 60.9 GiB (pass)
Operating system: Linux (pass)
  Linux kernel release: 4.18.0-372.9.1.el8.x86_64 (pass)
  Max. file descriptors per process: current: 65536 / max: 65536 (pass)
  Executable in path: modprobe: /sbin/modprobe (pass)
  /proc file system: mounted (0x9fa0) (pass)
  Control Groups: version 1 (pass)
    cgroup controller "cpu": available (pass)
    cgroup controller "cpuacct": available (pass)
    cgroup controller "cpuset": available (pass)
    cgroup controller "memory": available (pass)
    cgroup controller "devices": available (pass)
    cgroup controller "freezer": available (pass)
    cgroup controller "pids": available (pass)
    cgroup controller "hugetlb": available (pass)
    cgroup controller "blkio": available (pass)
  CONFIG_CGROUPS: Control Group support: built-in (pass)
    CONFIG_CGROUP_FREEZER: Freezer cgroup subsystem: built-in (pass)
    CONFIG_CGROUP_PIDS: PIDs cgroup subsystem: built-in (pass)
    CONFIG_CGROUP_DEVICE: Device controller for cgroups: built-in (pass)
    CONFIG_CPUSETS: Cpuset support: built-in (pass)
    CONFIG_CGROUP_CPUACCT: Simple CPU accounting cgroup subsystem: built-in (pass)
    CONFIG_MEMCG: Memory Resource Controller for Control Groups: built-in (pass)
    CONFIG_CGROUP_HUGETLB: HugeTLB Resource Controller for Control Groups: built-in (pass)
    CONFIG_CGROUP_SCHED: Group CPU scheduler: built-in (pass)
      CONFIG_FAIR_GROUP_SCHED: Group scheduling for SCHED_OTHER: built-in (pass)
        CONFIG_CFS_BANDWIDTH: CPU bandwidth provisioning for FAIR_GROUP_SCHED: built-in (pass)
    CONFIG_BLK_CGROUP: Block IO controller: built-in (pass)
  CONFIG_NAMESPACES: Namespaces support: built-in (pass)
    CONFIG_UTS_NS: UTS namespace: built-in (pass)
    CONFIG_IPC_NS: IPC namespace: built-in (pass)
    CONFIG_PID_NS: PID namespace: built-in (pass)
    CONFIG_NET_NS: Network namespace: built-in (pass)
  CONFIG_NET: Networking support: built-in (pass)
    CONFIG_INET: TCP/IP networking: built-in (pass)
      CONFIG_IPV6: The IPv6 protocol: built-in (pass)
    CONFIG_NETFILTER: Network packet filtering framework (Netfilter): built-in (pass)
      CONFIG_NETFILTER_ADVANCED: Advanced netfilter configuration: built-in (pass)
      CONFIG_NETFILTER_XTABLES: Netfilter Xtables support: built-in (pass)
        CONFIG_NETFILTER_XT_TARGET_REDIRECT: REDIRECT target support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_COMMENT: "comment" match support: module (pass)
        CONFIG_NETFILTER_XT_MARK: nfmark target and match support: module (pass)
        CONFIG_NETFILTER_XT_SET: set target and match support: module (pass)
        CONFIG_NETFILTER_XT_TARGET_MASQUERADE: MASQUERADE target support: unknown (warning: also tried CONFIG_IP_NF_TARGET_MASQUERADE, CONFIG_IP6_NF_TARGET_MASQUERADE)
        CONFIG_NETFILTER_XT_NAT: "SNAT and DNAT" targets support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_ADDRTYPE: "addrtype" address type match support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_CONNTRACK: "conntrack" connection tracking match support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_MULTIPORT: "multiport" Multiple port match support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_RECENT: "recent" match support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_STATISTIC: "statistic" match support: module (pass)
      CONFIG_NETFILTER_NETLINK: module (pass)
      CONFIG_NF_CONNTRACK: Netfilter connection tracking support: module (pass)
      CONFIG_NF_NAT: module (pass)
      CONFIG_IP_SET: IP set support: module (pass)
        CONFIG_IP_SET_HASH_IP: hash:ip set support: module (pass)
        CONFIG_IP_SET_HASH_NET: hash:net set support: module (pass)
      CONFIG_IP_VS: IP virtual server support: module (pass)
        CONFIG_IP_VS_NFCT: Netfilter connection tracking: built-in (pass)
      CONFIG_NF_CONNTRACK_IPV4: IPv4 connetion tracking support (required for NAT): unknown (warning)
      CONFIG_NF_REJECT_IPV4: IPv4 packet rejection: module (pass)
      CONFIG_NF_NAT_IPV4: IPv4 NAT: unknown (warning)
      CONFIG_IP_NF_IPTABLES: IP tables support: module (pass)
        CONFIG_IP_NF_FILTER: Packet filtering: module (pass)
          CONFIG_IP_NF_TARGET_REJECT: REJECT target support: module (pass)
        CONFIG_IP_NF_NAT: iptables NAT support: module (pass)
        CONFIG_IP_NF_MANGLE: Packet mangling: module (pass)
      CONFIG_NF_DEFRAG_IPV4: module (pass)
      CONFIG_NF_CONNTRACK_IPV6: IPv6 connetion tracking support (required for NAT): unknown (warning)
      CONFIG_NF_NAT_IPV6: IPv6 NAT: unknown (warning)
      CONFIG_IP6_NF_IPTABLES: IP6 tables support: module (pass)
        CONFIG_IP6_NF_FILTER: Packet filtering: module (pass)
        CONFIG_IP6_NF_MANGLE: Packet mangling: module (pass)
        CONFIG_IP6_NF_NAT: ip6tables NAT support: module (pass)
      CONFIG_NF_DEFRAG_IPV6: module (pass)
    CONFIG_BRIDGE: 802.1d Ethernet Bridging: module (pass)
      CONFIG_LLC: module (pass)
      CONFIG_STP: module (pass)
  CONFIG_EXT4_FS: The Extended 4 (ext4) filesystem: module (pass)
  CONFIG_PROC_FS: /proc file system support: built-in (pass)

What happened?

A Kubernetes cronjob cannot access a database service on another worker node when a network policy is used. It works fine if I run a sleep 3 command first, or if all pods are running on the same worker node.

I create a database pod and a database service, a network policy that only allows traffic inside the namespace, and a cronjob in the same namespace to test a valid connection to the database, but the pods launched by the cronjob fail without a sleep command. A delay appears to be needed because not all of the required iptables rules have been created at that moment.

Pods launched by the cronjob work fine if the network policy is not used or if all pods are running on the same worker node.

Here is a YAML file to reproduce it: complete-stack-example.txt. I've tried this YAML on minikube (simulating 2 nodes) with Kubernetes 1.24.6 and it works fine without a delay in the cronjob, but it fails on three k0s clusters.

Steps to reproduce

  1. Create a database service and pod running MySQL / MariaDB
  2. Create a network policy that denies all traffic from outside the namespace and allows all traffic inside the namespace
  3. Verify the database service is running and that it's possible to get a valid connection to the database
  4. Create a cronjob that runs the mysqladmin ping command against the database service, scheduled on a worker node other than the database pod's
  5. Verify that the pod launched by the cronjob is in error status and that the kubectl logs command shows this error:
    mysqladmin: connect to server at 'mariadb-service' failed
    error: 'Can't connect to MySQL server on 'mariadb-service' (115)'
    Check that mysqld is running and that the socket: '/var/run/mysqld/mysqld.sock' exists!
    
  6. Add a sleep 3 command to the cronjob, to run before the mysqladmin ping command, and verify that it works fine (a sketch of the resulting spec follows the output below):
    $ kubectl get po -n test
    NAME                                  READY   STATUS      RESTARTS   AGE
    mariadb-cronjob-27760917-tszxw        0/1     Completed   0          22s
    mariadb-deployment-5845bdb5c8-mhz9x   1/1     Running     0          38s

$ kubectl logs mariadb-cronjob-27760917-tszxw -n test
mysqld is alive

$ kubectl get cj mariadb-cronjob -n test -o yaml | grep -A2 containers:
containers:
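
For reference, a minimal sketch of what the cronjob from step 6 might look like. The schedule, image, and the absence of credentials here are illustrative assumptions; the actual manifest is in the attached complete-stack-example.txt.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: mariadb-cronjob
  namespace: test
spec:
  schedule: "* * * * *"            # assumed schedule: every minute
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: mariadb-ping
            image: mariadb:10.6    # assumed image
            command: ["sh", "-c"]
            # The leading sleep gives kube-router time to program the policy
            # rules for the freshly created pod before the connection attempt.
            args: ["sleep 3 && mysqladmin ping -h mariadb-service"]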

Expected behavior

Communication with the database service from cronjobs should work without a delay.

Actual behavior

A pod created by a cronjob can't connect to a database pod on another worker node without a sleep command when a network policy exists.

Screenshots and logs

$ kubectl get po -n kube-system
NAME                              READY   STATUS    RESTARTS   AGE
coredns-ddddfbd5c-bd7k6           1/1     Running   0          2d23h
coredns-ddddfbd5c-xnfb6           1/1     Running   0          2d23h
konnectivity-agent-dtbcl          1/1     Running   0          74m
konnectivity-agent-fms7w          1/1     Running   0          74m
konnectivity-agent-k2w9q          1/1     Running   0          74m
kube-proxy-5b5nh                  1/1     Running   0          2d23h
kube-proxy-9k9dk                  1/1     Running   0          2d23h
kube-proxy-zltjb                  1/1     Running   0          2d23h
kube-router-482fp                 1/1     Running   0          78d
kube-router-n54jb                 1/1     Running   0          78d
kube-router-x92j7                 1/1     Running   0          78d
metrics-server-74c967d8d4-hqzsp   1/1     Running   0          2d23h

The iptables and firewalld services are disabled on all cluster nodes, but the iptables modules are loaded:

# systemctl list-unit-files| egrep '(iptables|firewall)'
firewalld.service                          disabled 
iptables.service                           masked   

# lsmod | egrep -i '(nat|conntrack|masquerade)'
nf_conntrack_netlink    49152  0
xt_nat                 16384  59
ip6t_MASQUERADE        16384  2
ip6table_nat           16384  1
ip6_tables             32768  3 ip6table_filter,ip6table_nat,ip6table_mangle
ipt_MASQUERADE         16384  2
xt_conntrack           16384  19
nft_chain_nat          16384  7
nf_nat                 45056  5 ip6table_nat,ip6t_MASQUERADE,ipt_MASQUERADE,xt_nat,nft_chain_nat
nf_conntrack          172032  7 xt_conntrack,nf_nat,ip6t_MASQUERADE,ipt_MASQUERADE,xt_nat,nf_conntrack_netlink,ip_vs
nf_defrag_ipv6         20480  2 nf_conntrack,ip_vs
nf_defrag_ipv4         16384  1 nf_conntrack
nf_tables             180224  1926 nft_compat,nft_counter,nft_chain_nat,nft_limit
nfnetlink              16384  5 nft_compat,nf_conntrack_netlink,nf_tables,ip_set,nfnetlink_log
libcrc32c              16384  5 nf_conntrack,nf_nat,nf_tables,xfs,ip_vs

Additional context

No response

jnummelin commented 1 year ago

Could you share the network policy you use? It would help us dive into this.

A pod created by a cronjob can't connect to a database pod on another worker node without a sleep command when a network policy exists.

So they eventually are able to connect, right?

Kinda feels like there's an unexpectedly long delay in applying the network policies. Can you spot anything suspicious in the kube-router logs? kube-router is the one enforcing the policies in this case.

jsalgado78 commented 1 year ago

The network policy is included in this file: https://github.com/k0sproject/k0s/files/9775557/complete-stack-example.txt. It contains the complete YAML needed to reproduce this issue.

Copied and pasted from that file:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-network-policy
  namespace: test
spec:
  egress:
  - to:
    - podSelector: {}
  - ports:
    - port: 53
      protocol: TCP
    - port: 53
      protocol: UDP
    to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
  ingress:
  - from:
    - podSelector: {}
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress

jnummelin commented 1 year ago

thanks. somehow my 👀 missed it previously :)

jsalgado78 commented 1 year ago

I can see these errors in the kube-router logs, but these messages were generated some hours before the cronjob failed:

E1013 06:26:38.176086       1 network_policy_controller.go:283] Failed to cleanup stale ipsets: failed to delete ipset KUBE-DST-CGZWTH6XDOELKCBP due to ipset v7.15: Set cannot be destroyed: it is in use by a kernel component
E1013 06:30:32.324135       1 network_policy_controller.go:283] Failed to cleanup stale ipsets: failed to delete ipset KUBE-DST-V76DHIHZ37XZJSAY due to ipset v7.15: Set cannot be destroyed: it is in use by a kernel component
E1013 10:42:51.389132       1 network_policy_controller.go:283] Failed to cleanup stale ipsets: failed to delete ipset KUBE-DST-WVVK5VS3NEKJJHZF due to ipset v7.15: Set cannot be destroyed: it is in use by a kernel component
E1013 10:55:28.246109       1 network_policy_controller.go:283] Failed to cleanup stale ipsets: failed to delete ipset KUBE-DST-WVVK5VS3NEKJJHZF due to ipset v7.15: Set cannot be destroyed: it is in use by a kernel component
E1013 11:14:50.063221       1 network_policy_controller.go:283] Failed to cleanup stale ipsets: failed to delete ipset KUBE-DST-H2JNHIQM3YF2AZ3L due to ipset v7.15: Set cannot be destroyed: it is in use by a kernel component

There are no kube-router log entries from when the cronjob fails.

github-actions[bot] commented 1 year ago

The issue is marked as stale since no activity has been recorded in 30 days

juanluisvaladas commented 1 year ago

A Kubernetes cronjob cannot access a database service on another worker node when a network policy is used. It works fine if I run a sleep 3 command first, or if all pods are running on the same worker node.

@jnummelin @jsalgado78, actually some delay here is expected and fairly common in most NetworkPolicy implementations. At least, this happens in literally every one of the implementations that I know in detail. This is by design, and I don't think it can be fixed because of how NetworkPolicies work. The flow looks like this:

Pod created in the API
│
▼
Pod scheduled
│
▼
Actual pod creation on the node
│
▼
CNI call
│
▼
CNI plugin gets the call and assigns an IP address
│
├─► Pod definition is updated ─► SDN agents watching the API apply the rules at their destination.
│
└─► Pod network plumbing is actually created and the container runtime finishes starting the container

If we assume the constraint that the network is one big distributed switch instead of having a central switch taking care of everything (and you don't want a central switch for performance and cost reasons), the only option would be to block the network plumbing until every destination that applies rules for it is configured. This has two massive problems:

  1. An affected node could block the deployment of other apps
  2. Quite simply, there isn't a good way to handle the case where one of the pods affected by the network policy is itself still being created, and deletions wouldn't be trivial either.

So I don't think this is a real issue. I would agree if we were talking about a big delay, but 3 seconds seems fairly reasonable to me. If we were talking about a much higher value, OK, fair enough, but if it's just 3 seconds I think the application should be able to handle it.

IMO this is acceptable behavior. What do you think, @jnummelin?

jsalgado78 commented 1 year ago

This seems odd, because I've only detected this issue in recent k0s versions, maybe since 1.24.2. It worked fine in previous k0s versions, and I can't reproduce this issue on clusters running Mirantis Kubernetes Engine 3.4 (Kubernetes 1.20.x).

juanluisvaladas commented 1 year ago

Are you 100% certain that the issue can be reproduced in 1.24.2 and not in 1.24.6, both using the same OS version and hardware specs (to the extent possible in VMs)? I don't see any significant change that could trigger this.

I'm not saying it's impossible to have a regression, but we certainly need to isolate it. Could you please provision a 1.24.2 cluster, try to reproduce it, and if it doesn't happen, upgrade to 1.24.6 and see if it happens then? Just upgrade k0s; don't upgrade the kernel or any OS package.

jsalgado78 commented 1 year ago

I've just tried several k0s versions from 1.23 to 1.25.4 (using the default CNI provider, kube-router), and it fails in all of them when a pod is created by a cronjob and a network policy already exists, but it works fine when a pod is created by a cronjob without a pre-existing network policy.

It works fine in Mirantis Kubernetes Engine 3.4 (MKE uses Calico).

It also works fine in k0s when the CNI provider is Calico, so it's a kube-router issue.

juanluisvaladas commented 1 year ago

Well the fact that there isn't a regression is good.

The policies are applied, just not fast enough, which means that Calico, after a certain scale, will have the same issue. In fact, every NetworkPolicy implementation that I know of behaves this way.

Now, being honest, kube-router applies the network policies in a pretty naive way that can be optimized in many respects and which, compared to other implementations, is just slow. There is a fair amount of optimization that could be done in the code (I'm saying it can be done, not that it's easy).

Now the questions are:

  1. How much delay are we willing to accept, and at what scale? I consider 3 seconds something the application should be able to handle (add a few retries, not just a sleep 3, because today it's half a second but tomorrow it may be 4 seconds; see the sketch after this list).
  2. Is this consuming a reasonable amount of CPU? I know you didn't bring this topic up, but I can see CPU consumption as a potential issue just by reading the code.
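
For example, a small retry loop in the cronjob container (a sketch of just the container fragment, with assumed names and image, not the reporter's actual manifest) is more robust than a fixed sleep:

containers:
- name: mariadb-ping
  image: mariadb:10.6              # assumed image
  command: ["sh", "-c"]
  args:
  - |
    # Retry for up to ~10 seconds so a slow policy sync on the node
    # does not fail the job outright.
    for i in $(seq 1 10); do
      mysqladmin ping -h mariadb-service && exit 0
      sleep 1
    done
    exit 1
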
jsalgado78 commented 1 year ago

I've used a workaround with an init container in the cronjob. This init container resolves the database service name in a loop before the containers that connect to the database are launched. A single execution of nslookup in the init container is enough.
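
Roughly, the init container looks like this (a sketch with assumed names and image; DNS queries to kube-dns are permitted by the network policy above):

initContainers:
- name: wait-for-dns
  image: busybox:1.36              # assumed image
  command: ["sh", "-c"]
  args:
  - |
    # Loop until the database service name resolves; by then kube-router has
    # had time to program the policy rules for the new pod.
    until nslookup mariadb-service; do
      sleep 1
    done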

github-actions[bot] commented 1 year ago

The issue is marked as stale since no activity has been recorded in 30 days

jnummelin commented 1 year ago

@jsalgado78 I don't think there's anything we can do about this on the k0s side. kube-router has a somewhat naive way of managing the iptables rules for the network policy controller (NPC), which is known on their side too. There's an issue on the kube-router side to improve the handling of the NPC rules, see https://github.com/cloudnativelabs/kube-router/issues/1372

I'm closing this in favour of tracking the kube-router NPC work upstream, as there's nothing (known) we can do on the k0s side.