k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

Container networking is broken on hosts with default-deny iptables rules after upgrading to v1.26.3 #7203

Closed: mgabeler-lee-6rs closed 1 year ago

mgabeler-lee-6rs commented 1 year ago

Environmental Info: K3s Version:

k3s version v1.26.3+k3s1 (01ea3ff2)
go version go1.19.7

Node(s) CPU architecture, OS, and Version: Linux CENSORED 6.1.0-7-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.20-1 (2023-03-19) x86_64 GNU/Linux

Cluster Configuration:

Describe the bug:

After upgrading to k3s 1.26.x (version above) from 1.25.x, nothing would come up, even after wiping all k3s config data and starting a fresh cluster. After digging into logs, the issue traced back from "coredns not responding" to "coredns stuck waiting on the kubernetes service" to the kubernetes service failing to initialize on the first cluster start attempt and never being repaired afterwards:

E0403 15:02:19.425921 1727565 controller.go:156] Unable to perform initial Kubernetes service initialization: Service "kubernetes" is invalid: spec.clusterIPs: Invalid value: []string{"10.43.0.1"}: failed to allocate IP 10.43.0.1: cannot allocate resources of type serviceipallocations at this time

Steps To Reproduce:

Expected behavior: It should be able to start a cluster

Actual behavior: It fails to start the cluster

Additional context / logs:

brandond commented 1 year ago

E0403 15:02:19.425921 1727565 controller.go:156] Unable to perform initial Kubernetes service initialization: Service "kubernetes" is invalid: spec.clusterIPs: Invalid value: []string{"10.43.0.1"}: failed to allocate IP 10.43.0.1: cannot allocate resources of type serviceipallocations at this time

You will see this in absolutely every K3s server startup log ever. This always happens during initial cluster startup, and is resolved within milliseconds once the rest of the controllers are initialized.

--tls-san=0.0.0.0

This isn't a valid TLS SAN; you will never connect to a node using the IP 0.0.0.0.
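
For reference, --tls-san takes the real hostnames or IPs that clients will actually use to reach the apiserver, and it can be repeated. The values below are purely illustrative; substitute whatever you actually connect to:

k3s server --tls-san=10.0.0.174 --tls-san=dev-workstation.example.internal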

Please attach the complete K3s service log, as well as example commands and output showing whatever errors you're encountering. The information you've provided here doesn't give enough detail to actually discern what's going on with your environment.

mgabeler-lee-6rs commented 1 year ago

--tls-san=0.0.0.0

This isn't a valid TLS SAN; you will never connect to a node using the IP 0.0.0.0.

This is something I copied from somewhere to allow the fake cert to be accepted for different IP addresses ... I forget where I got it from, and Google is failing me right now. Removing it doesn't help (I wiped the k8s environment to start fresh just to make sure).

You will see this in absolutely every K3s server startup log ever. This always happens during initial cluster startup, and is resolved within milliseconds once the rest of the controllers are initialized.

OK, I thought it wasn't resolved, but I was looking for the service in the kube-system namespace instead of the default namespace.

Please attach the complete K3s service log, as well as example commands and output showing whatever errors you're encountering. The information you've provided here doesn't give enough detail to actually discern what's going on with your environment.

:+1: k3s.log coredns.log local-path-provisioner.log

Everything else I see failing seems to boil down to coredns and/or local-path-provisioner failing to start, and those seem to be failing because they time out trying to contact the main kubernetes api endpoint. That, at least, is notably different from the logs on a working machine running 1.25.x -- the working machine also fails to contact the api endpoint at startup, but only temporarily, after which everything works normally.

I can hit https://<my-local-ip>:6443, which seems to be where that service points, so I'm not sure what's going on with the pods.

All my other pod failures at this point are DNS failures trying to talk to the coredns pod that isn't up

brandond commented 1 year ago

Are you running k3s as a user, instead of as a systemd service? There are some odd errors in the logs about cgroups and missing containers:

E0403 15:55:39.202799 1796046 remote_runtime.go:415] "ContainerStatus from runtime service failed" err="rpc error: code = Unknown desc = Error: No such container: 66c56e2d3eed4b340690cfd59d48c7a5a61401a554aa92f9c0d95af7dee48357" containerID="66c56e2d3eed4b340690cfd59d48c7a5a61401a554aa92f9c0d95af7dee48357"
I0403 15:55:39.202806 1796046 volume_manager.go:293] "Starting Kubelet Volume Manager"
I0403 15:55:39.202814 1796046 kuberuntime_gc.go:362] "Error getting ContainerStatus for containerID" containerID="66c56e2d3eed4b340690cfd59d48c7a5a61401a554aa92f9c0d95af7dee48357" err="rpc error: code = Unknown desc = Error: No such container: 66c56e2d3eed4b340690cfd59d48c7a5a61401a554aa92f9c0d95af7dee48357"
I0403 15:55:39.202844 1796046 desired_state_of_world_populator.go:151] "Desired state populator starts to run"
E0403 15:55:39.203485 1796046 remote_runtime.go:415] "ContainerStatus from runtime service failed" err="rpc error: code = Unknown desc = Error: No such container: b1ebd4dd0190bd5a5bc3bf9c2a6e42724e2b17d19f109d2c0f5183609b3c068e" containerID="b1ebd4dd0190bd5a5bc3bf9c2a6e42724e2b17d19f109d2c0f5183609b3c068e"
I0403 15:55:39.203504 1796046 kuberuntime_gc.go:362] "Error getting ContainerStatus for containerID" containerID="b1ebd4dd0190bd5a5bc3bf9c2a6e42724e2b17d19f109d2c0f5183609b3c068e" err="rpc error: code = Unknown desc = Error: No such container: b1ebd4dd0190bd5a5bc3bf9c2a6e42724e2b17d19f109d2c0f5183609b3c068e"
E0403 15:55:39.204177 1796046 remote_runtime.go:415] "ContainerStatus from runtime service failed" err="rpc error: code = Unknown desc = Error: No such container: 61acb17030e060ee20b9e794fa848dfa68ffe1e6f56eefce3b86cf796da4bbce" containerID="61acb17030e060ee20b9e794fa848dfa68ffe1e6f56eefce3b86cf796da4bbce"
I0403 15:55:39.204195 1796046 kuberuntime_gc.go:362] "Error getting ContainerStatus for containerID" containerID="61acb17030e060ee20b9e794fa848dfa68ffe1e6f56eefce3b86cf796da4bbce" err="rpc error: code = Unknown desc = Error: No such container: 61acb17030e060ee20b9e794fa848dfa68ffe1e6f56eefce3b86cf796da4bbce"
E0403 15:55:39.204395 1796046 remote_runtime.go:415] "ContainerStatus from runtime service failed" err="rpc error: code = Unknown desc = Error: No such container: b998fc83bb0c3d68ecf09aeead4adbacdf0af5bdebd8489fbab0ef2f30dd381f" containerID="b998fc83bb0c3d68ecf09aeead4adbacdf0af5bdebd8489fbab0ef2f30dd381f"
I0403 15:55:39.204406 1796046 kuberuntime_gc.go:362] "Error getting ContainerStatus for containerID" containerID="b998fc83bb0c3d68ecf09aeead4adbacdf0af5bdebd8489fbab0ef2f30dd381f" err="rpc error: code = Unknown desc = Error: No such container: b998fc83bb0c3d68ecf09aeead4adbacdf0af5bdebd8489fbab0ef2f30dd381f"
E0403 15:55:39.205004 1796046 remote_runtime.go:415] "ContainerStatus from runtime service failed" err="rpc error: code = Unknown desc = Error: No such container: 0a643f81c5a469086a00889c11d125529bb0dba91a6a6115d3c92e01b6c0eed7" containerID="0a643f81c5a469086a00889c11d125529bb0dba91a6a6115d3c92e01b6c0eed7"
I0403 15:55:39.205021 1796046 kuberuntime_gc.go:362] "Error getting ContainerStatus for containerID" containerID="0a643f81c5a469086a00889c11d125529bb0dba91a6a6115d3c92e01b6c0eed7" err="rpc error: code = Unknown desc = Error: No such container: 0a643f81c5a469086a00889c11d125529bb0dba91a6a6115d3c92e01b6c0eed7"

K3s as a whole seems to be running in your user slice instead of in a dedicated slice for a systemd service unit?

I0403 15:55:39.312502 1796046 container_manager_linux.go:626] "Failed to ensure state" containerName="/k3s" err="failed to find container of PID 1796046: cpu and memory cgroup hierarchy not unified.  cpu: /user.slice, memory: /user.slice/user-1000.slice/user@1000.service"

mgabeler-lee-6rs commented 1 year ago

Yes, running this "by hand", but still as root (sudo k3s ...). This used to work OK, and fits a little better in our developer environments as we can more directly manage it as part of the rest of the dev stack. Is that no longer a viable option as of 1.26?

brandond commented 1 year ago

I'm not sure that it's strictly related, but it probably isn't great either. If you've got docker using the systemd cgroup manager, the kubelet will want to do the same, but it can't because it's all running under your user slice instead of in a dedicated system slice. Kubernetes in general is moving heavily towards using the systemd cgroup manager and cgroupv2, and just running k3s via sudo from a shell does not allow it to do that.
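
For reference, the install script normally generates a dedicated unit for this; an abbreviated sketch of what such a unit looks like (the real generated k3s.service carries more limits and environment settings, and the path assumes the default install location):

[Unit]
Description=Lightweight Kubernetes
After=network-online.target

[Service]
Type=notify
ExecStart=/usr/local/bin/k3s server
# Delegate=yes hands cgroup management below this unit to k3s/the kubelet
Delegate=yes
KillMode=process
LimitNOFILE=1048576
Restart=always
RestartSec=5s

[Install]
WantedBy=multi-user.target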

Have you confirmed that you don't have any iptables rules (managed via ufw/firewalld/etc) that might be interfering with things?

mgabeler-lee-6rs commented 1 year ago

We can't run cgroups v2 because our workload requires the ability to run Ubuntu 18.04 os-containers, and that doesn't have a new enough version of systemd to work that way.

I checked iptables rules, yes, and the only ones that stuck out as different between the working and non-working systems were some REJECTs in KUBE-SERVICES ... for the coredns and other services that weren't ready yet due to the kube api server issue, annotated as "has no endpoints", so that part made sense.

I'll try running k3s under a normal systemd unit and see if it makes things better.

mgabeler-lee-6rs commented 1 year ago

That didn't help, here's a fresh log from that: k3s-systemd.log

mgabeler-lee-6rs commented 1 year ago

I also tried switching away from --docker just to try to isolate issues, it didn't help

brandond commented 1 year ago

the only ones that stuck out as different between the working and non-working systems were some REJECTs in KUBE-SERVICES ... for the coredns and other services that weren't ready yet due to the kube api server issue, annotated as "has no endpoints", so that part made sense.

Can you show the output of kubectl get service,endpoints,networkpolicy -A -o wide ?

If you're dealing with a single-node cluster, the only thing that might be interfering with access to the in-cluster kubernetes endpoint would be something else inserting conflicting iptables rules, or perhaps otherwise mucking with the container network interfaces.

mgabeler-lee-6rs commented 1 year ago

If you're dealing with a single-node cluster, the only thing that might be interfering with access to the in-cluster kubernetes endpoint would be something else inserting conflicting iptables rules, or perhaps otherwise mucking with the container network interfaces.

This was my thought too, but I was struggling with how to see the iptables rules that apply within the network namespaces, especially since the coredns image doesn't even have /bin/sh. I guess I'd need to create a pod running in privileged mode / with cap_net_admin enabled for the iptables binary to be able to see things? Would that be helpful?
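
One way to look inside a pod's network namespace without building a privileged pod is to enter it from the host with nsenter; a sketch assuming the docker runtime in use here, with the container ID as a placeholder:

# find the PID of the pod's container, then run host tools inside its network namespace
PID=$(sudo docker inspect -f '{{.State.Pid}}' <coredns-container-id>)
sudo nsenter -t "$PID" -n iptables -S   # iptables rules as seen inside the pod netns
sudo nsenter -t "$PID" -n ip addr       # interfaces/addresses inside the pod netns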

Can you show the output of kubectl get service,endpoints,networkpolicy -A -o wide ?

E0403 17:45:08.289309 1924749 memcache.go:287] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0403 17:45:08.294308 1924749 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0403 17:45:08.295147 1924749 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0403 17:45:08.298071 1924749 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0403 17:45:08.299441 1924749 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0403 17:45:08.300804 1924749 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0403 17:45:08.302003 1924749 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0403 17:45:08.302932 1924749 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
NAMESPACE     NAME                     TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)                  AGE   SELECTOR
default       service/kubernetes       ClusterIP   10.43.0.1     <none>        443/TCP                  20s   <none>
kube-system   service/kube-dns         ClusterIP   10.43.0.10    <none>        53/UDP,53/TCP,9153/TCP   17s   k8s-app=kube-dns
kube-system   service/metrics-server   ClusterIP   10.43.91.45   <none>        443/TCP                  16s   k8s-app=metrics-server

NAMESPACE     NAME                       ENDPOINTS         AGE
default       endpoints/kubernetes       10.0.0.174:6443   20s
kube-system   endpoints/kube-dns         <none>            5s
kube-system   endpoints/metrics-server                     5s

brandond commented 1 year ago

The CNI- and kubelet-managed iptables rules are all in the host namespace; I wouldn't expect you to see any in the pods.

What version of Debian and Docker are you running? Have you customized the system configuration in any particularly interesting ways? What do you see if you run ip addr in a pod?

I regularly run k3s on Ubuntu 22.04 (it's my primary development OS), both with containerd and with docker, and have not seen any issues, nor have I seen anyone else report similar issues, so I suspect there's something else going on with your OS configuration.

mgabeler-lee-6rs commented 1 year ago

The host system is Debian bookworm (almost ready to become the new stable release, but technically still "testing"). I'll try on an Ubuntu 22.04 VM to see if that makes a difference. Docker is 20.10.23+dfsg1 (from the Debian docker.io package). Containerd is containerd github.com/containerd/containerd 1.6.18~ds1 1.6.18~ds1-1+b2 in case that's relevant.

I haven't customized this system much, no, installing k3s is the most "interesting" thing I've done to it regarding networking setups. I have these saved iptables rules for a basic "anything outgoing, nothing incoming except LAN subnets" config, and the KUBE rules all get added ahead of this in iptables:

*filter
:INPUT DROP
:FORWARD DROP
-A INPUT -i lo -j ACCEPT
-A INPUT -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A INPUT -i wg+ -j ACCEPT
-A INPUT -s 10.0.0.0/8 -i wl+ -j ACCEPT
-A INPUT -s 10.0.0.0/8 -i en+ -j ACCEPT
-A INPUT -i docker+ -j ACCEPT
-A INPUT -i br-+ -j ACCEPT
COMMIT

Indeed, running iptables from a privileged pod showed nothing. ip addr info from that pod:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: eth0@if106: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default 
    link/ether ae:c0:bd:8e:dd:6c brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.42.0.10/24 brd 10.42.0.255 scope global eth0
       valid_lft forever preferred_lft forever

mgabeler-lee-6rs commented 1 year ago

Tried a few more things to isolate the issue:

So, there's a much smaller range to bisect here of where things might have gone wrong

Edit: continuing the bisect, 1.25.8-rc1+k3s1 is also showing the problem, so I guess further bisecting will require building from source, which I think I can do

brandond commented 1 year ago

1.25.7+k3s1: !! works !! 1.25.7-rc1+k3s1 is also showing the problem

There was no second RC of 1.25.7+k3s1, so these tags point to the same commit...

commit f7c20e237d0ad0eae83c1ce60d490da70dbddc0e (tag: v1.25.7-rc1+k3s1, tag: v1.25.7+k3s1)
Author: Matt Trachier <matt.trachier@suse.com>
Date:   Wed Mar 1 15:29:10 2023 -0600

    Update to v1.25.7-k3s1 (#7010)

    * Update to v1.25.7
    * update gh workflows and docker files to proper go version
    ---------
    Signed-off-by: matttrach <matttrach@gmail.com>

mgabeler-lee-6rs commented 1 year ago

~building from source is erroring out with some checks in scripts/version.sh~ Edit: figured out what was going wrong here and worked around it

There was no second RC of 1.25.7+k3s1, so these tags point to the same commit...

git log shows me:

commit 6c5ac02248834a4d59501f7f31404d1287e358db (tag: v1.25.8-rc2+k3s1, tag: v1.25.8+k3s1)
Author: Roberto Bonafiglia <roberto.bonafiglia@suse.com>
Date:   Wed Mar 22 15:50:03 2023 +0100

    Update flannel to fix NAT issue with old iptables version

    Signed-off-by: Roberto Bonafiglia <roberto.bonafiglia@suse.com>

Edit: sorry, had a typo, it was the 1.25.8-rc1 that failed for me, 1.25.7 is passing.

brandond commented 1 year ago

The primary differences likely to impact you are updates to flannel and kube-router, the rest of the stuff in there isn't going to make much difference.

Just out of curiosity, you might try starting k3s with --prefer-bundled-bin, on the off chance there are some problems with the version of iptables your hosts have?
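
For example, a hypothetical invocation mirroring how k3s is being run in this thread:

sudo k3s server --docker --prefer-bundled-bin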

mgabeler-lee-6rs commented 1 year ago

Just out of curiosity, you might try starting k3s with --prefer-bundled-bin, on the off chance there are some problems with the version of iptables your hosts have?

Tried with v1.25.8-rc1, didn't help.

The primary differences likely to impact you are updates to flannel and kube-router, the rest of the stuff in there isn't going to make much difference.

My thought too, so I started my bisect with the first PR after 1.25.7 (#7061), building from commit f5d1f976d3727f2a62ea536dca91e0acebf98bdf ... but that fails to bring up the node; I think something must be wrong with the build. It keeps logging this in the k3s output:

time="2023-04-03T18:55:03-04:00" level=info msg="Waiting to retrieve agent configuration; server is not ready: failed to find host-local: exec: \"host-local\": executable file not found in $PATH"

Somehow it never set up /var/lib/rancher/k3s/data when running the source build ... I think I need to run make package-cli instead of make in order to generate the fully bundled binary?

brandond commented 1 year ago

If you're going to try to build from source, I would recommend doing so on a host with Docker, and just do git clean -xffd && SKIP_VALIDATE=true make ci

brandond commented 1 year ago

Try the prefer-bundled-bin flag with a recent release before you go building stuff. Very little of the stuff you're poking at is in k3s itself, it's likely somewhere in one of the flannel module updates.

mgabeler-lee-6rs commented 1 year ago

@brandond I did try that with the earliest tagged release that's failing for me, it didn't help.

mgabeler-lee-6rs commented 1 year ago

v1.25.8-rc1 and f5d1f976d3727f2a62ea536dca91e0acebf98bdf (first commit after 1.25.7) both fail with this error, including with --prefer-bundled-bin

Just to validate my source builds, v1.25.7 built from source, like the copy downloaded from github, works

Now trying to downgrade flannel and/or kube-router to see if I can isolate it to one of those dependencies.

mgabeler-lee-6rs commented 1 year ago

v1.26.3+k3s1 on a clean Ubuntu 22.04 VM: works. The same on a clean Debian bookworm VM: works ... so it's something with my local system.

Starting from v1.25.8+k3s1:

brandond commented 1 year ago

Check your general system logs, do you have anything that's mucking about with the docker container interfaces when they are added? I have seen odd behavior from avahi adding multicast listeners to container interfaces, for example.

downgrading kube-router (to v1.5.2-0.20221026101626-e01045262706 from before https://github.com/k3s-io/k3s/pull/7061) does make things work

Can you try running k3s with --disable-network-policy?

cc @rbrtbnfgl @thomasferrandiz - this may be more weirdness with the new v2.0.0 release of kube-router. As per the above output there are no network policies in place, but for some reason pods can't reach the in-cluster Kubernetes service endpoint.

mgabeler-lee-6rs commented 1 year ago

I have seen odd behavior from avahi adding multicast listeners to container interfaces, for example.

I do have avahi running, but it's also on the "clean vm". The problem workstation has it configured to not listen on the docker interfaces, whereas the clean vm has the default config where it does listen there.

Can you try running k3s with --disable-network-policy?

Tried this back on v1.26.3, no luck.

Check your general system logs, do you have anything that's mucking about with the docker container interfaces when they are added?

Rummaging ...

brandond commented 1 year ago

Can you try running k3s with --disable-network-policy?

Tried this back on v1.26.3, no luck.

Hmm. All you did was revert the kube-router version and it works, but you can't use k3s with the updated kube-router even if it's disabled? That doesn't make any sense to me; if you disable it, we don't run any of the affected code. Period. Maybe try disabling it on a fresh install/reboot, on the off chance there are some rules being left behind?

Perhaps spend a bit more time trying to figure out what about your machine makes it unique from the other nodes you were unable to reproduce on?
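
If k3s was installed via the install script, the bundled cleanup script is one way to make sure no stale rules or interfaces survive between attempts (a suggestion, assuming the default install path):

# stops k3s pods/containers, removes the CNI interfaces, and flushes k3s-created iptables chains
sudo /usr/local/bin/k3s-killall.sh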

mgabeler-lee-6rs commented 1 year ago

Can you try running k3s with --disable-network-policy?

Tried this back on v1.26.3, no luck.

Hmm. All you did was revert the kube-router version and it works, but you can't use k3s with the updated kube-router even if it's disabled?

On v1.25.8, yes. It also required reverting the associated changes to pkg/agent/netpol/netpol.go, but that seems like it should be uninteresting in this context.

That doesn't make any sense to me, if you disable it we don't run any of the affected code. Period. Maybe try disabling it on a fresh install/reboot, on the off chance there are some rules being left behind?

Will check that, yeah. I rebooted once in this process to check that, but haven't since. Each time I switch versions I am fully stopping k3s, and stopping/deleting all the pods & associated containers, and doing all the "rm -rf" type stuff the uninstall script does, so it's a pretty fresh start each time.

Perhaps spend a bit more time trying to figure out what about your machine makes it unique from the other nodes you were unable to reproduce on?

Yep, I'm working on this in the background here, trying to disable various services & configurations & such to try and get to a working state and then turn things back on one by one.

I would also love to find out where the packets destined for the api server are going / getting dropped. Starting on that road, wireshark on the host listening to cni0 doesn't see any SYN packets coming in to 10.42.0.1 when just the coredns pod is running.

But, if I start a random other pod with a usable shell (I started a jobs.batch just running debian:stable with a sleep so I can exec in), work out which veth... interface corresponds to it, and have it try to connect to that same ip/port, I can see the packets in wireshark on both the veth... and cni0 interfaces. Still no responses.

So I added a final rule to my iptables INPUT chain to log everything, and I see these coming through:

IN=cni0 OUT= MAC=4a:5e:56:e2:d6:a4:1e:e2:6d:be:07:dd:08:00 SRC=10.42.0.2 DST=10.0.0.174 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=24507 DF PROTO=TCP SPT=60064 DPT=6443 WINDOW=64860 RES=0x00 SYN URGP=0
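
(The catch-all logging rule was along these lines, reconstructed here for reference; it is appended so that anything about to fall through to the chain's DROP policy gets logged first:)

iptables -A INPUT -j LOG --log-prefix "input drop "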

And part of this finally clicks. This workstation, because it does have some firewalling enabled (corporate policies), has the INPUT chain configured with a default policy of DROP. It has rules to accept local traffic from the "normal" interfaces, but cni0 is not part of those rules. The "clean VM" isn't corporate-ized, and so the INPUT chain's default policy is ACCEPT.

So now the question is: what's different about the iptables rules between the two versions that causes packets to be ACCEPTed somewhere before this point on the old version, but fall through to the DROP policy here on the new one? I'm going to wager that it somehow is the interface name, and that on the old version the packets "appear" on eno2 (my default network interface, which holds the 10.0.0.174 address above), instead of cni0. I expect I can work around this by adding an accept rule for cni0, but I'd like to understand more of the why, since it may need a more robust workaround/fix for my coworkers.
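
(For the record, the workaround being considered amounts to a one-line exception for the CNI bridge, in illustrative form:)

iptables -I INPUT -i cni0 -j ACCEPT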

I'm at the end of my day today, I will follow up with findings/results tomorrow.

brandond commented 1 year ago

And part of this finally clicks. This workstation, because it does have some firewalling enabled (corporate policies), has the INPUT chain configured with a default policy of DROP.

See, that's what I was fishing for when I asked:

Have you confirmed that you don't have any iptables rules (managed via ufw/firewalld/etc) that might be interfering with things?

mgabeler-lee-6rs commented 1 year ago

See, that's what I was fishing for when I asked:

Yeah, just took me a moment to see it, been a little while since I poked at this stuff and forgot about the chain policy vs. having a default-drop rule at the end of the chain.


Digging further into the networking, it is sort of a cni0 thing. Something in the kube-router upgrade changed the generated iptables rules in a way that makes the new setup dependent on the host's INPUT chain having a default-ACCEPT policy.

Comparing iptables-save from 1.25.7 vs 1.25.8, the only thing that immediately jumps out at me is a change in the placement of -j FLANNEL-FWD in the FORWARD chain, and a similar change to -j FLANNEL-POSTRTG in nat/POSTROUTING. The former moved from early in the chain, between the KUBE-ROUTER-FORWARD and KUBE-PROXY-FIREWALL rules, to being the very last rule in the FORWARD chain. Similarly, the latter moved from between CNI-HOSTPORT-MASQ and KUBE-POSTROUTING to being the last rule in its chain.

Since that didn't immediately make things obvious, I used the nft trace system to gather more detailed data.
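
(Roughly how such a trace can be captured; the exact match expression below is illustrative, the mechanism is to set nftrace on the packets of interest and then watch the trace events:)

sudo nft insert rule ip filter INPUT iif "cni0" tcp dport 6443 meta nftrace set 1
sudo nft monitor trace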

Key part of the trace for a packet on 1.25.7:

trace id 85f6def5 ip filter INPUT packet: iif "cni0" ether saddr 8a:e5:38:16:ba:8d ether daddr 6e:bc:51:89:9d:05 ip saddr 10.42.0.6 ip daddr 10.0.0.174 ip dscp cs0 ip ecn not-ect ip ttl 64 ip id 64261 ip length 52 tcp sport 39524 tcp dport 6443 tcp flags == ack tcp window 710 
trace id 85f6def5 ip filter INPUT rule  counter packets 9387 bytes 7334806 jump KUBE-ROUTER-INPUT (verdict jump KUBE-ROUTER-INPUT)
trace id 85f6def5 ip filter KUBE-ROUTER-INPUT rule ip saddr 10.42.0.6  counter packets 935 bytes 305298 jump KUBE-POD-FW-RBX4FCSO3CUKYMCM (verdict jump KUBE-POD-FW-RBX4FCSO3CUKYMCM)
trace id 85f6def5 ip filter KUBE-POD-FW-RBX4FCSO3CUKYMCM rule  ct state related,established counter packets 2094 bytes 444756 accept (verdict accept)

Same portion of the trace for a packet on 1.25.8:

trace id 4498137c ip filter INPUT packet: iif "cni0" ether saddr d6:92:90:b0:15:1a ether daddr 76:c1:27:59:c9:eb ip saddr 10.42.0.6 ip daddr 10.0.0.174 ip dscp cs0 ip ecn not-ect ip ttl 64 ip id 4315 ip length 60 tcp sport 43500 tcp dport 6443 tcp flags == syn tcp window 64860 
trace id 4498137c ip filter INPUT rule  counter packets 3632 bytes 2970461 jump KUBE-ROUTER-INPUT (verdict jump KUBE-ROUTER-INPUT)
trace id 4498137c ip filter KUBE-ROUTER-INPUT rule ip saddr 10.42.0.6  counter packets 4 bytes 240 jump KUBE-POD-FW-5LYEVMXKAM6TCRT3 (verdict jump KUBE-POD-FW-5LYEVMXKAM6TCRT3)
trace id 4498137c ip filter KUBE-POD-FW-5LYEVMXKAM6TCRT3 rule ip saddr 10.42.0.6  counter packets 4 bytes 240 jump KUBE-NWPLCY-DEFAULT (verdict jump KUBE-NWPLCY-DEFAULT)
trace id 4498137c ip filter KUBE-NWPLCY-DEFAULT rule  counter packets 21 bytes 1296 meta mark set mark or 0x10000 (verdict continue)
trace id 4498137c ip filter KUBE-NWPLCY-DEFAULT verdict continue meta mark 0x00010000 
... runs through inapplicable rules for other pods/etc
trace id 4498137c ip filter KUBE-ROUTER-INPUT verdict return meta mark 0x00020000 
trace id 4498137c ip filter INPUT rule ct state new  counter packets 94 bytes 16173 jump KUBE-PROXY-FIREWALL (verdict jump KUBE-PROXY-FIREWALL)
trace id 4498137c ip filter KUBE-PROXY-FIREWALL verdict continue meta mark 0x00020000 
trace id 4498137c ip filter INPUT rule  counter packets 3498 bytes 2960971 jump KUBE-NODEPORTS (verdict jump KUBE-NODEPORTS)
trace id 4498137c ip filter KUBE-NODEPORTS verdict continue meta mark 0x00020000 
trace id 4498137c ip filter INPUT rule ct state new  counter packets 94 bytes 16173 jump KUBE-EXTERNAL-SERVICES (verdict jump KUBE-EXTERNAL-SERVICES)
trace id 4498137c ip filter KUBE-EXTERNAL-SERVICES verdict continue meta mark 0x00020000 
trace id 4498137c ip filter INPUT rule counter packets 3498 bytes 2960971 jump KUBE-FIREWALL (verdict jump KUBE-FIREWALL)
trace id 4498137c ip filter KUBE-FIREWALL verdict continue meta mark 0x00020000 

Since this mentions the network policy, I ran the same thing again with 1.25.8, but with --disable-network-policy:

trace id d036fa9d ip filter INPUT packet: iif "cni0" ether saddr 6e:db:71:27:fd:05 ether daddr 46:67:7e:96:d8:9d ip saddr 10.42.0.2 ip daddr 10.0.0.174 ip dscp cs0 ip ecn not-ect ip ttl 64 ip id 3938 ip length 60 tcp sport 52258 tcp dport 6443 tcp flags == syn tcp window 64860 
trace id d036fa9d ip filter INPUT rule ct state new  counter packets 101 bytes 7292 jump KUBE-PROXY-FIREWALL (verdict jump KUBE-PROXY-FIREWALL)
trace id d036fa9d ip filter KUBE-PROXY-FIREWALL verdict continue 
trace id d036fa9d ip filter INPUT rule  counter packets 8921 bytes 6841389 jump KUBE-NODEPORTS (verdict jump KUBE-NODEPORTS)
trace id d036fa9d ip filter KUBE-NODEPORTS verdict continue 
trace id d036fa9d ip filter INPUT rule ct state new  counter packets 101 bytes 7292 jump KUBE-EXTERNAL-SERVICES (verdict jump KUBE-EXTERNAL-SERVICES)
trace id d036fa9d ip filter KUBE-EXTERNAL-SERVICES verdict continue 
trace id d036fa9d ip filter INPUT rule counter packets 11552 bytes 7956318 jump KUBE-FIREWALL (verdict jump KUBE-FIREWALL)
trace id d036fa9d ip filter KUBE-FIREWALL verdict continue 
trace id d036fa9d ip filter INPUT rule counter packets 25 bytes 1500 log prefix "input drop " (verdict continue)
trace id d036fa9d ip filter INPUT verdict continue 
trace id d036fa9d ip filter INPUT policy drop 

So, disabling the policy does remove a bunch of the stuff from the iptables rules, but it doesn't revert things fully to the old state. I think the with-network-policy ruleset has nearly what it should, but it seems the accept rule in some chains isn't right?

From the end of the KUBE-ROUTER-INPUT chain in 1.25.7:

-A KUBE-ROUTER-INPUT -m comment --comment "rule to explicitly ACCEPT traffic that comply to network policies" -m mark --mark 0x20000/0x20000 -j ACCEPT

The same rule from 1.25.8:

-A KUBE-ROUTER-INPUT -m comment --comment "rule to explicitly ACCEPT traffic that comply to network policies" -m mark --mark 0x20000/0x20000 -j RETURN

The latter says it's there to ACCEPT, but it actually does RETURN, and I think this is, at the end of the day, the crux of the issue?
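
(A quick way to check which variant a given install carries:)

sudo iptables-save | grep KUBE-ROUTER-INPUT | grep 0x20000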

At least this now easily matches some code, and I can git blame this down to a specific commit/PR: #50, which was listed to fix #6691. Walking through linked items I came to this comment which seems relevant: https://github.com/cloudnativelabs/kube-router/issues/1453#issuecomment-1493595260

rbrtbnfgl commented 1 year ago

We changed kube-router's behaviour to not ACCEPT packets by default. @brandond, is it up to the user to configure the chain properly, disabling any firewall on the node?

brandond commented 1 year ago

@rbrtbnfgl a couple questions:

  1. Are you saying we've patched our version of kube-router to RETURN when upstream's current position is that they should ACCEPT at the end of the chain? Even though it's fixing something else, it is clearly a breaking change for some users, and we should have been more transparent about that in the release notes.
  2. Can we confirm that the KUBE-ROUTER-INPUT chain is cleaned up properly when the network policy controller is disabled? It feels like we should ensure its absence when the NPC is disabled, as opposed to leaving it around with rules that might interfere with normal operation of the node.

Edit: To answer my first question, I see that the commit at https://github.com/k3s-io/kube-router/commit/df90811446a19e1922a4d7faa226d926b476b0ae changes this. We need to more explicitly call these things out if we're going to change them; we can't hide this in a "version bump". Also, it feels like the comment on that rule needs to be updated; it still claims to be accepting.

I think this is probably a fine change for us to keep, we just need to be better about exposing this sort of stuff in our release notes.

rbrtbnfgl commented 1 year ago

Yes, maybe my commit message could have been much more explanatory.

ShylajaDevadiga commented 1 year ago

Commit id from master branch 027cc187ce9f21157b8d37d62e67ee1c42968b4b

Environment Details

Infrastructure: Cloud EC2 instance

Node(s) CPU architecture, OS, and Version: Ubuntu 20.04

Cluster Configuration: Single node

Config.yaml:

cat /etc/rancher/k3s/config.yaml

Steps to reproduce the issue

  1. Install k3s with default config
    curl -fL https://get.k3s.io | INSTALL_K3S_VERSION=v1.25.8+k3s1 sh -s - server
  2. Check iptables rules with install and upgrade scenario

Results from reproducing on v1.25.8+k3s1:

$ k3s -v
k3s version v1.25.8+k3s1 (6c5ac022)

$ sudo iptables-save |grep network |grep ROUTER
-A KUBE-ROUTER-FORWARD -m comment --comment "rule to explicitly ACCEPT traffic that comply to network policies" -m mark --mark 0x20000/0x20000 -j RETURN
-A KUBE-ROUTER-INPUT -m comment --comment "rule to explicitly ACCEPT traffic that comply to network policies" -m mark --mark 0x20000/0x20000 -j RETURN
-A KUBE-ROUTER-OUTPUT -m comment --comment "rule to explicitly ACCEPT traffic that comply to network policies" -m mark --mark 0x20000/0x20000 -j RETURN

Results from commit on master branch:

$ sudo iptables-save |grep network |grep ROUTER
-A INPUT -m comment --comment "KUBE-ROUTER rule to explicitly ACCEPT traffic that comply to network policies" -m mark --mark 0x20000/0x20000 -j ACCEPT
-A FORWARD -m comment --comment "KUBE-ROUTER rule to explicitly ACCEPT traffic that comply to network policies" -m mark --mark 0x20000/0x20000 -j ACCEPT
-A OUTPUT -m comment --comment "KUBE-ROUTER rule to explicitly ACCEPT traffic that comply to network policies" -m mark --mark 0x20000/0x20000 -j ACCEPT

Results from upgrade from v1.25.8+k3s1 to commit on master branch:

$ sudo iptables-save |grep network |grep ROUTER
-A INPUT -m comment --comment "KUBE-ROUTER rule to explicitly ACCEPT traffic that comply to network policies" -m mark --mark 0x20000/0x20000 -j ACCEPT
-A FORWARD -m comment --comment "KUBE-ROUTER rule to explicitly ACCEPT traffic that comply to network policies" -m mark --mark 0x20000/0x20000 -j ACCEPT
-A OUTPUT -m comment --comment "KUBE-ROUTER rule to explicitly ACCEPT traffic that comply to network policies" -m mark --mark 0x20000/0x20000 -j ACCEPT
-A KUBE-ROUTER-FORWARD -m comment --comment "rule to explicitly ACCEPT traffic that comply to network policies" -m mark --mark 0x20000/0x20000 -j RETURN
-A KUBE-ROUTER-INPUT -m comment --comment "rule to explicitly ACCEPT traffic that comply to network policies" -m mark --mark 0x20000/0x20000 -j RETURN
-A KUBE-ROUTER-OUTPUT -m comment --comment "rule to explicitly ACCEPT traffic that comply to network policies" -m mark --mark 0x20000/0x20000 -j RETURN

brandond commented 1 year ago

@mgabeler-lee-6rs are you able to test K3s as installed from: curl -sL get.k3s.io | INSTALL_K3S_COMMIT=027cc187ce9f21157b8d37d62e67ee1c42968b4b sh -s -

I'd like to confirm that the fix from https://github.com/k3s-io/kube-router/pull/56 fixes your use case.

mgabeler-lee-6rs commented 1 year ago

I should be able to test that tomorrow, yes :+1:

HOSTED-POWER commented 1 year ago

Unfortunately it doesn't seem to work here. I'll attach the full iptables output. (Going back to 1.25.7 makes it work immediately.)

iptbables-debug.txt

rbrtbnfgl commented 1 year ago

The issue is your LOGDROPIN chain: it drops everything. The rule to ACCEPT is in place, but it comes after that one in the chain.

HOSTED-POWER commented 1 year ago

OK, but how do I fix it? It worked before.

I have no control over most of the rules since they are generated automatically, however I can add manual rules with a pre and post script...

rbrtbnfgl commented 1 year ago

Could you try iptables -vnL to check the counters of the matched packets?

HOSTED-POWER commented 1 year ago

iptbables-debug.txt

rbrtbnfgl commented 1 year ago

It's as I suspected: the LOGDROPIN chain is dropping everything. Why use a chain to drop the packets when the INPUT default policy is DROP? You were lucky that it was working before on your setup; if you had used kubeadm or RKE2 to set up Kubernetes, it wouldn't have worked either.

HOSTED-POWER commented 1 year ago

I have no idea, it's a commonly used firewall software script that generates this.

I can make pre/post rules: https://tecadmin.net/add-custom-iptables-rules-with-csf/

But I'd prefer a well supported and simple manner to open the ports.

BTW I created this as a post script, but it seems a lot and I'm afraid something might still be broken:

iptables -I INPUT -s 10.43.0.0/16 -j ACCEPT
iptables -I INPUT -s 10.42.0.0/16 -j ACCEPT
iptables -I INPUT -d 10.43.0.0/16 -j ACCEPT
iptables -I INPUT -d 10.42.0.0/16 -j ACCEPT
iptables -I OUTPUT -d 10.43.0.0/16 -j ACCEPT
iptables -I OUTPUT -d 10.42.0.0/16 -j ACCEPT
iptables -I OUTPUT -s 10.43.0.0/16 -j ACCEPT
iptables -I OUTPUT -s 10.42.0.0/16 -j ACCEPT

rbrtbnfgl commented 1 year ago

K3s can't account for every possible configuration a user could apply on the node. I think adding those rules at the beginning of the chain could interfere with kube-proxy's work. The firewall script has to be changed to properly accept the traffic on the needed ports.
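
As a sketch of what that could look like for a single-node setup using the default K3s pod and service CIDRs (the exact port list depends on the deployment; these are only the obvious ones for this thread):

iptables -I INPUT -p tcp --dport 6443 -j ACCEPT   # Kubernetes API server
iptables -I INPUT -s 10.42.0.0/16 -j ACCEPT       # pod CIDR (K3s default)
iptables -I INPUT -s 10.43.0.0/16 -j ACCEPT       # service CIDR (K3s default)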

HOSTED-POWER commented 1 year ago

It's been working for a long, long time. I suffered with docker from time to time, and I was delighted that k3s worked so well.

Can't you add the old behavior with some flag? It was perfect for us :)

rbrtbnfgl commented 1 year ago

You could try to configure the firewall script to start after K3s, so that the K3s iptables rules are added before the rules created by the firewall.

HOSTED-POWER commented 1 year ago

It wouldn't work, since the ConfigServer rules are (auto) reloaded from time to time. If k3s provided a script, that could be called BEFORE the other rules are loaded.

rbrtbnfgl commented 1 year ago

That's not an issue, because I think those rules are appended; so if K3s is already started with all its rules in place, they are always loaded after the K3s rules.

HOSTED-POWER commented 1 year ago

It resets them; I'm pretty sure about this, since I had some issues with docker rules. I will really miss the old behavior :(

I'll need to open all the ranges manually then I suppose.

rbrtbnfgl commented 1 year ago

I think that LOGDROPIN chain is redundant, considering that you have DROP as the default policy.

HOSTED-POWER commented 1 year ago

I don't think it's redundant; it's a chain used for logging. If you follow the INPUT chain, you end up there. In any case, I have no control over this. I also found a past issue with ConfigServer not blocking traffic in all cases.

rbrtbnfgl commented 1 year ago

It's not only logging; it's dropping all the packets that arrive there.