cwayne18 closed this issue 1 year ago.
I think the problem is the lack of iptables. We don't enable full kube-proxy replacement in Cilium by default, and the IPVS backend is not used either, so without iptables, Services are not functional. The Cilium logs clearly show that there is no connection to the kube-apiserver Service.
I would try disabling kube-proxy and enabling full kube-proxy replacement in Cilium.
To do that, I would change /etc/rancher/rke2/config.yaml to disable kube-proxy:
cluster-cidr: 10.220.0.0/16
service-cidr: 10.221.0.0/16
cni: cilium
disable:
  - rke2-kube-proxy
kube-apiserver-arg:
  - anonymous-auth=true
kube-scheduler-arg:
  - address=0.0.0.0
kube-controller-manager-arg:
  - address=0.0.0.0
node-label:
  - "cluster=mgt"
selinux: false
server: https://rke-mgt-01.css.ch:9345
system-default-registry: artifactory.css.ch
token: AzELaz7f2ny7pm4CfwbT8tWEVAK7T1XXXXXXXXXXXXXXXXXXXOSUMw00QaYP7kX9X1BtwH
tls-san:
  - rke-mgt-01.css.ch
  - rke-mgt-api.css.ch
profile: cis-1.6
audit-policy-file: /etc/rancher/rke2/audit-policy.yaml
on the first manager node:
cluster-cidr: 10.220.0.0/16
service-cidr: 10.221.0.0/16
cni: cilium
disable:
  - rke2-kube-proxy
kube-apiserver-arg:
  - anonymous-auth=true
kube-scheduler-arg:
  - address=0.0.0.0
kube-controller-manager-arg:
  - address=0.0.0.0
node-label:
  - "cluster=mgt"
selinux: false
system-default-registry: artifactory.css.ch
token: AzELaz7f2ny7pm4CfwbT8tWEVAK7T1LnBZKHyXXXXXXXXXXXSUMw00QaYP7kX9X1BtwH
tls-san:
  - rke-mgt-01.css.ch
  - rke-mgt-api.css.ch
profile: cis-1.6
And then modify the Cilium config to enable kube-proxy replacement:
rkeConfig:
  chartValues:
    rke2-cilium:
      cilium:
        hubble:
          metrics:
            enabled:
              - dns:query;ignoreAAAA
              - drop
              - tcp
              - flow
              - icmp
              - http
          relay:
            enabled: true
            image:
              repository: cilium/hubble-relay
              tag: v1.10.4
          ui:
            backend:
              image:
                repository: cilium/hubble-ui-backend
                tag: v0.8.0
            enabled: true
            frontend:
              image:
                repository: cilium/hubble-ui
                tag: v0.8.0
            ingress:
              annotations: {}
              enabled: true
              hosts:
                - hubble-dev.css.ch
              tls:
                - hosts:
                    - hubble-dev.css.ch
                  secretName: tls-certificates-dev-hubble
            proxy:
              image:
                repository: envoyproxy/envoy
              replicas: 1
        image:
          repository: rancher/mirrored-cilium-cilium
          tag: v1.10.4
        nodeinit:
          image:
            repository: rancher/mirrored-cilium-startup-script
            tag: 62bfbe88c17778aad7bef9fa57ff9e2d4a9ba0d8
        operator:
          image:
            repository: rancher/mirrored-cilium-operator
            tag: v1.10.4
        preflight:
          image:
            repository: rancher/mirrored-cilium-cilium
            tag: v1.10.4
        kubeProxyReplacement: "strict"
        k8sServiceHost: 10.150.85.45
        k8sServicePort: 6443
Please replace the value of k8sServiceHost with the IP address of your control-plane. It's best if a load balancer is used, but if there is no load balancer, I would just use the address of the first control-plane node.
I think that the problem is lack of iptables. I would try to disable kube-proxy and enable full kube-proxy replacement in Cilium.
Would installing iptables on the nodes also resolve the problem? I would have expected kubelet and kube-proxy to use the iptables that's bundled in the image.
Hi there,
Any update on this one? I'm experiencing the same issue, although I have iptables installed:
$ rpm -qa | grep iptables
iptables-ebtables-1.8.4-20.el8.x86_64
iptables-1.8.4-20.el8.x86_64
iptables-libs-1.8.4-20.el8.x86_64
The Cilium DaemonSet agents show the following errors:
level=info msg="Auto-disabling \"enable-bpf-clock-probe\" feature since KERNEL_HZ cannot be determined" error="Cannot probe CONFIG_HZ" subsys=daemon
level=info msg="Using autogenerated IPv4 allocation range" subsys=node v4Prefix=10.83.0.0/16
level=info msg="Initializing daemon" subsys=daemon
level=info msg="Establishing connection to apiserver" host="https://100.68.0.1:443" subsys=k8s
level=info msg="Establishing connection to apiserver" host="https://100.68.0.1:443" subsys=k8s
level=error msg="Unable to contact k8s api-server" error="Get \"https://100.68.0.1:443/api/v1/namespaces/kube-system\": dial tcp 100.68.0.1:443: i/o timeout" ipAddr="https://100.68.0.1:443" subsys=k8s
level=fatal msg="Unable to initialize Kubernetes subsystem" error="unable to create k8s client: unable to create k8s client: Get \"https://100.68.0.1:443/api/v1/namespaces/kube-system\": dial tcp 100.68.0.1:443: i/o timeout" subsys=daemon
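The unreachable address in the logs is not a node IP but the ClusterIP of the kubernetes Service, which by convention is allocated as the first host address of the service CIDR (100.68.0.0/16 in the config below). That IP only exists as kube-proxy (or Cilium datapath) rules on each node, so with kube-proxy disabled and the replacement not active, nothing answers it. A quick sketch with Python's stdlib ipaddress module:

```python
import ipaddress

# service-cidr from the RKE2 config below; the apiserver "kubernetes"
# Service ClusterIP is conventionally the first host address of it.
service_cidr = ipaddress.ip_network("100.68.0.0/16")
apiserver_clusterip = next(service_cidr.hosts())
print(apiserver_clusterip)  # 100.68.0.1 -- the address Cilium times out on

# Sanity checks on the same config: cluster-dns must sit inside the
# service CIDR, and the pod CIDR must not overlap it.
cluster_cidr = ipaddress.ip_network("100.64.0.0/14")
print(ipaddress.ip_address("100.68.0.10") in service_cidr)  # True
print(cluster_cidr.overlaps(service_cidr))                  # False
```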
Here are my configurations (the relevant parts):
/var/lib/rancher/rke2/server/manifests/rke2-cilium-config.yaml:
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-cilium
  namespace: kube-system
spec:
  valuesContent: |-
    cilium:
      kubeProxyReplacement: "strict"
      k8sServiceHost: rancher.k8s.example.com
      k8sServicePort: 6443
      ipam:
        operator:
          clusterPoolIPv4PodCIDRList:
            - "100.64.0.0/14"
/etc/rancher/rke2/config.yaml:
cluster-cidr: "100.64.0.0/14"
service-cidr: "100.68.0.0/16"
cluster-dns: "100.68.0.10"
selinux: "true"
cni: "cilium"
disable-kube-proxy: "true"
disable:
  - rke2-ingress-nginx
Thanks & regards, Philip
Edit: I just realized the helm-install-rke2-cilium job does not seem to update the kube-system/cilium-config CM properly. kube-proxy-replacement is still set to disabled ...
As of the most recent round of releases, the chart values should no longer be nested under a cilium key.
valuesContent: |-
  kubeProxyReplacement: "strict"
  k8sServiceHost: rancher.k8s.example.com
  k8sServicePort: 6443
  ipam:
    operator:
      clusterPoolIPv4PodCIDRList:
        - "100.64.0.0/14"
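To see why the extra nesting matters, here is a hypothetical Python sketch of the Helm-side effect (the dict lookup stands in for chart templating and is illustrative, not the actual chart code): the chart reads the flag at the top level of its values, so a copy nested under a cilium key is simply never looked up.

```python
# Illustrative only: newer rke2-cilium charts template against the
# top-level kubeProxyReplacement value, so a flag nested under an extra
# "cilium" key is silently ignored and the chart default ("disabled")
# wins -- matching the cilium-config CM symptom described in this thread.
old_values = {"cilium": {"kubeProxyReplacement": "strict"}}  # pre-update layout
new_values = {"kubeProxyReplacement": "strict"}              # current layout

def chart_sees(values: dict) -> str:
    # The chart only consults the top level of the merged values.
    return values.get("kubeProxyReplacement", "disabled")

print(chart_sees(old_values))  # disabled -- flag never reaches the chart
print(chart_sees(new_values))  # strict
```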
Nice one 😅 ... that actually resolved my issue. The cilium-config CM now also has the proper flags set:
kube-proxy-replacement: strict
kube-proxy-replacement-healthz-bind-address: ""
BTW, I'm running RKE2 v1.22.8+rke2r1.
Thanks, @brandond!
Regards, Philip
I stumbled over this issue by chance and it is very familiar to me, especially the config snippets :-). It seems it was created by a Rancher/SUSE employee during our migration journey from SUSE CaaS to Rancher. Fortunately, the root cause could be found and fixed (it was a mismatch in the Linux netstack IPv4/IPv6 configs).
So feel free to close this issue, since it is no longer relevant!
Issue description: The cluster intermittently runs into network issues.
They end up with the DNS pods failing to make any requests.
The underlying issue is that the Kubernetes API is not reachable.
Business impact:
The cluster is unstable and barely usable.
Troubleshooting steps: Cluster nodes are able to talk to Rancher.
All the nodes are in the same subnet, with no firewall between them. The local node firewall is disabled as well. The nodes can reach ports 6443 and 9345.
iptables is not installed on the nodes.
They do not use kube-proxy in IPVS mode.
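The port checks above can be sketched as a minimal TCP reachability probe (the hostname in the usage comment is taken from this thread's configs; substitute your own control-plane address):

```python
import socket

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Usage against a control-plane node (hypothetical hostname):
# can_reach("rke-mgt-01.css.ch", 6443)   # kube-apiserver
# can_reach("rke-mgt-01.css.ch", 9345)   # RKE2 supervisor
```

Note that reaching 6443/9345 on a node only proves L4 connectivity; the failing address in this issue is a Service ClusterIP, which exists only if kube-proxy or Cilium's kube-proxy replacement programs it on the node.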