k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

k3s pods are unable to reach GCP Buckets #6117

Closed sd185406 closed 1 year ago

sd185406 commented 2 years ago

Environmental Info: K3s Version:

k3s version v1.24.4+k3s1 (c3f830e9)
go version go1.18.1

Node(s) CPU architecture, OS, and Version:

Linux ctm-ubantu-vm 5.4.0-1087-gcp #95~18.04.1-Ubuntu SMP Mon Aug 22 03:26:39 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

It's a single-node cluster where I have installed the K3s service and am running the application on top of it.

Describe the bug:

I created an Ubuntu VM in GCP, installed K3s, and deployed my application Helm chart on top of it. Some of the pods need to connect to the GCP bucket and pull images; however, that is not happening, and instead they throw the error below:

ERROR: gcloud crashed (ConnectionError): HTTPSConnectionPool(host='oauth2.googleapis.com', port=443)

Steps To Reproduce:

Expected behavior:

All pods should be up and running.

Actual behavior:

Pods are not running.

Additional context / logs:

root@ctm-ubantu-vm:~# kubectl get pods -n store
NAME                                          READY   STATUS                  RESTARTS         AGE
jarvis-scoxcashdelegate-6c4bf57d76-p85g2      0/1     Init:0/2                0                56m
jarvis-scoxcashservice-ddbbdcd46-d9245        0/1     Init:0/1                0                56m
jarvis-scoxprinter-b4577cf87-l9htt            0/1     Init:0/1                0                56m
jarvis-scoxdoc-bddcfd-5kms9                   0/1     Init:0/1                0                56m
jarvis-rediscache-69d48468dc-sgxcd            1/1     Running                 0                56m
jarvis-hivemqce-5888d55bf6-xwlhc              1/1     Running                 0                56m
jarvis-mongodb-5f6bf5d859-48jc4               1/1     Running                 0                56m
jarvis-jarvisconfigservice-784889995b-28klj   0/1     Init:CrashLoopBackOff   15 (3m4s ago)    56m
jarvis-scoxresources-7b5977bdd8-kqcvk         0/1     Init:CrashLoopBackOff   15 (3m1s ago)    56m
jarvis-scoxerrorlookup-5b5499d866-4h9m5       0/1     Init:CrashLoopBackOff   15 (2m45s ago)   56m
jarvis-scoxauthentication-846dcc684c-m968j    0/1     Init:CrashLoopBackOff   15 (2m33s ago)   56m

root@ctm-ubantu-vm:~# kubectl logs -p jarvis-jarvisconfigservice-784889995b-28klj -n store --all-containers
Initializing from GCS...
Version: gc://scox-configservice-assets/assets/assets-1.6.0.zip
Checking if file, assets-1.6.0.zip, is already in the destination path, /var/lib/ncr_scot/jarvisconfigservice/.
OVERWRITE_LOCAL_COPY is set to false.
Downloading from GCS
ERROR: gcloud crashed (TransportError): HTTPSConnectionPool(host='oauth2.googleapis.com', port=443): Max retries exceeded with url: /token (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fa61e401ee0>: Failed to establish a new connection: [Errno -3] Try again'))

brandond commented 2 years ago

Can you successfully access the GCP bucket from the node itself?
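
(A quick way to answer this from the node itself might look like the following sketch; the bucket path is derived from the init-container log above, and the presence of the Google Cloud SDK and gsutil on the node is an assumption.)

# Run on the node, not inside a pod
$ gsutil ls gs://scox-configservice-assets/assets/
# Check that the OAuth endpoint the pods fail against is reachable from the node
$ curl -sS -o /dev/null -w '%{http_code}\n' https://oauth2.googleapis.com/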

sd185406 commented 2 years ago

Yes, I am able to access the GCP buckets from the node and list the files inside the buckets (screenshot attached).

brandond commented 2 years ago

I see that you have a bunch of other pods that are also stuck in Init or are crashing. Are the pods for the packaged components running correctly? Can you attach the output of:
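
(The exact commands requested are not shown above; purely as an assumed illustration, a status check of the packaged components and the K3s service logs usually looks something like this.)

$ kubectl get pods -A -o wide
$ journalctl -u k3s --no-pager | tail -n 200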

sd185406 commented 2 years ago

logfiles.zip (screenshot attached)

Uploaded the log files as a zip; please find them attached.

brandond commented 2 years ago

Although they appear to be currently running, the logs show that many of the pods for packaged components (such as metrics-server) were stuck crashlooping from the beginning of the log at Sep 02 15:35:06 until Sep 02 19:48:45, just after a restart of the k3s service. It looks like you made some changes to the system configuration that allowed the pods to work. Can you provide more information on how you configured K3s (any CLI flags or configuration files you added) as well as information on what you changed around that time?

It also looks like the containerd log does not go back further than the restart at Sep 02 19:48 so I can't tell what was going on before that, but I suspect it is related to the problems with your workload pods.
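
(On a default install-script setup, the K3s configuration usually lives in a few predictable places; the following is a sketch of where to look, assuming the standard systemd install.)

# Flags baked into the systemd unit by the install script
$ systemctl cat k3s | grep -A3 ExecStart
# Environment file written by the install script
$ cat /etc/systemd/system/k3s.service.env
# Optional config file, if one was created
$ cat /etc/rancher/k3s/config.yaml 2>/dev/null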

sd185406 commented 2 years ago

I have configured the K3s service using the steps below:

$ export CLUSTER_CIDR=192.168.10.0/24
$ export SERVICE_CIDR=192.168.20.0/24
$ export EXTERNAL_IP=Your_VM_External_IP
$ export K3S_KUBECONFIG_MODE="644"
$ export INSTALL_K3S_EXEC="--cluster-cidr $CLUSTER_CIDR --service-cidr $SERVICE_CIDR --node-external-ip $EXTERNAL_IP"
$ curl -sfL https://get.k3s.io | sh -
$ k3s -v
$ systemctl status k3s.service
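
(To confirm the custom CIDRs actually took effect on the running cluster, something like the following might help; it assumes a default single-node install with the bundled flannel CNI.)

# Pod CIDR assigned to the node and the cluster DNS service IP
$ kubectl get nodes -o jsonpath='{.items[0].spec.podCIDR}{"\n"}'
$ kubectl -n kube-system get svc kube-dns -o jsonpath='{.spec.clusterIP}{"\n"}'
# Routes created by flannel/CNI for the pod network
$ ip route | grep -E 'cni0|flannel'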

Just as a quick workaround, I performed the troubleshooting steps below in order to find the root cause:

  1. I disabled the OS firewalls (checked getenforce, iptables, and firewalld).
  2. To check connectivity, I deployed a gcloud pod on the same Ubuntu VM (where K3s is running), logged into the pod, and ran gcloud auth login; however, it threw the same error as below (a sketch of how to recreate that test pod follows the error output):

ERROR: gcloud crashed (TransportError): HTTPSConnectionPool(host='oauth2.googleapis.com', port=443): Max retries exceeded with url: /token (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fa61e401ee0>: Failed to establish a new connection: [Errno -3] Try again'))
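
(A minimal sketch of recreating that test pod; the image tag and pod name are illustrative, not the exact ones used above.)

# Throwaway Cloud SDK pod for the same connectivity test
$ kubectl run gcloud-test --rm -it --restart=Never --image=google/cloud-sdk:slim -- bash
# then, inside the pod:
$ gcloud auth login --no-launch-browser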

It seems something is blocking the pod network, which is why the pods are unable to authenticate to the gcloud services.

The kube-system DNS (CoreDNS) logs are showing the following:

kubectl logs -f coredns-b96499967-rbqrg -n kube-system

[ERROR] plugin/errors: 2 oauth2.googleapis.com. AAAA: read udp 192.168.10.19:42945->8.8.8.8:53: i/o timeout
[ERROR] plugin/errors: 2 oauth2.googleapis.com. A: read udp 192.168.10.19:55572->8.8.8.8:53: i/o timeout
[ERROR] plugin/errors: 2 oauth2.googleapis.com. AAAA: read udp 192.168.10.19:58091->8.8.8.8:53: i/o timeout
[ERROR] plugin/errors: 2 oauth2.googleapis.com. A: read udp 192.168.10.19:54176->8.8.8.8:53: i/o timeout
[ERROR] plugin/errors: 2 oauth2.googleapis.com. A: read udp 192.168.10.19:35297->8.8.8.8:53: i/o timeout
[ERROR] plugin/errors: 2 oauth2.googleapis.com. AAAA: read udp 192.168.10.19:40570->8.8.8.8:53: i/o timeout
[ERROR] plugin/errors: 2 oauth2.googleapis.com. A: read udp 192.168.10.19:46817->8.8.8.8:53: i/o timeout
[ERROR] plugin/errors: 2 oauth2.googleapis.com. AAAA: read udp 192.168.10.19:40210->8.8.8.8:53: i/o timeout
[ERROR] plugin/errors: 2 oauth2.googleapis.com. A: read udp 192.168.10.19:35235->8.8.8.8:53: i/o timeout
[ERROR] plugin/errors: 2 oauth2.googleapis.com. AAAA: read udp 192.168.10.19:45841->8.8.8.8:53: i/o timeout
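
(These CoreDNS errors mean the DNS pod itself cannot reach the upstream resolver at 8.8.8.8 over the pod network, which matches the gcloud failures. A sketch of how to narrow that down; the busybox image and pod name are illustrative.)

# Can a throwaway pod reach the upstream resolver directly?
$ kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- nslookup oauth2.googleapis.com 8.8.8.8
# Does the same lookup work from the node? If so, only pod egress is broken.
$ nslookup oauth2.googleapis.com 8.8.8.8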

brandond commented 2 years ago

Yeah, it looks like container networking is not working for some reason. Can I ask why you've changed the container and service CIDR ranges? Do your custom ranges overlap with any of the subnets that the node is on? Can you confirm that there are no GCP-level security group rules that are interfering with outbound connections?
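
(A couple of checks that might answer those questions; the commands are generic, and the gcloud one assumes the Cloud SDK is configured.)

# Node addresses and routes; compare against 192.168.10.0/24 and 192.168.20.0/24
$ ip -4 addr show
$ ip route
# GCP-level firewall rules applied to the VM's network
$ gcloud compute firewall-rules list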

sd185406 commented 2 years ago

No David,

I have checked all the egress and ingress rules in GCP, and they allow connectivity to all my subnets.

As for the CIDR range configuration: yes, we use the above configuration for all our dev environments. We had set up the same environment on a CentOS VM and did not face any challenges there; all pods communicated with the GCP services seamlessly.

I think there might be some additional network configuration required on Ubuntu machines for the Rancher K3s setup, or some bridge has to be established between the network interfaces.

brandond commented 2 years ago

There isn't generally any special setup necessary on Ubuntu. Is there anything else installed on this node that might be interfering with this traffic? Docker, additional software managing firewall configuration or iptables, etc?
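
(A few node-level checks that might surface that kind of interference; a sketch, and not every tool will be present on every image.)

# Another container runtime also rewriting iptables?
$ systemctl is-active docker 2>/dev/null
# Host firewall state
$ ufw status
$ systemctl is-active firewalld 2>/dev/null
# Are the K3s/flannel chains present in iptables at all?
$ iptables-save | grep -cE 'KUBE|CNI|FLANNEL'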

sd185406 commented 2 years ago

Sorry, I closed it by mistake.

> There isn't generally any special setup necessary on Ubuntu. Is there anything else installed on this node that might be interfering with this traffic? Docker, additional software managing firewall configuration or iptables, etc?

No, we haven't installed anything other than K3s.

picassio commented 2 years ago

I have this issue too. Fresh install of k3s v1.24.4+k3s1 on Ubuntu 18.04 with ufw disabled by default and iptables version 1.6.1. It looks like the network has an issue: from a pod, I can ping the host IP, but cannot ping the host network's default gateway. I changed the host OS to Ubuntu 20.04, and everything works fine.
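
(Worth noting, with some hedging: Ubuntu 18.04 ships the iptables 1.6.x series, which predates the nftables backend, while Ubuntu 20.04 ships 1.8.x. Capturing the version and backend on both hosts can help pin down whether that difference is the culprit.)

$ iptables --version
# On 1.8.x hosts, shows whether the nft or legacy backend is selected
$ update-alternatives --display iptables 2>/dev/null | head -n 3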