loft-sh / vcluster

vCluster - Create fully functional virtual Kubernetes clusters - Each vcluster runs inside a namespace of the underlying k8s cluster. It's cheaper than creating separate full-blown clusters and it offers better multi-tenancy and isolation than regular namespaces.
https://www.vcluster.com
Apache License 2.0

Pods not syncing quickly: pod-syncer Error updating pod: context deadline exceeded #1765

Open alextricity25 opened 1 month ago

alextricity25 commented 1 month ago

What happened?

I'm experiencing a rather strange bug where some pods report being stuck in the "Init: 0/1" status, but only when connected to the vcluster context. When connected to the host cluster context, the pod status reports correctly; in the vcluster context, however, the pod status is occasionally stuck in either "PendingCreate" or "Init: 0/1", which causes downstream issues with my helm chart installation flow. The pod events while connected to the vcluster context report the following:

Warning  SyncError  13m   pod-syncer         Error updating pod: context deadline exceeded
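
For anyone trying to spot the same symptom, the events can be pulled up like this (pod name and namespace are placeholders, not from this report):

```console
# Run these against the vcluster context; <pod> and <namespace> are placeholders
kubectl describe pod <pod> -n <namespace>
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod> --sort-by=.lastTimestamp
```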

The below image shows the bug in action with two panes. The left pane is K9s connected to the vcluster context, and the right pane is connected to the host cluster context. As you can see, the pods in the host cluster are "Running", but the same pods in the vcluster context are stuck in the "Init: 0/1" status.

[screenshot: K9s side by side, vcluster context (left) vs. host cluster context (right)]

Looking at the pod events, I see the following:

[screenshot: pod events showing the pod-syncer SyncError]

The pod-syncer error only appears when connected to the vcluster.

The only "error" that I noticed in the vcluster syncer logs is:

filters/wrap.go:54  timeout or abort while handling: method=GET URI="/api/v1/namespaces/xrdm/pods/xxxx-portal-worker-66bc6969b-r4qqr/log?container=xxxx-portal-worker&follow=true&tailLines=100&timestamps=true" audit-ID="1d310b7a-7010-4c0e-a116-7b4127d94193"

What did you expect to happen?

I expect the pod status, while connected to the vcluster context, to reflect the correct status.

How can we reproduce it (as minimally and precisely as possible)?

  1. Install vcluster version v0.20.0-beta.5 using the helm chart.
  2. Connect to the vcluster and create a bunch of pods (a sketch follows this list).
  3. Observe that the pod-syncer occasionally fails to update the pod status in the vcluster.
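
A minimal sketch of those steps; the chart repo URL is the public loft charts repo, while the release name, namespace, and test deployment are placeholders I made up:

```console
# 1. Install the beta chart into the host cluster
helm repo add loft https://charts.loft.sh
helm upgrade --install my-vcluster loft/vcluster \
  --namespace team-x --create-namespace \
  --version 0.20.0-beta.5

# 2. Connect to the virtual cluster and create a burst of pods
vcluster connect my-vcluster -n team-x
kubectl create deployment churn-test --image=nginx --replicas=50

# 3. Watch pod status from the vcluster context and compare it with the host context
kubectl get pods -w
```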

Anything else we need to know?

The pods while connected to the vcluster context will eventually report the correct status, but sometimes it takes 5 or 10 minutes before it does.

Host cluster Kubernetes version

```console
$ kubectl version
Client Version: v1.29.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.0
```

Host cluster Kubernetes distribution

```
GKE-1.29.0
```

vcluster version

```console
v0.20.0-beta.5
```

vCluster Kubernetes distribution (k3s (default), k8s, k0s)

```
k8s
```

OS and Arch

```
OS: GKE containerd image
Arch:
```
heiko-braun commented 1 month ago

Let me summarise: the pods get scheduled on the host, but the vcluster api server doesn’t reflect the status correctly?

Do you observe significant load (i.e. request latency, total requests increased) on the host api server when this happens?
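
One rough way to check this, assuming you can read the host API server's /metrics endpoint (the metric names are standard kube-apiserver metrics; the grep is just illustrative and the context name is a placeholder):

```console
# Request volume and cumulative latency on the host API server
kubectl --context <host-context> get --raw /metrics \
  | grep -E '^apiserver_request_total|^apiserver_request_duration_seconds_sum' \
  | head -n 20
```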

everflux commented 1 month ago

Looks similar to https://github.com/loft-sh/vcluster/issues/1589 to me

alextricity25 commented 1 month ago

@heiko-braun

> Let me summarise: the pods get scheduled on the host, but the vcluster api server doesn’t reflect the status correctly?

That's correct!

> Do you observe significant load (i.e. request latency, total requests increased) on the host api server when this happens?

I do not. The vcluster pods on the host cluster are given a good amount of resources: 3 vCPUs and 4Gi of memory. Usually the node this vcluster pod is on is nowhere near these limits.

@everflux

> Looks similar to https://github.com/loft-sh/vcluster/issues/1589 to me

Yes, indeed! I suppose this issue can be marked as a duplicate. Thanks for catching that!

FabianKramm commented 1 month ago

@alextricity25 would you mind trying virtual Kubernetes version v1.29.4? There was a pretty significant bug in v1.29.0 that caused issues (https://github.com/kubernetes/kubernetes/issues/123448), which could be the cause of this.

alextricity25 commented 1 month ago

> @alextricity25 would you mind trying virtual Kubernetes version v1.29.4? There was a pretty significant bug in v1.29.0 that caused issues (kubernetes/kubernetes#123448), which could be the cause of this.

@FabianKramm The default is v1.29.0. I'll override it to v1.29.4 to see if that does anything. It's difficult to iterate and test whether this change fixes the issue, because it doesn't happen all the time. I'll let you know!
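
For anyone following along, the override should look roughly like this in the v0.20 vcluster.yaml (keys as I understand the beta schema, so double-check against the docs):

```yaml
controlPlane:
  distro:
    k8s:
      apiServer:
        image:
          tag: v1.29.4
```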

everflux commented 1 month ago

@FabianKramm I tested 1.29.4 (Server Version: v1.29.4+k3s1, I think embedded db/sqlite) with helm chart v0.20.0-beta.5 and still observed the issue once the metrics server was present in the host cluster. (host is 1.18.2)

alextricity25 commented 1 month ago

@FabianKramm I also observed the issue again on 1.29.5; screenshot below. The pods in the vCluster context eventually did report the correct status, but it took about 1-2 minutes before they did.

[screenshot: pods still stuck in the vCluster context on 1.29.5]
alextricity25 commented 1 month ago

Interesting side note: when this happens, restarting the vcluster pod running the syncer service gets things running again.
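
For reference, the workaround restart is something along these lines (release name, namespace, and host context are placeholders; by default the vcluster control plane runs as a StatefulSet on the host):

```console
# Restart the vcluster control-plane/syncer pod on the host cluster
kubectl --context <host-context> -n team-x rollout restart statefulset my-vcluster
```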

FabianKramm commented 1 week ago

@alextricity25 would you mind sharing the logs? Are you using NFS backing store by any chance?

FabianKramm commented 1 week ago

@alextricity25 is there a way for you to collect metrics from the vCluster itself? Basically, what we are looking for is the range(apiserver_watch_cache_events_dispatched_total[1m]) metric: if it is greater than zero when this happens, vCluster is still receiving events, which would indicate a problem on our syncer side; if the metric is zero or close to zero, that would indicate a problem within Kubernetes or the backing store itself.
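
If it helps, the raw counter can be sampled straight from the virtual API server and compared over a minute, roughly like this (assumes kubectl points at the vcluster context and /metrics is readable):

```console
# Sample the watch-cache event counter twice, one minute apart; if the values
# barely move while pods are churning, events are not reaching the watch cache
kubectl get --raw /metrics | grep '^apiserver_watch_cache_events_dispatched_total'
sleep 60
kubectl get --raw /metrics | grep '^apiserver_watch_cache_events_dispatched_total'
```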

FabianKramm commented 1 week ago

@alextricity25 I ran a test in GKE v1.30.1 with vCluster v0.20.0-beta.10 with the following vcluster.yaml:

```yaml
controlPlane:
  distro:
    k8s:
      apiServer:
        image:
          tag: v1.29.5
      resources:
        limits:
          cpu: 2
          memory: 4Gi
        requests:
          cpu: 2
          memory: 4Gi
      securityContext:
        seccompProfile:
          type: Unconfined
  hostPathMapper:
    enabled: true
  statefulSet:
    resources:
      limits:
        cpu: "3"
        ephemeral-storage: 8Gi
        memory: 10Gi
      requests:
        cpu: "3"
        ephemeral-storage: 8Gi
        memory: 10Gi

sync:
  fromHost:
    nodes:
      enabled: true
      syncBackChanges: true
  toHost:
    persistentVolumeClaims:
      enabled: true
    serviceAccounts:
      enabled: true
```
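
A config like this would typically be applied with something along these lines; the release name, namespace, and chart version here are placeholders, not part of the original test description:

```console
helm upgrade --install test-vcluster loft/vcluster \
  --namespace test-vcluster --create-namespace \
  --version 0.20.0-beta.10 -f vcluster.yaml
```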

And then deployed a deployment.yaml with:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test
spec:
  replicas: 200
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      annotations:
        my-annotation: my-annotation
      labels:
        app: test
    spec:
      containers:
        - name: nginx
          image: nginx
          resources:
            limits:
              cpu: 100m
              memory: 128Mi
            requests:
              cpu: 50m
              memory: 64Mi
```

And then I changed the deployment a couple of times, but it never took longer than a minute to get all those 200 pods to Running. Do you have any special configuration for GKE, or are you using a special deployment? How many pods are we talking about, and how much load is the vCluster under?
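
"Changed the deployment a couple of times" can be approximated by forcing rollouts, for example by bumping the pod-template annotation from the manifest above (the annotation value here is arbitrary):

```console
# Force a new rollout of all 200 replicas, then watch them cycle
kubectl patch deployment test -p \
  '{"spec":{"template":{"metadata":{"annotations":{"my-annotation":"rollout-2"}}}}}'
kubectl rollout status deployment/test
kubectl get pods -w
```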

FabianKramm commented 1 week ago

@alextricity25 good news, I think we found the issue: it's related to us using an older Kine version. The linked PR should fix this and will be part of the next release.

alextricity25 commented 1 week ago

@FabianKramm Thanks for all of your investigation on this issue. I really appreciate it, and it has not gone unnoticed.

About a week ago, we switched our distro and version from k8s v1.29.5 to k3s v1.30.2-rc3-k3s1. Ever since then, we haven't observed these timeouts or latency issues between the host pods and the vcluster pods. Actually, things have been running pretty smoothly. Do you think the version bump to v1.30.2-rc3-k3s1 may be related to the fix in your findings, or could it be something completely different?

In regards to your requests, it's possible that I can switch back to the k8s distro with the previous version we had defined to see if I can reproduce the problem and gather the logs and metrics you are asking for. We are not running an NFS backing store. We are running a pretty standard Dataplane v2 GKE cluster and using the vcluster helm chart to install vcluster with the values I gave above. This will have to wait until next week for me, though.

Another interesting thing to note is that we only saw this problem when a node with the vcluster syncer service had been running for a few days. Restarting the node would temporarily fix the issue, but after a while any other vcluster syncer services that landed on that node would be impacted as well.
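
For reference, the distro switch described above would look roughly like this in the v0.20 vcluster.yaml (keys per my reading of the beta schema, so treat this as a sketch):

```yaml
controlPlane:
  distro:
    k3s:
      enabled: true
      image:
        tag: v1.30.2-rc3-k3s1
```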

FabianKramm commented 1 week ago

@alextricity25 thanks for the response! The problem we identified lies within Kine up to version v0.11.4, which had a bug that caused this issue: watch requests from the vCluster syncer were starved and vCluster wouldn't get any notifications anymore, which led to the pod statuses not being updated in time. Restarting the vCluster worked because Kine was restarted and the watch request was reestablished.

Upgrading to k3s version v1.30.2-rc3-k3s1 fixes the problem (as does using etcd as a backend instead of Kine), because that release upgrades the internally used Kine version to v0.11.9. Older k3s versions that use a Kine version below v0.11.4 will still have that problem.

Since k8s & vCluster v0.20 were using Kine version v0.11.1, this bug was also present there and caused the symptoms you saw. To fix this, we upgraded Kine to v0.11.10 in PR #1894, which is now part of the new beta release v0.20.0-beta.11, so that release should not have this issue anymore. If you find the time, it would be nice if you could validate that, but we are very confident that this caused your issue and is now fixed.
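
As a side note, the etcd-backend alternative mentioned above would be configured roughly like this in v0.20 (again a sketch; verify the keys against the current docs):

```yaml
controlPlane:
  backingStore:
    etcd:
      deploy:
        enabled: true
```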

everflux commented 1 week ago

Thank you so much for your work, this hopefully solves a really annoying issue. I would love to help validate the fix! But I have no clue how to trigger it. I used shell scripts creating hundreds of pods, deleting them, and creating new ones, and it all worked like a charm, until a handful of real users tried basically the same stuff and the issue appeared. So, any hints on how to force it so I can give it a run?

FabianKramm commented 1 week ago

@everflux I'm not sure how to trigger it; within my testing I could only reproduce it once yesterday with Kine v0.11.1. It really occurs randomly, as I tried the exact same steps again with the same configuration and didn't hit the issue a second time.

But we have a customer who told us basically the same thing as @alextricity25: they tried many different k3s versions, and the newer ones all worked while the older ones all had that problem sooner or later. So I compared the Kine versions there and found that all k3s versions bundling Kine up to v0.11.3 had the issue, while all newer Kine versions seem to work fine.