Was reviewing code to try to understand where the 'allowWatchBookmarks' calls would have come from. Probably they come from one of my calls to
There is only one call to the first while there are 2 or 3 calls to the second.
The call to UntilWithSync is in the code that waits for container status to show up.
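For context, the wait is basically the standard client-go list/watch pattern; a trimmed-down sketch of roughly what my code does (the helper name, field selector, and condition are simplified for illustration):

```go
package waitstatus

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	watchtools "k8s.io/client-go/tools/watch"
)

// waitForEphemeralContainerStatus blocks until the named ephemeral container
// reports any status on the pod, or the context is cancelled. (Sketch only;
// the real code checks more conditions.)
func waitForEphemeralContainerStatus(ctx context.Context, client kubernetes.Interface, ns, podName, containerName string) error {
	fieldSelector := fields.OneTermEqualSelector("metadata.name", podName).String()
	lw := &cache.ListWatch{
		ListFunc: func(options metav1.ListOptions) (runtime.Object, error) {
			options.FieldSelector = fieldSelector
			return client.CoreV1().Pods(ns).List(ctx, options)
		},
		WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
			options.FieldSelector = fieldSelector
			return client.CoreV1().Pods(ns).Watch(ctx, options)
		},
	}
	// UntilWithSync runs an informer under the hood; I believe its watch
	// requests are where the allowWatchBookmarks calls come from.
	_, err := watchtools.UntilWithSync(ctx, lw, &corev1.Pod{}, nil, func(ev watch.Event) (bool, error) {
		pod, ok := ev.Object.(*corev1.Pod)
		if !ok {
			return false, nil
		}
		for _, s := range pod.Status.EphemeralContainerStatuses {
			if s.Name == containerName {
				return true, nil // some status is now visible for the container
			}
		}
		return false, nil
	})
	return err
}
```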
Commented out the code calling UntilWithSync. Before there were always two error messages logged back to back. Afterwards just one. Still the sync didn't happen like I expected.
Fwiw, confirmed no problem when not using vcluster.
@dee0sap sorry for the late reply and thanks for creating this issue! It seems our logic is a little outdated there and I'll try to reproduce this and fix it
I think the problem is that adding 2 ephemeral containers at once, or too quickly, runs into issues. I will create a fix for this and it would be great if you could then test it with the latest v0.21.0-beta
Thanks @FabianKramm
I have doubts that the problem is with the 2nd ephemeral container being added too quickly, but maybe it could be due to the 1st one being added too quickly after pod creation.
I say this because one of the things I did when trying to debug was to set a breakpoint in my code just before the point where it would add the second ephemeral container and then, when the breakpoint was hit, I waited 1 minute, as measured by the timer on my cell phone, before allowing execution to continue. This didn't make a difference. (I mentioned this in the original description.)
However... I didn't do anything similar between the point where the pod is initially created and the point where the first ephemeral container is added. And that is a significant difference between what happens when testing my code and what the shell script I shared in the description does, i.e. the shell script that did not replicate the problem.
So I'll find time today to do another test, one where I insert a long pause between the initial creation of the pod and the addition of the first ephemeral container, and I'll share whatever the results are here.
And of course when you have a candidate fix ready I'll give it a shot as well.
Hey @FabianKramm
So for testing purposes, I updated my code, and test code, so that
Same result... in the vcluster both ephemeral containers get added, but on the host cluster only the first is added. I don't notice any problem with the first ephemeral container; it seems to run fine on the host.
@dee0sap thanks for the additional information, we released v0.21.0-beta.2 that should include a fix for this, would you mind testing that version and seeing if it solves the problem?
Hey @FabianKramm Seems fixed in version 0.21.0-beta.2. However, I did run into a couple of problems moving from 0.19.6 to the new version:
Init manifests aren't applied before any pods are created. I need to make sure the default PriorityClass that is defined on the host is copied into the vcluster before any pods are created. With 0.19.6 I was able to do this via .init.manifests. I am trying to do the same thing with the new version via .experimental.deploy.vcluster.manifests (rough config sketch below). While the PriorityClass is copied over, it isn't copied before coredns is created. To get around this I am deleting the coredns pod in the vcluster right after vcluster creation and before doing anything else.
It seems vcluster is no longer able to read a values file through a symlink. My shell script for creating the vcluster invoked vcluster create like this:
vcluster create ... -f <(cat <<EOF...
The process substitution I was using yields a symlink. After switching to the new version, vcluster acted as if I hadn't passed a values file at all. I worked around this by first saving the config I am generating to a normal file and then passing the name of that file to vcluster.
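For reference on the first item, this is roughly the shape of what I'm putting in the new config (the PriorityClass here is just an illustrative stand-in for the host's default one):

```yaml
experimental:
  deploy:
    vcluster:
      manifests: |-
        apiVersion: scheduling.k8s.io/v1
        kind: PriorityClass
        metadata:
          name: host-default            # illustrative name
        value: 1000
        globalDefault: true
        description: "Copy of the host cluster's default PriorityClass"
```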
Please let me know if I should open up issues for either of these items.
@dee0sap thanks for the feedback and sorry for the late reply, I was on vacation. I can fix the first issue you mentioned, but for the second it would be great if you create a separate issue. Also glad to hear the problem is fixed!
What happened?
(From https://loft-sh.slack.com/archives/C01N273CF4P/p1727892709938959)
I have a scenario where a 2nd patch of a pod, one which adds a second ephemeral container, doesn't get synced to the host. However, this doesn't always happen. For example, the following shell script is not a problem:
However, in the kubectl plugin I am working on, the story is different. Some history on this effort of mine...
The plugin extends the kubectl debug command. Given the right set of flags it creates a sidecar ephemeral container in addition to the ephemeral container that kubectl debug code would normally create.
The normal flow of kubectl debug is something like:
When I started working on the plugin I was creating the sidecar ephemeral container in the overridable attach function.
Basically I was re-running the 'kubectl debug' code so that the sidecar would get created.
I initially had a problem where the sidecar would get added in the vcluster but never show up on the host (the same symptom I have now). I found that if I updated my override of the attach code so that, before attempting to create the sidecar, it waited for a container status to be visible for the debug container, then all was good.
Because of a change in requirements, I had to move the creation of the sidecar into step #3. This means that now the sidecar ephemeral container is created prior to the debug container. Because of my previous experience, I put in a wait for the sidecar's container status to show up prior to moving on to the creation of the debug container.
However, unlike before, where the second ephemeral container would show up on the host as long as I waited for some status to be visible on the first, the second ephemeral container (that is, the debug container now) never shows up on the host.
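For reference, each ephemeral container is added the way kubectl debug adds its debug container, with a strategic-merge patch against the pod's ephemeralcontainers subresource. A simplified sketch of that step (helper name and details are illustrative):

```go
package patchpod

import (
	"context"
	"encoding/json"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/strategicpatch"
	"k8s.io/client-go/kubernetes"
)

// addEphemeralContainer appends one ephemeral container to a pod by patching
// the "ephemeralcontainers" subresource. The plugin does this twice: once for
// the sidecar, then (after waiting for the sidecar's status) once for the
// debug container; the second patch is the one that never reaches the host.
func addEphemeralContainer(ctx context.Context, client kubernetes.Interface, ns string, pod *corev1.Pod, ec corev1.EphemeralContainer) (*corev1.Pod, error) {
	oldJS, err := json.Marshal(pod)
	if err != nil {
		return nil, err
	}
	modified := pod.DeepCopy()
	modified.Spec.EphemeralContainers = append(modified.Spec.EphemeralContainers, ec)
	newJS, err := json.Marshal(modified)
	if err != nil {
		return nil, err
	}
	patch, err := strategicpatch.CreateTwoWayMergePatch(oldJS, newJS, pod)
	if err != nil {
		return nil, err
	}
	return client.CoreV1().Pods(ns).Patch(ctx, pod.Name, types.StrategicMergePatchType,
		patch, metav1.PatchOptions{}, "ephemeralcontainers")
}
```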
Some things I tried, without luck, to get around this are:
While using 0.20.1 and replicating the failure case, I observed the following:
Before the 1st patch, the most recent log message was:
2024-10-03 23:35:13 INFO commandwriter/commandwriter.go:126 watch chan error: etcdserver: mvcc: required revision has been compacted {"component": "vcluster", "component": "apiserver", "location": "watcher.go:338"}
The resourceVersions were - host: 2677250701, vcluster: 468
After the first patch the most recent log messages were:
The resourceVersions were - host: 2677260040, vcluster: 496
The expected ephemeral container was visible both on the host and on the vcluster.
Right before the 2nd patch there were these log messages:
The resource versions were unchanged from before.
After the 2nd patch there were these log messages:
Resource version on host unchanged; resource version on vcluster: 524.
NOTE: The error log messages did not happen when the shell script ran.
Some minutes after the 2nd patch I used 'kubectl edit' to add an annotation to the target pod in the vcluster. The annotation was synced to the host cluster. However, the list of ephemeral containers remained out of sync: there were two of them on the vcluster but only one on the host.
What did you expect to happen?
I expected the 2nd ephemeral container to appear on the pod in the host cluster.
How can we reproduce it (as minimally and precisely as possible)?
Not sure. Obviously using 'kubectl debug' doesn't reproduce the situation. With instruction, perhaps I can collect more logs that would reveal a sequence of API calls that reproduces the problem.
Anything else we need to know?
Probably ;)
Host cluster Kubernetes version
vcluster version
VCluster Config