The exact same cluster, after I've edited it to remove spec.kubelet.cpuManagerPolicy: static, updated the cluster, and rolled all the nodes, now looks like this when I exec to a pod:
$ kubectl exec -t -i aws-node-44zdp -- sh
Defaulted container "aws-node" out of: aws-node, aws-vpc-cni-init (init)
sh-4.2#
I should also add that creating a Kubernetes 1.23.9 cluster with kops 1.24.0 ends up in the same bad state. But if I create the exact same Kubernetes 1.23.9 cluster using kops 1.23.2 then the cluster is healthy.
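For reference, the edit, update, and roll cycle mentioned above is just the usual kops flow; roughly (cluster name and exact flags here are illustrative, not necessarily what I ran verbatim):
$ kops edit cluster --name $CLUSTER_NAME                     # remove spec.kubelet.cpuManagerPolicy
$ kops update cluster --name $CLUSTER_NAME --yes
$ kops rolling-update cluster --name $CLUSTER_NAME --yes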
Managed to reproduce. This is also breaking:
ctr -n k8s.io task exec -t --exec-id sh_1 <container id> sh
There is a somewhat similar issue at https://github.com/containerd/containerd/issues/7219
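For anyone triaging their own nodes, the runc build actually in use can be checked directly on a node; assuming runc is on the PATH of the node image, something like:
$ runc --version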
what's the runc version you are using?
VERSION:
1.1.3
commit: v1.1.3-0-g6724737f
spec: 1.0.2-dev
go: go1.17.10
libseccomp: 2.5.4
I think we've hit the same issue. You can try runc 1.1.2; it works on my cluster. But I haven't dug deeper into this compatibility problem, so I haven't updated the issue opened on the containerd side. Some change that went into 1.1.3 must have introduced this issue.
I can confirm that runc 1.1.2 works.
@hakman should we downgrade or wait for a fix?
Let's wait and add blocks-next to this issue. I don't think there is a plan for another release in the next 2 weeks.
Sounds good
This issue is causing terrible damage in all our test environments and in some production ones that were upgraded recently.
Neither our developers nor the Ops team members can exec into any container at all. Also, some scheduled jobs are failing because they are unable to perform the exec calls they depend on (backups, internal calls...).
Honestly, setting this as blocks-next instead of immediately releasing a fix is a terribly wrong decision. The impact of this is pretty big. We are not talking about an RC but a stable release affected by a big issue.
We do not have the option "cpuManagerPolicy: static" configured in our manifest. On the other hand, we have "cpuCFSQuota: false".
For now, I have applied the following patch, which needs to be configured on every instance group, to downgrade runc to 1.1.2:
$ kops edit ig --name=${KOPS_CLUSTER_NAME} nodes-test-2
...
spec:
  additionalUserData:
  - content: |
      #!/bin/sh
      echo "xdowngradecrun.sh: rolling back to runc 1.1.2"
      sudo /usr/bin/wget -q https://github.com/opencontainers/runc/releases/download/v1.1.2/runc.amd64 -O /usr/sbin/runc
      sudo chmod 755 /usr/sbin/runc
    name: xdowngradecrun.sh
    type: text/x-shellscript
It is still being tested. I changed runc manually on a node yesterday and could exec into a container after restarting it. On the other hand, this morning I couldn't exec into the test container again. I am going to apply this change globally in the cluster to get the proxy containers running with the right runc version, and check whether this fix holds until a new kops release solves this.
IMHO, releasing a new kops 1.24 version that downgrades runc should be done immediately. Additionally, I would like to encourage you to add an exec call to the tests run before a new release is promoted to stable.
Thank you for your guidance in tracking down the root issue. I will update this comment if this solution stays stable for more than 24h.
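For completeness, once the instance groups pick up that user data and the nodes are rolled, a quick check along these lines should confirm the downgrade (the pod name is just the aws-node example from earlier; any pod will do):
$ runc --version                              # on a replaced node; should report 1.1.2
$ kubectl exec -t -i aws-node-44zdp -- sh     # should drop into a shell instead of erroring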
It's only with certain configurations that kubectl exec fails. We do a number of exec tests as part of the Kubernetes e2e test suite (e.g. https://testgrid.k8s.io/kops-versions#kops-aws-k8s-latest). We test a fairly huge amount of configurations, but testing all permutations is not possible. When non-standard configurations are used, we highly encourage testing the betas. As this issue made it into your production environments as well, I am sure you can appreciate how hard it is to catch such issues.
I will always appreciate the hard work you all make!
That being said, I would still recommend a new release now that 2 cases have been discovered:
In 1.24 branch you now have the ability to configure runc version. See https://kops.sigs.k8s.io/cluster_spec/#runc-version-and-packages
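A rough sketch of what that cluster-spec override looks like (field names per the linked docs; the 1.1.2 pin is just the workaround discussed above):
spec:
  containerd:
    runc:
      version: 1.1.2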
You can get the latest build at this location: $(curl https://storage.googleapis.com/k8s-staging-kops/kops/releases/markers/release-1.24/latest-ci.txt)/linux/amd64/kops
Change OS/arch as appropriate. Please test if you have the chance.
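Expanded into commands, fetching that build would look roughly like this for linux/amd64 (a sketch; swap OS/arch as needed):
$ BASE=$(curl -s https://storage.googleapis.com/k8s-staging-kops/kops/releases/markers/release-1.24/latest-ci.txt)
$ curl -Lo kops "${BASE}/linux/amd64/kops"
$ chmod +x kops && ./kops version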
Right now, we believe there will be another 1-2 weeks before we do a stable release.
runc version 1.1.4 was released a few hours ago. :excited:
/kind bug
1. What kops version are you running? The command kops version, will display this information.

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.

3. What cloud provider are you using?

aws

4. What commands did you run? What is the simplest way to reproduce this issue?

After preparing the AWS account for the cluster, I'm using the kops create -f command with a manifest file to define the cluster, then kops update cluster --admin --name $CLUSTER_NAME --yes to bring it up. Once the cluster is ready, using kubectl exec -it to run a command on any pod in the cluster results in an error like so:

If I modify the cluster manifest to remove the spec.kubelet.cpuManagerPolicy: static entry from it and recreate the cluster (or just update it and roll the nodes), then the problem is gone and everything works as expected.

5. What happened after the commands executed?

The cluster comes up and kops validates it properly, and all pods are in the Running state and appear to be ready. However most pods cannot open /dev/pts/0 and so are actually broken. If I exec to any pod in the cluster the error can be seen like so:

6. What did you expect to happen?

Pods shouldn't throw errors about having a permission problem when opening /dev/pts/0.

7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.

The AWS account number has been replaced with 000000000000 in the manifest below, and many other things have been replaced with REDACTED.

8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else do we need to know?

I've tried various manifests, and even one that is significantly cut down but uses the cpuManagerPolicy: static directive ends up in the same bad state.
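For context, the fragment of those manifests that makes the difference is just the kubelet policy setting; a minimal sketch with everything else in the cluster spec omitted:
spec:
  kubelet:
    cpuManagerPolicy: static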