kubernetes / kops

Kubernetes Operations (kOps) - Production Grade k8s Installation, Upgrades and Management
https://kops.sigs.k8s.io/
Apache License 2.0

"cpumanagerpolicy: static" breaks on kops 1.24.0 #14007

Closed: jim-barber-he closed this issue 2 years ago

jim-barber-he commented 2 years ago

/kind bug

1. What kops version are you running? The command kops version will display this information.

$ kops version        
Client version: 1.24.0 (git-v1.24.0)

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.

$ kubectl version --short                                                   
Flag --short has been deprecated, and will be removed in the future. The --short output will become the default.
Client Version: v1.24.3
Kustomize Version: v4.5.4
Server Version: v1.24.3

3. What cloud provider are you using?

aws

4. What commands did you run? What is the simplest way to reproduce this issue?

After preparing the AWS account for the cluster, I use the kops create -f command with a manifest file to define the cluster, then kops update cluster --admin --name $CLUSTER_NAME --yes to bring it up. Once the cluster is ready, using kubectl exec -it to run a command in any pod in the cluster results in an error like so:

error: Internal error occurred: error executing command in container: failed to exec in container: failed to start exec "1c7c93d244e4122381266e507066a2b73b8571187f2f1675c190717860611829": OCI runtime exec failed: exec failed: unable to start container process: open /dev/pts/0: operation not permitted: unknown
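
Spelled out as a rough sketch (the manifest filename and pod name are illustrative placeholders; CLUSTER_NAME and the kops state store are assumed to be set in the environment):

# Create the cluster from a manifest that sets spec.kubelet.cpuManagerPolicy: static, then bring it up.
kops create -f cluster.yaml
kops update cluster --admin --name "$CLUSTER_NAME" --yes

# Once the cluster validates, exec into any pod; this is where the failure shows up.
kops validate cluster --name "$CLUSTER_NAME" --wait 10m
kubectl exec -it <any-pod> -- sh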

If I modify the cluster manifest to remove the spec.kubelet.cpuManagerPolicy: static entry and recreate the cluster (or just update it and roll the nodes), then the problem is gone and everything works as expected.

5. What happened after the commands executed?

The cluster comes up and kops validates it properly; all pods are in the Running state and appear to be ready. However, most pods cannot open /dev/pts/0 and so are actually broken. If I exec into any pod in the cluster, the error can be seen like so:

$ kubectl exec -t -i aws-node-b4khg -- sh                        
Defaulted container "aws-node" out of: aws-node, aws-vpc-cni-init (init)
error: Internal error occurred: error executing command in container: failed to exec in container: failed to start exec "1c7c93d244e4122381266e507066a2b73b8571187f2f1675c190717860611829": OCI runtime exec failed: exec failed: unable to start container process: open /dev/pts/0: operation not permitted: unknown

6. What did you expect to happen?

Pods shouldn't throw errors about having a permission problem when opening /dev/pts/0.

7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.

The AWS account number has been replaced with 000000000000 in the manifest below and many other things replaced with REDACTED.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2022-07-19T07:57:47Z"
  name: test4.he0.io
spec:
  additionalNetworkCIDRs:
  - 10.195.0.0/16
  additionalPolicies:
    master: |
      [
        {
          "Action": [
            "sts:AssumeRole"
          ],
          "Effect": "Allow",
          "Resource": [
            "arn:aws:iam::000000000000:role/kiam.*"
          ]
        }
      ]
  api:
    loadBalancer:
      class: Network
      sslCertificate: arn:aws:acm:ap-southeast-2:000000000000:certificate/REDACTED
      sslPolicy: ELBSecurityPolicy-TLS13-1-3-2021-06
      type: Internal
  authentication:
    aws:
      backendMode: CRD
      clusterID: test4.he0.io
      identityMappings:
      - arn: arn:aws:iam::000000000000:role/AWSReservedSSO_AwsAdmin_12617084b159f311
        groups:
        - system:masters
        username: admin:{{SessionNameRaw}}
      - arn: arn:aws:iam::000000000000:role/AWSReservedSSO_Developer_7de86b085b6056ae
        groups:
        - he:dev
        username: dev:{{SessionNameRaw}}
  authorization:
    rbac: {}
  certManager:
    enabled: true
    managed: false
  channel: stable
  cloudControllerManager: {}
  cloudLabels:
    he:Application: kubernetes
    he:EnvironmentName: test4
    he:EnvironmentType: testing
  cloudProvider: aws
  configBase: s3://REDACTED/test4.he0.io
  dnsZone: Z08242392GCJTDNB7BIYT
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-ap-southeast-2a
      kmsKeyId: arn:aws:kms:ap-southeast-2:000000000000:key/REDACTED
      name: a
    - encryptedVolume: true
      instanceGroup: master-ap-southeast-2b
      kmsKeyId: arn:aws:kms:ap-southeast-2:000000000000:key/REDACTED
      name: b
    - encryptedVolume: true
      instanceGroup: master-ap-southeast-2c
      kmsKeyId: arn:aws:kms:ap-southeast-2:000000000000:key/REDACTED
      name: c
    memoryRequest: 900Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-ap-southeast-2a
      kmsKeyId: arn:aws:kms:ap-southeast-2:000000000000:key/REDACTED
      name: a
    - encryptedVolume: true
      instanceGroup: master-ap-southeast-2b
      kmsKeyId: arn:aws:kms:ap-southeast-2:000000000000:key/REDACTED
      name: b
    - encryptedVolume: true
      instanceGroup: master-ap-southeast-2c
      kmsKeyId: arn:aws:kms:ap-southeast-2:000000000000:key/REDACTED
      name: c
    memoryRequest: 200Mi
    name: events
  externalDns:
    provider: external-dns
  fileAssets:
  - content: |
      -----BEGIN RSA PRIVATE KEY-----
      REDACTED
      -----END RSA PRIVATE KEY-----
    name: sa-signer.key
    path: /srv/kubernetes/kube-apiserver/sa-signer.key
    roles:
    - Master
  - content: |
      -----BEGIN PUBLIC KEY-----
      REDACTED
      -----END PUBLIC KEY-----
    name: sa-signer-pkcs8.pub
    path: /srv/kubernetes/kube-apiserver/sa-signer-pkcs8.pub
    roles:
    - Master
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    apiAudiences:
    - sts.amazonaws.com
    logFormat: json
    maxMutatingRequestsInflight: 400
    maxRequestsInflight: 800
    serviceAccountIssuer: https://REDACTED.s3-ap-southeast-2.amazonaws.com
    serviceAccountKeyFile:
    - /srv/kubernetes/kube-apiserver/sa-signer-pkcs8.pub
    - /srv/kubernetes/kube-apiserver/server.key
    serviceAccountSigningKeyFile: /srv/kubernetes/kube-apiserver/sa-signer.key
  kubeControllerManager:
    logFormat: json
  kubeDNS:
    externalCoreFile: |
      .:53 {
        log
        errors
        health {
          lameduck 5s
        }
        ready
        kubernetes cluster.local. in-addr.arpa ip6.arpa {
          pods insecure
          fallthrough in-addr.arpa ip6.arpa
          ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf {
          max_concurrent 1000
        }
        cache {
          success 9984 30
          denial 9984 30
          prefetch 1 300s 15%
        }
        loop
        reload
        loadbalance
      }
    nodeLocalDNS:
      enabled: true
      memoryRequest: 32Mi
  kubeScheduler:
    logFormat: json
  kubelet:
    allowedUnsafeSysctls:
    - net.core.somaxconn
    - net.ipv4.tcp_keepalive_time
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
    cpuManagerPolicy: static
    evictionHard: memory.available<200Mi,nodefs.available<10%,nodefs.inodesFree<5%,imagefs.available<15%,imagefs.inodesFree<5%
    kubeReserved:
      cpu: 100m
      ephemeral-storage: 1Gi
      memory: 150Mi
    logFormat: json
    systemReserved:
      cpu: 100m
      ephemeral-storage: 1Gi
      memory: 120Mi
  kubernetesApiAccess:
  - 0.0.0.0/0
  - ::/0
  kubernetesVersion: 1.24.3
  masterInternalName: api.internal.test4.he0.io
  masterPublicName: api.test4.he0.io
  metricsServer:
    enabled: true
    insecure: true
  networkCIDR: 10.200.0.0/16
  networkID: vpc-REDACTED
  networking:
    amazonvpc:
      env:
      - name: AWS_VPC_K8S_CNI_EXTERNALSNAT
        value: "true"
      - name: AWS_VPC_K8S_CNI_LOG_FILE
        value: stdout
      - name: AWS_VPC_K8S_PLUGIN_LOG_FILE
        value: stderr
      - name: ENABLE_PREFIX_DELEGATION
        value: "true"
      - name: WARM_PREFIX_TARGET
        value: "1"
  nonMasqueradeCIDR: 10.195.0.0/16
  sshAccess:
  - 127.0.0.1/32
  sshKeyName: REDACTED
  subnets:
  - cidr: 10.195.32.0/19
    egress: nat-08de68609cdf6cc58
    name: ap-southeast-2a
    type: Private
    zone: ap-southeast-2a
  - cidr: 10.195.64.0/19
    egress: nat-00ef20ab2a9408b7c
    name: ap-southeast-2b
    type: Private
    zone: ap-southeast-2b
  - cidr: 10.195.96.0/19
    egress: nat-0032baadf115cfbde
    name: ap-southeast-2c
    type: Private
    zone: ap-southeast-2c
  - cidr: 10.195.0.0/22
    name: utility-ap-southeast-2a
    type: Utility
    zone: ap-southeast-2a
  - cidr: 10.195.4.0/22
    name: utility-ap-southeast-2b
    type: Utility
    zone: ap-southeast-2b
  - cidr: 10.195.8.0/22
    name: utility-ap-southeast-2c
    type: Utility
    zone: ap-southeast-2c
  sysctlParameters:
  - net.ipv4.tcp_keepalive_time=200
  topology:
    dns:
      type: Private
    masters: private
    nodes: private

8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else we need to know?

I've tried various manifests, including one that is significantly cut down, but any that use the cpuManagerPolicy: static directive end up in the same bad state.
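
For reference, the triggering part of the spec, cut down to a minimal sketch (values copied from the full manifest above; the static policy requires explicit CPU reservations, so those are kept):

spec:
  kubelet:
    cpuManagerPolicy: static
    # The static CPU manager policy needs reserved CPU carved out for system use.
    kubeReserved:
      cpu: 100m
      memory: 150Mi
    systemReserved:
      cpu: 100m
      memory: 120Mi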

jim-barber-he commented 2 years ago

The exact same cluster, after I've edited it to remove spec.kubelet.cpuManagerPolicy: static, updated the cluster, and rolled all the nodes, now looks like this when I exec into a pod:

$ kubectl exec -t -i aws-node-44zdp -- sh       
Defaulted container "aws-node" out of: aws-node, aws-vpc-cni-init (init)
sh-4.2# 
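
The edit/update/roll cycle described above corresponds roughly to the standard kops workflow (cluster name assumed to be in $CLUSTER_NAME):

# Remove spec.kubelet.cpuManagerPolicy in the editor, then apply the change and roll the nodes.
kops edit cluster --name "$CLUSTER_NAME"
kops update cluster --name "$CLUSTER_NAME" --yes
kops rolling-update cluster --name "$CLUSTER_NAME" --yes
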
jim-barber-he commented 2 years ago

I should also add that creating a Kubernetes 1.23.9 cluster with kops 1.24.0 ends up in the same bad state. But if I create the exact same Kubernetes 1.23.9 cluster using kops 1.23.2 then the cluster is healthy.

olemarkus commented 2 years ago

Managed to reproduce. This is also breaking:

ctr -n k8s.io task exec -t --exec-id sh_1 <container id> sh

There is a somewhat similar issue at https://github.com/containerd/containerd/issues/7219

JohnJAS commented 2 years ago

What's the runc version you are using?
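
One way to check this on a node (assuming shell access to the instance):

# Print the containerd and runc versions installed on the node.
containerd --version
runc --version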

olemarkus commented 2 years ago

VERSION:
   1.1.3
commit: v1.1.3-0-g6724737f
spec: 1.0.2-dev
go: go1.17.10
libseccomp: 2.5.4
JohnJAS commented 2 years ago

I think we've hit the same issue. You can try runc 1.1.2; it works on my cluster. But I haven't dug deeper into this compatibility problem, so I haven't updated the issue opened on the containerd side. Some change in 1.1.3 must have introduced this issue.

olemarkus commented 2 years ago

I can confirm that runc 1.1.2 works.

@hakman should we downgrade or wait for a fix?

hakman commented 2 years ago

Let's wait and add block-next to this issue. I don't think there is a plan for another release in the next 2 weeks.

olemarkus commented 2 years ago

Sounds good

nullzone commented 2 years ago

This issue is causing terrible damage in all our test environments and in some production ones that were upgraded recently.

Neither our developers nor the Ops team members can exec into any container in any pod. Also, some scheduled jobs are failing because they are not able to process the exec calls (backups, internal calls...).

Honestly, setting this as blocks-next instead of immediately releasing a fix is a terribly wrong decision. The impact of this is pretty big. We are not talking about an RC but a stable release affected by a big issue.

We do not have the option "cpuManagerPolicy: static" configured in our manifest. On the other hand, we have "cpuCFSQuota: false".

For now, I have applied the following patch, which needs to be configured on every instance group, to downgrade runc to 1.1.2:

$ kops edit ig --name=${KOPS_CLUSTER_NAME} nodes-test-2
...
spec:
  additionalUserData:
  - content: |
      #!/bin/sh
      echo "xdowngradecrun.sh: rolling back to runc 1.1.2"
      sudo /usr/bin/wget -q https://github.com/opencontainers/runc/releases/download/v1.1.2/runc.amd64 -O /usr/sbin/runc
      sudo chmod 755 /usr/sbin/runc
    name: xdowngradecrun.sh
    type: text/x-shellscript
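
Once a node built with this user data comes up, the rollback can be verified directly against the path the script overwrites (assuming shell access to the node):

/usr/sbin/runc --version    # should now report 1.1.2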

It is still being tested. I changed runc manually on a node yesterday and could exec into a container after restarting it. However, this morning I could not exec into the test container again. I am going to apply this change globally in the cluster to get the proxy containers running with the right runc version, and check if this fix holds for now until a new kops release gets this solved.

IMHO, a new kops 1.24 release that downgrades runc should be published immediately. Additionally, I would like to encourage you to add an 'exec' call as part of the tests done before accepting a new release to be promoted as stable.

Thank you for your guidance looking for the root issue, I will update this comment if this solution keeps stable for more than 24h.

olemarkus commented 2 years ago

It's only with certain configurations that kubectl exec fails. We do a number of exec tests as part of the kubernetes e2e test suite (e.g. https://testgrid.k8s.io/kops-versions#kops-aws-k8s-latest). We test a fairly huge amount of configurations, but testing all permutations is not possible. When non-standard configurations are used, we highly encourage testing the betas. As this issue made it into your production environments as well, I am sure you can appreciate how hard it is to catch such issues.

nullzone commented 2 years ago

I will always appreciate the hard work you all make!

That being said, I would still recommend a new release now that 2 cases have been discovered (cpuManagerPolicy: static and cpuCFSQuota: false).

olemarkus commented 2 years ago

In the 1.24 branch you now have the ability to configure the runc version. See https://kops.sigs.k8s.io/cluster_spec/#runc-version-and-packages
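
Based on the linked docs, pinning runc back to a known-good version in the cluster spec would look roughly like this (field names should be verified against that page and your kops build):

spec:
  containerd:
    runc:
      version: 1.1.2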

You can get the latest build from this location: $(curl https://storage.googleapis.com/k8s-staging-kops/kops/releases/markers/release-1.24/latest-ci.txt)/linux/amd64/kops. Change OS/arch as appropriate. Please test if you have the chance.
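
One way to fetch that build (Linux/amd64 shown; adjust OS/arch in the path as needed):

# Resolve the latest release-1.24 CI marker, then download and run the matching kops binary.
CI_BASE=$(curl -s https://storage.googleapis.com/k8s-staging-kops/kops/releases/markers/release-1.24/latest-ci.txt)
curl -Lo kops "${CI_BASE}/linux/amd64/kops"
chmod +x kops
./kops version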

Right now, we believe there will be another 1-2 weeks before we do a stable release.

jim-barber-he commented 2 years ago

runc version 1.1.4 was released a few hours ago. :excited: