kubernetes / kops

Kubernetes Operations (kOps) - Production Grade k8s Installation, Upgrades and Management
https://kops.sigs.k8s.io/
Apache License 2.0

"kops update cluster" panics while creating JWKS for OIDC #14174

Closed seh closed 2 years ago

seh commented 2 years ago

1. What kops version are you running?

Client version: 1.24.1 (git-v1.24.1)

2. What Kubernetes version are you running?

Starting with version 1.19.9, upgrading to version 1.21.14.

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

kops replace --filename=cluster.yaml
kops update cluster --yes

5. What happened after the commands executed?

kops update cluster panics while preparing to publish the OIDC Discovery documents to an S3 bucket:

W0824 08:36:09.215431   12986 external_access.go:39] KubernetesAPIAccess is empty
I0824 08:36:10.848893   12986 executor.go:111] Tasks: 0 done / 393 total; 110 can run
I0824 08:36:11.805410   12986 executor.go:111] Tasks: 110 done / 393 total; 83 can run
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x108 pc=0x3ec3a7c]

goroutine 1531 [running]:
k8s.io/kops/pkg/model.(*OIDCKeys).Open(0xc0017a32a0?)
        k8s.io/kops/pkg/model/issuerdiscovery.go:134 +0x21c
k8s.io/kops/upup/pkg/fi.CopyResource({0x5c5c140, 0xc00104de60}, {0x5c61d40?, 0xc00108c7c8?})
        k8s.io/kops/upup/pkg/fi/resources.go:85 +0x72
k8s.io/kops/upup/pkg/fi.ResourceAsBytes({0x5c61d40, 0xc00108c7c8})
        k8s.io/kops/upup/pkg/fi/resources.go:112 +0x4c
k8s.io/kops/upup/pkg/fi/fitasks.(*ManagedFile).Render(0x5?, 0x0?, 0xc00168b740?, 0xc000ee9240, 0x2?)
        k8s.io/kops/upup/pkg/fi/fitasks/managedfile.go:154 +0x70
reflect.Value.call({0x4f99d60?, 0xc000ee9240?, 0x4?}, {0x537191a, 0x4}, {0xc000c28c60, 0x4, 0x5c91ab0?})
        reflect/value.go:556 +0x845
reflect.Value.Call({0x4f99d60?, 0xc000ee9240?, 0x53a8fcc?}, {0xc000c28c60, 0x4, 0x4})
        reflect/value.go:339 +0xbf
k8s.io/kops/upup/pkg/fi.(*Context).Render(0xc0011ec0a0, {0x5c633a0?, 0x0}, {0x5c633a0?, 0xc000ee9240}, {0x5c633a0?, 0xc00171f440})
        k8s.io/kops/upup/pkg/fi/context.go:225 +0xf2e
k8s.io/kops/upup/pkg/fi.DefaultDeltaRunMethod({0x5c633a0?, 0xc000ee9240}, 0xc0011ec0a0)
        k8s.io/kops/upup/pkg/fi/default_methods.go:82 +0x46c
k8s.io/kops/upup/pkg/fi/fitasks.(*ManagedFile).Run(0xc0008a2c18?, 0x0?)
        k8s.io/kops/upup/pkg/fi/fitasks/managedfile.go:109 +0x26
k8s.io/kops/upup/pkg/fi.(*executor).forkJoin.func1(0xc00146d8f0, 0x4)
        k8s.io/kops/upup/pkg/fi/executor.go:187 +0x1ea
created by k8s.io/kops/upup/pkg/fi.(*executor).forkJoin
        k8s.io/kops/upup/pkg/fi/executor.go:183 +0x86

Per the stack trace above, it appears to be failing at line 134 of pkg/model/issuerdiscovery.go, in (*OIDCKeys).Open, while trying to access a public key in memory.

6. What did you expect to happen?

kops update cluster would publish all the OIDC Discovery documents to S3, and continue on with the rest of its tasks.

7. Please provide your cluster manifest.

cluster.yaml file:

```yaml
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  name: my-cluster.example.com
spec:
  additionalSans:
  - api.internal.my-cluster.example.com
  api:
    loadBalancer:
      additionalSecurityGroups:
      - sg-005e2b9c6ffed8582
      class: Network
      crossZoneLoadBalancing: true
      type: Public
  authorization:
    rbac: {}
  certManager:
    enabled: true
    managed: false
  cloudConfig:
    disableSecurityGroupIngress: true
  cloudProvider: aws
  clusterAutoscaler:
    balanceSimilarNodeGroups: true
    enabled: true
  configBase: s3://my-kops-state/my-cluster.example.com
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-us-east-2a
      name: a
    - instanceGroup: master-us-east-2b
      name: b
    - instanceGroup: master-us-east-2c
      name: c
    manager:
      env:
      - name: ETCD_LISTEN_METRICS_URLS
        value: http://0.0.0.0:8081
      - name: ETCD_METRICS
        value: extensive
    name: main
  - etcdMembers:
    - instanceGroup: master-us-east-2a
      name: a
    - instanceGroup: master-us-east-2b
      name: b
    - instanceGroup: master-us-east-2c
      name: c
    manager:
      env:
      - name: ETCD_LISTEN_METRICS_URLS
        value: http://0.0.0.0:8082
      - name: ETCD_METRICS
        value: basic
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    featureGates:
      EphemeralContainers: "true"
  kubeDNS:
    provider: KubeDNS
  kubeProxy:
    enabled: false
  kubelet:
    anonymousAuth: false
    featureGates:
      EphemeralContainers: "true"
    kubeReserved:
      cpu: 750m
      memory: .75Gi
  kubernetesVersion: 1.21.14
  metricsServer:
    enabled: true
  networkCIDR: 10.3.0.0/16
  networkID: vpc-087cd3eb3bf613986
  networking:
    calico:
      bpfEnabled: true
      crossSubnet: true
      encapsulationMode: vxlan
      typhaReplicas: 3
  nonMasqueradeCIDR: 100.64.0.0/10
  serviceAccountIssuerDiscovery:
    discoveryStore: s3://my-kops-oidc-discovery/my-cluster
    enableAWSOIDCProvider: true
  sshAccess:
  - 184.74.210.37/32
  - 184.74.210.38/32
  - 207.141.66.101/32
  - 207.141.66.99/32
  - 212.187.232.28/32
  - 212.187.232.29/32
  - 4.53.131.109/32
  - 4.53.131.110/32
  - 4.71.99.125/32
  - 4.71.99.126/32
  subnets:
  - cidr: 10.3.100.0/22
    id: subnet-0cd20dfb64345dede
    name: utility-us-east-2a
    type: Utility
    zone: us-east-2a
  - cidr: 10.3.104.0/22
    id: subnet-0657e2c2163960a79
    name: utility-us-east-2b
    type: Utility
    zone: us-east-2b
  - cidr: 10.3.108.0/22
    id: subnet-013e44ade2633a1b1
    name: utility-us-east-2c
    type: Utility
    zone: us-east-2c
  - cidr: 10.3.0.0/22
    egress: nat-06a85bf97c4a5b65d
    id: subnet-0ca2f5a3ab50e538e
    name: us-east-2a
    type: Private
    zone: us-east-2a
  - cidr: 10.3.4.0/22
    egress: nat-054d637847b63ea36
    id: subnet-047a72902591ebe60
    name: us-east-2b
    type: Private
    zone: us-east-2b
  - cidr: 10.3.8.0/22
    egress: nat-0df765ca07bb44f0f
    id: subnet-051d2325bcab67fa6
    name: us-east-2c
    type: Private
    zone: us-east-2c
  topology:
    dns:
      type: Public
    masters: private
    nodes: private
  updatePolicy: external
```

8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.

Here is the kops update cluster output at verbosity level ten, just before the failure:

I0824 08:54:41.457895   13698 executor.go:186] Executing task "MirrorSecrets/mirror-secrets": *fitasks.MirrorSecrets {"Name":"mirror-secrets","Lifecycle":"Sync","MirrorPath":{}}
I0824 08:54:41.461003   13698 request_logger.go:45] AWS request: ec2/DescribeSecurityGroups
I0824 08:54:41.461652   13698 request_logger.go:45] AWS request: iam/GetInstanceProfile
I0824 08:54:41.462143   13698 request_logger.go:45] AWS request: ec2/DescribeSubnets
I0824 08:54:41.462148   13698 request_logger.go:45] AWS request: iam/GetInstanceProfile
I0824 08:54:41.463820   13698 request_logger.go:45] AWS request: iam/ListAttachedRolePolicies
I0824 08:54:41.472058   13698 request_logger.go:45] AWS request: iam/GetRolePolicy
I0824 08:54:41.472136   13698 request_logger.go:45] AWS request: ec2/DescribeSubnets
I0824 08:54:41.472541   13698 s3fs.go:329] Reading file "s3://my-kops-oidc-discovery/my-cluster/openid/v1/jwks"
I0824 08:54:41.472820   13698 request_logger.go:45] AWS request: iam/GetRolePolicy
I0824 08:54:41.473050   13698 request_logger.go:45] AWS request: ec2/DescribeInternetGateways
I0824 08:54:41.473944   13698 request_logger.go:45] AWS request: ec2/DescribeSubnets
I0824 08:54:41.474061   13698 request_logger.go:45] AWS request: ec2/DescribeSecurityGroups
I0824 08:54:41.473806   13698 request_logger.go:45] AWS request: iam/ListAttachedRolePolicies
I0824 08:54:41.473914   13698 request_logger.go:45] AWS request: iam/GetRolePolicy
I0824 08:54:41.510755   13698 request_logger.go:45] AWS request: elasticloadbalancing/DescribeTargetGroups
panic: runtime error: invalid memory address or nil pointer dereference

Earlier, I see this pertinent log message:

I0824 08:54:41.459055   13698 executor.go:186] Executing task "ManagedFile/keys.json": *fitasks.ManagedFile {"Name":"keys.json","Lifecycle":"Sync","Base":"s3://my-kops-oidc-discovery/my-cluster","Location":"openid/v1/jwks","Contents":{"SigningKey":{"Name":"service-account","alternateNames":null,"Lifecycle":"Sync","Signer":null,"subject":"cn=service-account","issuer":"","type":"ca","oldFormat":false}},"Public":true}

Note that at present, the aforementioned S3 bucket exists, but there is no existing object with the path my-cluster/openid/v1/jwks.

9. Anything else we need to know?

I have been able to upgrade clusters and activate the "spec.serviceAccountIssuerDiscovery.enableAWSOIDCProvider" field's behavior successfully with earlier versions of kOps, which wrote the S3 object as necessary. This version of kOps appears to be failing before it can create this S3 object. kOps was able to create the my-cluster/.well-known/openid-configuration object in the same bucket.

See #13353 for what looks to be an earlier report of a similar defect.

See the prior discussion in the "kops-users" channel of the "Kubernetes" Slack workspace.

/kind bug

seh commented 2 years ago

It turns out that the KeysetItem.Certificate field is nil in all but the last two items in my key set. I added some output to (*OIDCKeys).Open. It reports the following:

Number of keys in key set:  7
Key set item "6702426753028327577194087677": &{6702426753028327577194087677 <nil> <nil> 0xc001200b50}
  (ID: "6702426753028327577194087677", distrust timestamp <nil>, certificate: <nil>, private key: &{0xc000fd92c0})
Key set item "6717351783746805535929340772": &{6717351783746805535929340772 <nil> <nil> 0xc001200b90}
  (ID: "6717351783746805535929340772", distrust timestamp <nil>, certificate: <nil>, private key: &{0xc000fd9440})
Key set item "6724755564554290971271764485": &{6724755564554290971271764485 <nil> <nil> 0xc001200bd0}
  (ID: "6724755564554290971271764485", distrust timestamp <nil>, certificate: <nil>, private key: &{0xc000fd9500})
Key set item "6725145319226802661715703465": &{6725145319226802661715703465 <nil> <nil> 0xc001200c10}
  (ID: "6725145319226802661715703465", distrust timestamp <nil>, certificate: <nil>, private key: &{0xc000fd95c0})
Key set item "6727272810098431180443208693": &{6727272810098431180443208693 <nil> <nil> 0xc001200c50}
  (ID: "6727272810098431180443208693", distrust timestamp <nil>, certificate: <nil>, private key: &{0xc001508180})
Key set item "6727329898571771312485446625": &{6727329898571771312485446625 <nil> 0xc000afc000 0xc001200d00}
  (ID: "6727329898571771312485446625", distrust timestamp <nil>, certificate: &{CN=kubernetes-master false 0xc00206c580 0xc001200cb0}, private key: &{0xc0015084e0})
Key set item "6906097667750333366645304518": &{6906097667750333366645304518 <nil> 0xc000afc120 0xc001200e90}
  (ID: "6906097667750333366645304518", distrust timestamp <nil>, certificate: &{CN=service-account true 0xc00206cb00 0xc001200e20}, private key: &{0xc001508660})

seh commented 2 years ago

If I add the following guard condition to (*OIDCKeys).Open, it looks like it will filter the key set items down to just those that contain a certificate for the common name "service-account":

        if item.Certificate == nil || item.Certificate.Subject.CommonName != "service-account" {
            continue
        }

Does that preserve all the items that this method was expecting to consume?

seh commented 2 years ago

Note that the kops get keypairs subcommand fails in the same way, because it also assumes that every key set item contains an X.509 certificate.

% kops get keypairs
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x100 pc=0x42a9788]

goroutine 1 [running]:
main.listKeypairs({0x5c80b38?, 0xc00066d620?}, {0x7dfba58, 0x0, 0x25?}, 0x0)
        k8s.io/kops/cmd/kops/get_keypairs.go:127 +0x2e8
main.RunGetKeypairs({0x5c7d260, 0xc0000520e8}, {0x5c61a80?, 0xc000c09080?}, {0x5c639c0?, 0xc00000e018?}, 0xc0008b6270)
        k8s.io/kops/cmd/kops/get_keypairs.go:174 +0xf8
main.NewCmdGetKeypairs.func3(0xc000e65680?, {0x7dfba58?, 0x0?, 0x0?})
        k8s.io/kops/cmd/kops/get_keypairs.go:78 +0x3e
github.com/spf13/cobra.(*Command).execute(0xc000e65680, {0x7dfba58, 0x0, 0x0})
        github.com/spf13/cobra@v1.5.0/command.go:872 +0x694
github.com/spf13/cobra.(*Command).ExecuteC(0x7da7c00)
        github.com/spf13/cobra@v1.5.0/command.go:990 +0x3b4
github.com/spf13/cobra.(*Command).Execute(...)
        github.com/spf13/cobra@v1.5.0/command.go:918
main.Execute()
        k8s.io/kops/cmd/kops/root.go:95 +0x5c
main.main()
        k8s.io/kops/cmd/kops/main.go:20 +0x17

hakman commented 2 years ago

@seh Would you like to continue to iterate on the fix?

seh commented 2 years ago

> Would you like to continue to iterate on the fix?

Yes, though it would help to hear whether or not these entries that lack certificates are valid. Can kOps use them for anything? Should I ignore them as if they were distrusted?

olemarkus commented 2 years ago

Ignore them as distrusted, but list them and make them deletable, I would say.

johngmyers commented 2 years ago

I guess I didn't research far back enough in the history of the keystore code.

kOps can't use a private key without a certificate for anything unless/until it generates a corresponding certificate. (Though for service-account keypairs the only part of the certificate it uses is the public key.)

These days all code paths that create a key also create a corresponding certificate. I would agree that keys without certificates should be ignored as if distrusted.

olemarkus commented 2 years ago

As this is not a regression or something that breaks things for a lot of users, I removed the blocks-next label.