kubernetes / kops

Kubernetes Operations (kOps) - Production Grade k8s Installation, Upgrades and Management
https://kops.sigs.k8s.io/
Apache License 2.0

Automated cherry pick of #16778: dns: don't use IMDS region resolver when it previously failed #16781

Closed rifelpet closed 2 months ago

rifelpet commented 2 months ago

Cherry pick of #16778 on release-1.30.

#16778: dns: don't use IMDS region resolver when it previously failed

For details on the cherry pick process, see the cherry pick requests page.

rifelpet commented 2 months ago

/hold waiting to confirm this fixes the cluster creation of these jobs:

https://testgrid.k8s.io/kops-misc#kops-aws-external-dns

https://testgrid.k8s.io/kops-misc#kops-aws-pod-identity-webhook

https://testgrid.k8s.io/kops-misc#kops-aws-apiserver-nodes

k8s-ci-robot commented 2 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hakman

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:
- ~~[OWNERS](https://github.com/kubernetes/kops/blob/release-1.30/OWNERS)~~ [hakman]

Approvers can indicate their approval by writing `/approve` in a comment
Approvers can cancel approval by writing `/approve cancel` in a comment

justinsb commented 2 months ago

For kops-aws-pod-identity-webhook, I think this is the error:

Error: error running tasks: deadline exceeded executing task ManagedFile/discovery.json. Example error: error creating ManagedFile ".well-known/openid-configuration": error writing s3://k8s-kops-ci-prow-state-store/e2e-49d63c55eb-ac683.tests-kops-aws.k8s.io/.well-known/openid-configuration (with ACL="public-read"): operation error S3: PutObject, https response error StatusCode: 400, RequestID: M0XPHJR0EZRXMPXD, HostID: d+e5r1Lc7v47eYOlLjCTJe8L/r5lmPJ6gvuY7nNY2pEU1wlEN+3LJayd33Os1KQFoddackAfxeE=, api error AccessControlListNotSupported: The bucket does not allow ACLs

So the S3 bucket has ACLs disabled. I think we try to detect that, so maybe the detection is going wrong.


For https://testgrid.k8s.io/kops-misc#kops-aws-apiserver-nodes, we're getting further, but the most recent test failed to connect to the cluster and then could not SSH to the VMs to dump more info. It almost looked like they didn't come up at all, which is ... odd. On the toolbox dump the instances do have state=running (but I can't easily tell for how long). We do create the cluster from a template here, so it's not unlikely that something else is going wrong here.

justinsb commented 2 months ago

(I have no objection to delaying the backport by the way, while we figure this out!)

rifelpet commented 2 months ago

@justinsb this is my main concern with merging this as-is. I think a follow-up is needed:

https://github.com/kubernetes/kops/pull/16778#issuecomment-2310012648

rifelpet commented 2 months ago

The remaining failures are unrelated, so this should be safe to merge.

/unhold