kubernetes / kops

Kubernetes Operations (kOps) - Production Grade k8s Installation, Upgrades and Management
https://kops.sigs.k8s.io/
Apache License 2.0

AWS sts:AssumeRole stopped working with role/OrganizationAccountAccessRole in 1.30.x #16849

Open vitaliyf opened 23 hours ago

vitaliyf commented 23 hours ago

/kind bug

1. What kops version are you running? The command kops version will display this information.

Testing upgrade from Client version: 1.29.2 (git-v1.29.2) to Client version: 1.30.1 (git-v1.30.1)

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.

v1.29.9

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

kops_v1.30.1 update cluster, with no other changes to the manifest or environment; the only difference is the newer kops binary.

5. What happened after the commands executed?

$ export AWS_PROFILE=company-name-dev3
$ kops_v1.30.1 update cluster

SDK 2024/09/20 14:31:06 DEBUG request failed with unretryable error https response error StatusCode: 403, RequestID: 623bd87e-11e1-4b06-9f16-10f60ba2f030, api error AccessDenied: User: arn:aws:sts::[redacted]006:assumed-role/OrganizationAccountAccessRole/aws-go-sdk-1726842666098977639 is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::[redacted]006:role/OrganizationAccountAccessRole
Error: error determining default DNS zone: error querying zones: error listing hosted zones: operation error Route 53: ListHostedZones, get identity: get credentials: failed to refresh cached credentials, operation error STS: AssumeRole, https response error StatusCode: 403, RequestID: 623bd87e-11e1-4b06-9f16-10f60ba2f030, api error AccessDenied: User: arn:aws:sts::[redacted]006:assumed-role/OrganizationAccountAccessRole/aws-go-sdk-1726842666098977639 is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::[redacted]006:role/OrganizationAccountAccessRole

6. What did you expect to happen?

With kops-1.29.2 the output shows proposed changes that need to be applied with --yes

AWS CLI is able to successfully get Route53 zones from the same shell:

$ aws route53 list-hosted-zones
{
    "HostedZones": [
        {
            "Id": "/hostedzone/Z0[redacted]",
            "Name": "k8s.dev3.us-west-2.example.com.",
            "CallerReference": "8e483d8f-0d3c-4bcc-9c68-ecb4dea807ae",
            "Config": {
                "PrivateZone": false
            },
            "ResourceRecordSetCount": 8
        }
    ]
}

7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.

8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.

https://gist.github.com/vitaliyf/cfddd9ad771ee613ee850bb9e2d3fe14

9. Anything else we need to know?

$ cat ~/.aws/config
[default]
region = us-west-2

[profile company-name]
aws_account_id = company-name
region = us-west-2
output = json
color = ff0000

[profile company-name-dev1]
role_arn = arn:aws:iam::[redacted]385:role/OrganizationAccountAccessRole
source_profile = company-name

[profile company-name-dev2]
role_arn = arn:aws:iam::[redacted]813:role/OrganizationAccountAccessRole
source_profile = company-name
color = 00ff00

[profile company-name-dev3]
role_arn = arn:aws:iam::[redacted]006:role/OrganizationAccountAccessRole
source_profile = company-name
color = 0000ff
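
For reference, the intended resolution of this config is a single hop: credentials from company-name are used to assume the role named in company-name-dev3. A minimal Python sketch of that lookup, using a trimmed copy of the config above (profile_chain is a hypothetical helper, not an AWS API):

```python
import configparser

# Trimmed copy of the config above; the account ID stays redacted as posted.
CONFIG_TEXT = """
[default]
region = us-west-2

[profile company-name]
region = us-west-2

[profile company-name-dev3]
role_arn = arn:aws:iam::[redacted]006:role/OrganizationAccountAccessRole
source_profile = company-name
"""


def profile_chain(config_text, profile):
    """Follow source_profile links and return the profiles visited, in order."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    chain = []
    # Stop when there is no source_profile left or a profile repeats (a loop).
    while profile is not None and profile not in chain:
        chain.append(profile)
        section = "default" if profile == "default" else f"profile {profile}"
        if not parser.has_section(section):
            break
        profile = parser.get(section, "source_profile", fallback=None)
    return chain


print(profile_chain(CONFIG_TEXT, "company-name-dev3"))
```

The chain has exactly two entries, so only one AssumeRole call should be needed; the error above shows a second one being attempted from the already-assumed session.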

This cluster has been upgraded in place, one kops/Kubernetes version at a time, for at least a couple of years, so testing and running such upgrades is routine for us.

I looked around and suspect this is related to the aws-sdk-go-v2 upgrade.

For example, aws-sdk-go-v2 has this issue: https://github.com/aws/aws-sdk-go-v2/issues/2686. Coincidentally or not, that ticket is referenced by https://github.com/cert-manager/cert-manager/pull/7236, where they are also dealing with a "Missing Region" error, just like https://github.com/kubernetes/kops/issues/16645 in kops 1.30.0.

vitaliyf commented 23 hours ago

Workaround: use awsudo or one of the other approaches from https://kops.sigs.k8s.io/mfa/#the-workaround-2

$ awsudo company-name-dev3 kops_v1.30.1 update cluster

...
                            + NODEUP_URL_AMD64=https://artifacts.k8s.io/binaries/kops/1.30.1/linux/amd64/nodeup,https://github.com/kubernetes/kops/releases/download/v1.30.1/nodeup-linux-amd64
                            - NODEUP_URL_AMD64=https://artifacts.k8s.io/binaries/kops/1.29.2/linux/amd64/nodeup,https://github.com/kubernetes/kops/releases/download/v1.29.2/nodeup-linux-amd64
...more as-expected output..

Must specify --yes to apply changes
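
For anyone without awsudo, the same idea can be sketched in Python: assume the role once with the AWS CLI, then hand kops static temporary credentials through environment variables so the SDK has no profile to resolve. This is an approximation of the approach, not how awsudo itself is implemented; the function names and the session name are made up for illustration.

```python
import json
import os
import subprocess


def env_with_credentials(credentials, base_env):
    """Return a copy of base_env carrying static temporary credentials.

    `credentials` is the "Credentials" object printed by `aws sts assume-role`.
    """
    env = dict(base_env)
    # Drop profile selection so the child process does not try to assume
    # the role again from the profile configuration.
    env.pop("AWS_PROFILE", None)
    env.pop("AWS_DEFAULT_PROFILE", None)
    env["AWS_ACCESS_KEY_ID"] = credentials["AccessKeyId"]
    env["AWS_SECRET_ACCESS_KEY"] = credentials["SecretAccessKey"]
    env["AWS_SESSION_TOKEN"] = credentials["SessionToken"]
    return env


def run_with_role(role_arn, source_profile, command):
    """Assume role_arn via the AWS CLI, then run command with those credentials."""
    out = subprocess.run(
        [
            "aws", "sts", "assume-role",
            "--role-arn", role_arn,
            "--role-session-name", "kops-workaround",  # hypothetical session name
            "--profile", source_profile,
            "--output", "json",
        ],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    env = env_with_credentials(json.loads(out)["Credentials"], os.environ)
    return subprocess.run(command, env=env).returncode


# Example call (needs real AWS access, so it is left commented out):
# run_with_role(
#     "arn:aws:iam::<account-id>:role/OrganizationAccountAccessRole",
#     "company-name",
#     ["kops_v1.30.1", "update", "cluster"],
# )
```

The temporary credentials expire, so a long-running command may need the wrapper to be rerun.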