kubernetes / kops

Kubernetes Operations (kOps) - Production Grade k8s Installation, Upgrades and Management
https://kops.sigs.k8s.io/
Apache License 2.0

Instance IAM role doesn't have access to the KMS key used to encrypt S3 State Store #5532

Closed · tavisma closed this issue 5 years ago

tavisma commented 6 years ago

1. What kops version are you running? The command kops version will display this information.

Version 1.10.0-beta.1 (git-dc9154528)

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.

Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.6", GitCommit:"a21fdbd78dde8f5447f5f6c331f7eb6f80bd684e", GitTreeState:"clean", BuildDate:"2018-07-26T10:17:47Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using?

aws

4. What commands did you run? What is the simplest way to reproduce this issue?

aws s3api create-bucket \
  --region "us-west-2" \
  --create-bucket-configuration LocationConstraint="us-west-2" \
  --bucket "<REDACTED>" \
  --acl "private"

aws s3api put-bucket-versioning \
  --region "us-west-2" \
  --bucket "<REDACTED>" \
  --versioning-configuration Status=Enabled

aws s3api put-bucket-encryption \
  --region "us-west-2" \
  --bucket "<REDACTED>" \
  --server-side-encryption-configuration '{ "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms", "KMSMasterKeyID": "arn:aws:kms:<REDACTED>"}}]}'

5. What happened after the commands executed?

kops-configuration.service was unable to access the state files stored in the S3 bucket:

systemctl status kops-configuration.service

● kops-configuration.service - Run kops bootstrap (nodeup)
   Loaded: loaded (/etc/systemd/system/kops-configuration.service; disabled; vendor preset: disabled)
   Active: activating (start) since Fri 2018-07-27 00:07:33 UTC; 3min 49s ago
     Docs: https://github.com/kubernetes/kops
 Main PID: 881 (nodeup)
    Tasks: 6 (limit: 32767)
   Memory: 274.8M
   CGroup: /system.slice/kops-configuration.service
           └─881 /var/cache/kubernetes-install/nodeup --conf=/var/cache/kubernetes-install/kube_env.yaml --v=8

Jul 27 00:11:12 ip-10-65-129-161.ec2.internal nodeup[881]: I0727 00:11:12.540408     881 assetstore.go:313] added asset "ptp" for &{"/var/cache/nodeup/extracted/sha1:REDACTEDhtt>
Jul 27 00:11:12 ip-10-65-129-161.ec2.internal nodeup[881]: I0727 00:11:12.540429     881 assetstore.go:313] added asset "sample" for &{"/var/cache/nodeup/extracted/sha1:REDACTED>
Jul 27 00:11:12 ip-10-65-129-161.ec2.internal nodeup[881]: I0727 00:11:12.540448     881 assetstore.go:313] added asset "tuning" for &{"/var/cache/nodeup/extracted/sha1:REDACTED_>
Jul 27 00:11:12 ip-10-65-129-161.ec2.internal nodeup[881]: I0727 00:11:12.540468     881 assetstore.go:313] added asset "vlan" for &{"/var/cache/nodeup/extracted/sha1:REDACTED_ht>
Jul 27 00:11:12 ip-10-65-129-161.ec2.internal nodeup[881]: I0727 00:11:12.541551     881 files.go:100] Hash matched for "/var/cache/nodeup/sha1:REDACTED_https___kubeupv2_s3_amazo>
Jul 27 00:11:12 ip-10-65-129-161.ec2.internal nodeup[881]: I0727 00:11:12.541573     881 assetstore.go:203] added asset "utils.tar.gz" for &{"/var/cache/nodeup/sha1:REDACTED>
Jul 27 00:11:12 ip-10-65-129-161.ec2.internal nodeup[881]: I0727 00:11:12.541663     881 assetstore.go:313] added asset "socat" for &{"/var/cache/nodeup/extracted/sha1:REDACTED>
Jul 27 00:11:12 ip-10-65-129-161.ec2.internal nodeup[881]: I0727 00:11:12.541694     881 s3fs.go:216] Reading file "s3:///cluster.spec"
Jul 27 00:11:12 ip-10-65-129-161.ec2.internal nodeup[881]: W0727 00:11:12.961693     881 main.go:142] got error running nodeup (will retry in 30s): error loading Cluster "
Jul 27 00:11:12 ip-10-65-129-161.ec2.internal nodeup[881]: status code: 403, request id:

Manually granting the IAM roles created by kops access to the KMS key used to encrypt the S3 bucket allows kops-configuration.service to start and the cluster to boot.
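For anyone scripting that manual step, a minimal sketch follows. It assumes the default role names kops creates (masters.<cluster-name> and nodes.<cluster-name>); the cluster name, key ARN, and inline policy name are placeholders to substitute:

# Placeholders -- substitute your cluster name and the key ARN used in put-bucket-encryption.
CLUSTER=my.example.com

cat > kms-access.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["kms:Decrypt", "kms:DescribeKey", "kms:GenerateDataKey*"],
      "Resource": ["arn:aws:kms:us-west-2:111122223333:key/EXAMPLE-KEY-ID"]
    }
  ]
}
EOF

# nodeup runs on both masters and nodes and reads SSE-KMS objects from the
# state store, so both instance roles need permission to use the key.
aws iam put-role-policy --role-name "masters.${CLUSTER}" \
  --policy-name kops-state-store-kms --policy-document file://kms-access.json
aws iam put-role-policy --role-name "nodes.${CLUSTER}" \
  --policy-name kops-state-store-kms --policy-document file://kms-access.json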

6. What did you expect to happen?

It seems that when encryption is used on the S3 bucket used for KOPS_STATE_STORE, the nodes are not given access to the encryption key used for the bucket. I didn't encounter any problem with kops-1.9.1. The cluster was created using '--target=terraform'.

7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.

8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else do we need to know?

tavisma commented 6 years ago

I made a much simpler cluster without all our custom bits and was able to reproduce this problem again:

aws s3api create-bucket --region "us-west-2" --create-bucket-configuration LocationConstraint="us-west-2" --bucket "<REDACTED>" --acl "private"

aws s3api put-bucket-versioning --region "us-west-2" --bucket "<REDACTED>" --versioning-configuration Status=Enabled

aws s3api put-bucket-encryption --region "us-west-2" --bucket "<REDACTED>" --server-side-encryption-configuration '{ "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms", "KMSMasterKeyID": "<REDACTED>"}}]}'

export NAME=<REDACTED>
export KOPS_STATE_STORE=s3://<REDACTED>

kops create cluster \
  --cloud=aws \
  --cloud-labels='<REDACTED>' \
  --channel=alpha \
  --kubernetes-version=1.10.6 \
  --node-count=1 \
  --zones=${NODE_AZS} \
  --dns-zone=<REDACTED> \
  --node-size=m5.xlarge \
  --master-size=m5.large \
  --master-count=1 \
  --networking=weave \
  --topology=private \
  --authorization=RBAC \
  --associate-public-ip=false \
  --admin-access=${BASTION_TRUSTED_IPS} \
  --ssh-access=${INTERNAL_TRUSTED_IPS} \
  --api-loadbalancer-type=internal \
  --master-volume-size=128 \
  --master-security-groups=${BASTION_SECURITY_GROUP} \
  --node-volume-size=128 \
  --node-security-groups=${BASTION_SECURITY_GROUP} \
  --encrypt-etcd-storage \
  --image=595879546273/CoreOS-stable-1800.4.0-hvm \
  --vpc=$VPC_ID \
  --name=${NAME} \
  --network-cidr=${NETWORK_CIDR} \
  --subnets=${PRIVATE_SUBNETS} \
  --utility-subnets=${PUBLIC_SUBNETS} \
  --dry-run -oyaml > cluster.yaml

kops create -f cluster.yaml
kops update cluster ${NAME} --target=terraform --out=. --yes
terraform apply

After all this, SSHing into a master and running systemctl status kops-configuration.service shows that it was unable to download cluster.spec from the S3 state bucket because it had no access to the KMS key (adding access to the key manually allows everything to start up properly).
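A quick way to confirm that it is the KMS permission (rather than the bucket policy) is to read the state store directly from the affected instance; the bucket and cluster names below are placeholders:

# Listing only needs s3:ListBucket, so it succeeds either way; fetching an
# SSE-KMS encrypted object returns the 403 until the role can use the key.
aws s3 ls s3://my-state-store-bucket/my.example.com/
aws s3 cp s3://my-state-store-bucket/my.example.com/cluster.spec -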

lukyanetsv commented 6 years ago

Hi, I had the same issue. It was fixed for me by modifying the masters policy:

{
  "Sid": "kopsK8sKMSEncrypted",
  "Effect": "Allow",
  "Action": [
    "kms:CreateGrant",
    "kms:Decrypt",
    "kms:DescribeKey",
    "kms:Encrypt",
    "kms:GenerateDataKey*",
    "kms:ReEncrypt*"
  ],
  "Resource": [
    "arn:aws:kms:eu-central-1:XXXXXX:key/f75fbbe1-YYY-YYYY-YYYY-ZZZZZZZZ"
  ]
},
tavisma commented 6 years ago

Nodes need the policy update too

waldher commented 6 years ago

For those looking for a quick fix to this issue, using @lukyanetsv's policy in your cluster configuration as follows will work. Ensure that you update the ARN for your KMS key:

spec:
  additionalPolicies:
    master: |
      [
        {
          "Sid": "kopsK8sKMSEncrypted",
          "Effect": "Allow",
          "Action": [
            "kms:CreateGrant",
            "kms:Decrypt",
            "kms:DescribeKey",
            "kms:Encrypt",
            "kms:GenerateDataKey*",
            "kms:ReEncrypt*"
          ],
          "Resource": [
            "arn:aws:kms:us-east-1:123456789012:key/ee174004-c3b2-4123-9a80-c82f3c70df9d"
          ]
        }
      ]
    node: |
      [
        {
          "Sid": "kopsK8sKMSEncrypted",
          "Effect": "Allow",
          "Action": [
            "kms:CreateGrant",
            "kms:Decrypt",
            "kms:DescribeKey",
            "kms:Encrypt",
            "kms:GenerateDataKey*",
            "kms:ReEncrypt*"
          ],
          "Resource": [
            "arn:aws:kms:us-east-1:123456789012:key/ee174004-c3b2-4123-9a80-c82f3c70df9d"
          ]
        }
      ]
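If you go this route, the spec change still has to be pushed out. Roughly the following, using the ${NAME} exported earlier (with the terraform target you would re-run terraform apply afterwards):

kops edit cluster ${NAME}          # paste the additionalPolicies block into the spec
kops update cluster ${NAME} --yes  # attaches the extra inline policy to the existing IAM roles
# or, for the terraform target:
#   kops update cluster ${NAME} --target=terraform --out=. && terraform apply

Because the statement is added to the existing instance roles, running nodes should pick it up without a rolling update.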
grggls commented 6 years ago

thanks @lukyanetsv and @waldher -- your fix worked perfectly.

+1 for this issue

ggulati2 commented 6 years ago

Will there be a permanent fix for it?

jdn-za commented 6 years ago

Hit by this as well

fejta-bot commented 5 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

gregkoganvmm commented 5 years ago

/remove-lifecycle stale

Just wanted to make sure this issue is still on the radar. Is there a way to avoid this going forward? The workaround definitely works (thank you @waldher & @lukyanetsv), but seems a bit clunky.

fejta-bot commented 5 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

waldher commented 5 years ago

/remove-lifecycle stale

This issue is still occurring.

fejta-bot commented 5 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot commented 5 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot commented 5 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

k8s-ci-robot commented 5 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes/kops/issues/5532#issuecomment-540496300):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> Send feedback to sig-testing, kubernetes/test-infra and/or [fejta](https://github.com/fejta).
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
denibertovic commented 4 years ago

I just hit this issue. It doesn't seem like this should be closed (even though there is a workaround). It would be great if there was a --kms-key-arn or similar flag that would create the above workaround in the cluster spec for the user.
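Until a flag like that exists, the flip side of the same workaround is to grant access on the KMS side, in the key policy, so the cluster spec stays untouched. A sketch of the statement to merge into the key's existing policy (account ID and cluster name in the role ARNs are placeholders; merge it into the policy rather than replacing the whole key policy with it):

{
  "Sid": "AllowKopsInstanceRolesToUseThisKey",
  "Effect": "Allow",
  "Principal": {
    "AWS": [
      "arn:aws:iam::111122223333:role/masters.my.example.com",
      "arn:aws:iam::111122223333:role/nodes.my.example.com"
    ]
  },
  "Action": [
    "kms:Decrypt",
    "kms:DescribeKey",
    "kms:GenerateDataKey*"
  ],
  "Resource": "*"
}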

kilpatty commented 2 years ago

We are also encountering this - I will try to submit a PR this week