nebari-dev / nebari

🪴 Nebari - your open source data science platform
https://nebari.dev
BSD 3-Clause "New" or "Revised" License

Add config option to enable the encryption of AWS EKS secrets #2788

Closed joneszc closed 2 weeks ago

joneszc commented 1 month ago

Reference Issues or PRs

Fixes #2681. Fixes #2746. Supersedes #2723 and #2752 (both had failing Pytest runs).

What does this implement/fix?


Testing

How to test this PR?

Any other comments?

Allows the user to enable EKS encryption of secrets by specifying a KMS key ARN in nebari-config.yaml:

amazon_web_services:
  eks_kms_arn: 'arn:aws:kms:us-east-1:010101010:key/3xxxxxxx-xxxxx-xxxxx-xxxxx'

The KMS key must meet the following conditions: it must be a symmetric encryption key, it must be enabled, and it must reside in the same AWS region as the cluster.
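As a rough illustration of the kind of shape check such a config field could get, here is a standalone sketch. The helper name and regex are assumptions for illustration, not Nebari's actual validation code:

```python
import re

# Hypothetical helper (not Nebari code): loose shape check for a KMS key ARN
# of the form arn:aws:kms:<region>:<account-id>:key/<key-id>.
# Real account IDs are 12 digits, but \d+ is used here because sanitized
# examples (like the one above) may be shorter.
KMS_KEY_ARN_RE = re.compile(
    r"^arn:aws:kms:"
    r"[a-z0-9-]+:"          # region, e.g. us-east-1
    r"\d+:"                 # account ID
    r"key/[A-Za-z0-9-]+$"   # key ID (must be a key, not an alias)
)


def is_valid_kms_key_arn(arn: str) -> bool:
    """Return True if `arn` looks like a KMS key ARN. This checks format
    only; it does not verify the key exists, is enabled, or is symmetric."""
    return KMS_KEY_ARN_RE.match(arn) is not None
```

A real deployment check would additionally call `kms:DescribeKey` (e.g. via boto3) to confirm the key is enabled and is a symmetric encryption key.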

viniciusdc commented 1 month ago

@joneszc, there are two PRs which seem to add the same thing, this one and #2752 -- I assume the first one was the original; can you close this one? (or move any relevant changes back to the other PR?)

dcmcand commented 1 month ago

@joneszc can we close #2752 and #2723 since we have this one?

joneszc commented 1 month ago

@joneszc can we close #2752 and #2723 since we have this one?

@dcmcand @viniciusdc Yes, those two PRs were built on forks of the old develop branch (now main). Thanks for helping determine that the branch was not the issue causing the Pytest failures. #2752 and #2723 can be closed.

joneszc commented 1 month ago

@viniciusdc I've opened PR #537 to update the docs per your request.

Also, in follow-up to your ask: re-deploying to enable KMS encryption on an existing Nebari EKS cluster (one deployed without encryption) does succeed. However, a subsequent re-deploy that removes the previously set EKS secrets encryption fails: Terraform attempts to delete and rebuild the EKS cluster, but cannot because node groups are still attached.
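This kind of forced replacement is visible in the plan before anything is applied. As a sketch of a pre-apply guard (an assumed workflow built on `terraform show -json <planfile>`, not part of Nebari):

```python
import json


def resources_forcing_replacement(plan_json: str) -> list:
    """Given the JSON emitted by `terraform show -json <planfile>`, return
    the addresses of resources Terraform plans to destroy and recreate."""
    plan = json.loads(plan_json)
    replaced = []
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        # A planned replacement carries both "delete" and "create" actions.
        if "delete" in actions and "create" in actions:
            replaced.append(change["address"])
    return replaced
```

Aborting the deploy whenever `module.kubernetes.aws_eks_cluster.main` appears in this list would surface the problem before the cluster is partially destroyed.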

viniciusdc commented 1 month ago

However, attempting thereafter to re-deploy to remove the previously set EKS secrets encryption will fail as terraform attempts to delete and rebuild the EKS cluster but cannot due to existing node groups.

Hi @joneszc, thanks for checking that out! I was already expecting it to fail, but I had another thing in mind: they might be connected. Can you post a sanitized output of the terraform error and any error messages you might encounter in the CloudTrail history? I suspect you will find something related to the KMS key in there.

The main reason for this request is to validate whether it would be beneficial to make this an immutable field or, depending on the error, whether we can instead document manual steps for users to disable it.

joneszc commented 4 weeks ago

Can you post a sanitized output of the terraform error and any error messages you might encounter in the CloudTrail history? I suspect you will find something related to the KMS key in there.

@viniciusdc

Nebari output after a failed attempt to re-deploy to remove the EKS cluster's envelope encryption of secrets:

[terraform]:   # module.kubernetes.aws_eks_cluster.main must be replaced
[terraform]: -/+ resource "aws_eks_cluster" "main" {
[terraform]:       ~ arn                       = "arn:aws:eks:us-east-1:<account-id>:cluster/nebari-test-dev" -> (known after apply)
[terraform]:       ~ certificate_authority     = [
[terraform]:           - {
[terraform]:               - data = "<>"
[terraform]:             },
[terraform]:         ] -> (known after apply)
[terraform]:       + cluster_id                = (known after apply)
[terraform]:       ~ created_at                = "2024-10-28 15:25:47.172 +0000 UTC" -> (known after apply)
[terraform]:       - enabled_cluster_log_types = [] -> null
[terraform]:       ~ endpoint                  = "https://0000000000000000000000000.gr7.us-east-1.eks.amazonaws.com" -> (known after apply)
[terraform]:       ~ id                        = "nebari-test-dev" -> (known after apply)
[terraform]:       ~ identity                  = [
[terraform]:           - {
[terraform]:               - oidc = [
[terraform]:                   - {
[terraform]:                       - issuer = "https://oidc.eks.us-east-1.amazonaws.com/id/0000000000000000"
[terraform]:                     },
[terraform]:                 ]
[terraform]:             },
[terraform]:         ] -> (known after apply)
[terraform]:         name                      = "nebari-test-dev"
[terraform]:       ~ platform_version          = "eks.17" -> (known after apply)
[terraform]:       ~ status                    = "ACTIVE" -> (known after apply)
[terraform]:         tags                      = {
[terraform]:             "Environment" = "dev"
[terraform]:             "Name"        = "nebari-test-dev"
[terraform]:             "Owner"       = "terraform"
[terraform]:             "Project"     = "nebari-test"
[terraform]:         }
[terraform]:         # (3 unchanged attributes hidden)
[terraform]:
[terraform]:       - access_config {
[terraform]:           - authentication_mode                         = "CONFIG_MAP" -> null
[terraform]:           - bootstrap_cluster_creator_admin_permissions = false -> null
[terraform]:         }
[terraform]:
[terraform]:       - encryption_config { # forces replacement
[terraform]:           - resources = [
[terraform]:               - "secrets",
[terraform]:             ] -> null
[terraform]:
[terraform]:           - provider {
[terraform]:               - key_arn = "arn:aws:kms:us-east-1:<account-id>:key/0000000000000000" -> null
[terraform]:             }
[terraform]:         }
[terraform]:
[terraform]:       - kubernetes_network_config {
[terraform]:           - ip_family         = "ipv4" -> null
[terraform]:           - service_ipv4_cidr = "172.20.0.0/16" -> null
[terraform]:         }
[terraform]:
[terraform]:       ~ vpc_config {
[terraform]:           ~ cluster_security_group_id = "sg-xxxxxxxxxxxxxxxxxx" -> (known after apply)
[terraform]:           ~ vpc_id                    = "vpc-xxxxxxxxxxxxxxxx" -> (known after apply)
[terraform]:             # (5 unchanged attributes hidden)
[terraform]:         }
[terraform]:     }
[terraform]:
[terraform]:   # module.kubernetes.aws_iam_openid_connect_provider.oidc_provider must be replaced
[terraform]: -/+ resource "aws_iam_openid_connect_provider" "oidc_provider" {
[terraform]:       ~ arn             = "arn:aws:iam::<account-id>:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/0000000000000000000000000" -> (known after apply)
[terraform]:       ~ id              = "arn:aws:iam::<account-id>:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/0000000000000000000000000" -> (known after apply)
[terraform]:         tags            = {
[terraform]:             "Environment" = "dev"
[terraform]:             "Name"        = "nebari-test-dev-eks-irsa"
[terraform]:             "Owner"       = "terraform"
[terraform]:             "Project"     = "nebari-test"
[terraform]:         }
[terraform]:       ~ thumbprint_list = [
[terraform]:           - "9e99a48a9960b14926bb7f3b02e22da2b0ab7280",
[terraform]:           - "06b25927c42a721631c1efd9431e648fa62e1e39",
[terraform]:           - "d9fe0a65fa00cabf61f5120d373a8135e1461f15",
[terraform]:           - "7f3682e963aa03a7bcd67f11b0fedae315af49d4",
[terraform]:         ] -> (known after apply)
[terraform]:       ~ url             = "oidc.eks.us-east-1.amazonaws.com/id/0000000000000000000000000" # forces replacement -> (known after apply) # forces replacement
[terraform]:         # (2 unchanged attributes hidden)
[terraform]:     }
[terraform]:
[terraform]:   # module.kubernetes.aws_iam_policy.cluster_encryption[0] will be destroyed
[terraform]:   # (because index [0] is out of range for count)
[terraform]:   - resource "aws_iam_policy" "cluster_encryption" {
[terraform]:       - arn         = "arn:aws:iam::<account-id>:policy/nebari-test-dev-eks-encryption-policy" -> null
[terraform]:       - description = "IAM policy for EKS cluster encryption" -> null
[terraform]:       - id          = "arn:aws:iam::<account-id>:policy/nebari-test-dev-eks-encryption-policy" -> null
[terraform]:       - name        = "nebari-test-dev-eks-encryption-policy" -> null
[terraform]:       - path        = "/" -> null
[terraform]:       - policy      = jsonencode(
[terraform]:             {
[terraform]:               - Statement = [
[terraform]:                   - {
[terraform]:                       - Action   = [
[terraform]:                           - "kms:ListGrants",
[terraform]:                           - "kms:Encrypt",
[terraform]:                           - "kms:DescribeKey",
[terraform]:                           - "kms:Decrypt",
[terraform]:                         ]
[terraform]:                       - Effect   = "Allow"
[terraform]:                       - Resource = "arn:aws:kms:us-east-1:<account-id>:key/3zzzzzzzzzzzzz"
[terraform]:                     },
[terraform]:                 ]
[terraform]:               - Version   = "2012-10-17"
[terraform]:             }
[terraform]:         ) -> null
[terraform]:       - policy_id   = "ANPARM6PEZIZXIYANUQUT" -> null
[terraform]:       - tags        = {} -> null
[terraform]:       - tags_all    = {} -> null
[terraform]:     }
[terraform]:
[terraform]:   # module.kubernetes.aws_iam_role_policy_attachment.cluster_encryption[0] will be destroyed
[terraform]:   # (because index [0] is out of range for count)
[terraform]:   - resource "aws_iam_role_policy_attachment" "cluster_encryption" {
[terraform]:       - id         = "nebari-test-dev-eks-cluster-role-00000000000000000" -> null
[terraform]:       - policy_arn = "arn:aws:iam::<account-id>:policy/nebari-test-dev-eks-encryption-policy" -> null
[terraform]:       - role       = "nebari-test-dev-eks-cluster-role" -> null
[terraform]:     }
[terraform]:
[terraform]: Plan: 3 to add, 0 to change, 5 to destroy.
[terraform]:
[terraform]: Changes to Outputs:
[terraform]:   ~ cluster_oidc_issuer_url = "https://oidc.eks.us-east-1.amazonaws.com/id/0000000000000000000000000" -> (known after apply)
[terraform]:   ~ kubernetes_credentials  = (sensitive value)
[terraform]:   ~ oidc_provider_arn       = "arn:aws:iam::<account-id>:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/0000000000000000000000000" -> (known after apply)
[terraform]: local_file.kubeconfig[0]: Destroying... [id=ebb9ba2900716cbac8f3zzzzzzzzzzzzz]
[terraform]: local_file.kubeconfig[0]: Destruction complete after 0s
[terraform]: module.kubernetes.aws_iam_openid_connect_provider.oidc_provider: Destroying... [id=arn:aws:iam::<account-id>:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/0000000000000000000000000]
[terraform]: module.kubernetes.aws_iam_openid_connect_provider.oidc_provider: Destruction complete after 0s
[terraform]: module.kubernetes.aws_eks_cluster.main: Destroying... [id=nebari-test-dev]
[terraform]: module.kubernetes.aws_eks_cluster.main: Still destroying... [id=nebari-test-dev, 10s elapsed]
[terraform]: module.kubernetes.aws_eks_cluster.main: Still destroying... [id=nebari-test-dev, 20s elapsed]
[terraform]: module.kubernetes.aws_eks_cluster.main: Still destroying... [id=nebari-test-dev, 30s elapsed]
[terraform]:
[terraform]: Error: deleting EKS Cluster (nebari-test-dev): operation error EKS: DeleteCluster, https response error StatusCode: 409, RequestID: de8f18ba-0abe-42ae-961f-86d8865fbcf3, ResourceInUseException: Cluster has nodegroups attached
[terraform]: 
[terraform]: 
[terraform]:
Traceback (most recent call last):
/home/ssm-user/nebari_private_test/nebari/src/_nebari/subcommands/deploy.py:92 in deploy

89   msg = "Digital Ocean support is currently being deprecated and will be removed
90   typer.confirm(msg)                                                     
91                                                                              
92   deploy_configuration(                                                      
93   config,                                                                
94   stages,                                                                
95   disable_prompt=disable_prompt,                                         

/home/ssm-user/nebari_private_test/nebari/src/_nebari/deploy.py:55 in deploy_configuration

52     s: hookspecs.NebariStage = stage(                                  
53     output_directory=pathlib.Path.cwd(), config=config             
54     )                                                                  
55     stack.enter_context(s.deploy(stage_outputs, disable_prompt))       
56                                                                        
57     if not disable_checks:                                             
58     s.check(stage_outputs, disable_prompt)                         

/usr/lib64/python3.11/contextlib.py:505 in enter_context                                

502   except AttributeError:                                                    
503   raise TypeError(f"'{cls.__module__}.{cls.__qualname__}' object does " 
504   f"not support the context manager protocol") from None
505   result = _enter(cm)                                                       
506   self._push_cm_exit(cm, _exit)                                             
507   return result                                                             
508 

/usr/lib64/python3.11/contextlib.py:137 in __enter__                                    

134   # they are only needed for recreation, which is not possible anymore      
135   del self.args, self.kwds, self.func                                       
136   try:                                                                      
137   return next(self.gen)                                                 
138   except StopIteration:                                                     
139   raise RuntimeError("generator didn't yield") from None                
140 

/home/ssm-user/nebari_private_test/nebari/src/_nebari/stages/infrastructure/__init__.py:961 in deploy

958 def deploy(                                                                   
959   self, stage_outputs: Dict[str, Dict[str, Any]], disable_prompt: bool = False
960 ):                                                                            
961   with super().deploy(stage_outputs, disable_prompt):                       
962   with kubernetes_provider_context(                                     
963     stage_outputs["stages/" + self.name]["kubernetes_credentials"]["value"]    
964   ):                                                                    

/usr/lib64/python3.11/contextlib.py:137 in __enter__                                    

134   # they are only needed for recreation, which is not possible anymore      
135   del self.args, self.kwds, self.func                                       
136   try:                                                                      
137   return next(self.gen)                                                 
138   except StopIteration:                                                     
139   raise RuntimeError("generator didn't yield") from None                
140 

/home/ssm-user/nebari_private_test/nebari/src/_nebari/stages/base.py:298 in deploy      

295   deploy_config["terraform_import"] = True                              
296   deploy_config["state_imports"] = state_imports                        
297                                                                             
298   self.set_outputs(stage_outputs, terraform.deploy(**deploy_config))        
299   self.post_deploy(stage_outputs, disable_prompt)                           
300   yield                                                                     
301 

/home/ssm-user/nebari_private_test/nebari/src/_nebari/provider/terraform.py:71 in deploy

 68     )                                                                 
 69                                                                             
 70   if terraform_apply:                                                       
 71   apply(directory, var_files=[f.name])                                  
 72                                                                             
 73   if terraform_destroy:                                                     
 74   destroy(directory, var_files=[f.name])

/home/ssm-user/nebari_private_test/nebari/src/_nebari/provider/terraform.py:153 in apply

150   + ["-var-file=" + _ for _ in var_files]                                   
151 )                                                                             
152 with timer(logger, "terraform apply"):                                        
153   run_terraform_subprocess(command, cwd=directory, prefix="terraform")      
154 
155
156 def output(directory=None):

/home/ssm-user/nebari_private_test/nebari/src/_nebari/provider/terraform.py:119 in run_terraform_subprocess                                                                

116 logger.info(f" terraform at {terraform_path}")                                
117 exit_code, output = run_subprocess_cmd([terraform_path] + processargs, **kwargs)
118 if exit_code != 0:                                                            
119   raise TerraformException("Terraform returned an error")                   
120 return output                                                                 
121 
122 
TerraformException: Terraform returned an error

Additional Error details from CloudTrail:

[screenshot: CloudTrail error details]

dcmcand commented 3 weeks ago

So @joneszc, am I reading that correctly: changing this option on an existing cluster will destroy and replace it? We should probably go ahead and make this field immutable then. We definitely don't want anyone accidentally destroying their deployment. The docs should also reflect that this option should only be set on fresh deploys.