nebari-dev / governance

✨ Governance-related work for Nebari-dev

RFD - Add Nebari Configuration Options for Air-Gapped or Secure Deployments #52

Open joneszc opened 3 months ago

joneszc commented 3 months ago
| Status | Draft 🚧 / Open for comments 💬 |
| :-- | :-- |
| Author(s) | @joneszc |
| Date Created | 15-08-2024 |
| Date Last updated | 15-08-2024 |
| Decision deadline | N/A |

Title

Addition of Nebari Configuration Options for Deploying Nebari in Air-Gapped and/or Secure Environments

Summary

There are currently a number of options available to optimize Nebari configurations for deploying in AWS air-gapped networks, private subnets, or secure environments.

For example:

However, configuration options could be expanded to enable:

User benefit

Design Proposal

Alternatives or approaches considered (if any)

In addition to using node pre-bootstrap commands to override containerd configs for setting private registry mirrors, additional Terraform and Helm override options could be enabled to specify container images and tags that reflect custom-built or privately hosted container images.

Best practices

User impact

Unresolved questions

tylergraff commented 2 months ago

I propose adding a "Nebari Secure Deployment Guide" to this RFD.

This would take the form of one or more nebari-config.yaml files which each utilize inline comments to comprehensively document configuration parameters relevant to various aspects of security. For example, one config file may demonstrate how to override the default docker container locations with a custom-specified repository. This config file could be named e.g. "nebari-config-custom-docker-repo.yaml".

Another example could specify the above docker repository along with AMI IDs and elimination of the AWS internet gateway. This config file could be called e.g. "nebari-config-aws-airgap.yaml". These are only examples: further discussion can refine exactly what configuration goes into each yaml file and how the files are named.

These files could then be used in associated CI/CD pipelines to validate that the configuration state they describe continues to be supported by Nebari as new versions are released. I propose that the work to hook up this CI/CD mechanism is not part of this RFD.

I do not have a recommended location for these nebari-config.yaml files yet.

Adam-D-Lewis commented 2 months ago

Somewhat related, I know @viniciusdc was working on a way to auto-generate documentation from the code. We could potentially add the documentation describing each new nebari-config setting to the pydantic models themselves. Do you have an issue that shows what you were working on, @viniciusdc?
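As a rough illustration of keeping the documentation next to the settings, the sketch below attaches a description to each field and renders a Markdown list from it. It uses standard-library dataclasses as a stand-in for the pydantic models. The class `NodeGroupDocExample`, its fields, and `render_docs` are hypothetical; they are not Nebari's actual schema or the tooling mentioned above.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class NodeGroupDocExample:
    """Hypothetical settings model; not Nebari's real schema."""

    instance: str = field(
        default="m5.2xlarge",
        metadata={"description": "EC2 instance type for the node group."},
    )
    node_ami: Optional[str] = field(
        default=None,
        metadata={"description": "Optional AMI ID to use for the nodes."},
    )


def render_docs(model: type) -> str:
    """Render one Markdown bullet per field from its description metadata."""
    lines = []
    for f in fields(model):
        description = f.metadata.get("description", "(undocumented)")
        lines.append(f"- `{f.name}` (default: `{f.default!r}`): {description}")
    return "\n".join(lines)


docs = render_docs(NodeGroupDocExample)
print(docs)
```

Because the description lives on the field itself, a setting cannot be added to the model without a visible place to document it.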

dcmcand commented 2 months ago
* Enable option to control EKS cluster endpoint access settings as discussed in [#2586](https://github.com/nebari-dev/nebari/issues/2586) and proposed in [PR#2618](https://github.com/nebari-dev/nebari/pull/2618):
amazon_web_services:
  eks_endpoint_access: 'private'

I think this is fine and your proposed method in #2618 makes sense.

* Enable option to run custom launch commands on EKS nodes as discussed in [#2603](https://github.com/nebari-dev/nebari/issues/2603) and proposed in [PR#2621](https://github.com/nebari-dev/nebari/pull/2621)
  Note that this would also resolve the functionality to set private container registries/mirrors by adding containerd configs/imports:
amazon_web_services:
  node_prebootstrap_command: |
    #!/bin/bash
    mkdir -p /etc/containerd/certs.d/_default
    cat <<-EOT > /etc/containerd/certs.d/_default/hosts.toml
    [host."https://registry.gitlab.example.com"]
      capabilities = ["pull", "resolve"]
    EOT

I think this syntax would be quite awkward. As much as possible, prebaking stuff like enabling other repos into your AMI would solve this. For running a script on startup, there is the user data approach (see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/user-data.html). Perhaps something like specifying a path to a user data file or config would work. Beyond the awkwardness of including the script inline, I am also concerned that it is a potential security footgun.
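A file-based variant of that idea might look like the following minimal sketch. It assumes a hypothetical helper, `load_prebootstrap_script`, that reads the script from a path instead of from an inline block in nebari-config.yaml. Neither the helper nor such a path option is confirmed by this thread.

```python
import tempfile
from pathlib import Path

# Same containerd mirror script as the inline example, kept in its own file.
EXAMPLE_SCRIPT = """#!/bin/bash
mkdir -p /etc/containerd/certs.d/_default
cat <<EOT > /etc/containerd/certs.d/_default/hosts.toml
[host."https://registry.gitlab.example.com"]
  capabilities = ["pull", "resolve"]
EOT
"""


def load_prebootstrap_script(path: str) -> str:
    """Read a pre-bootstrap script from disk and run a basic sanity check."""
    script = Path(path).read_text(encoding="utf-8")
    if not script.startswith("#!"):
        raise ValueError(f"{path} must start with a shebang line such as #!/bin/bash")
    return script


# Usage: write the example to a temporary file, then load it by path.
with tempfile.TemporaryDirectory() as tmp:
    script_path = Path(tmp) / "prebootstrap.sh"
    script_path.write_text(EXAMPLE_SCRIPT, encoding="utf-8")
    loaded = load_prebootstrap_script(str(script_path))
```

Keeping the script in a separate file lets it be reviewed, linted, and version-controlled like any other shell script, which addresses part of the awkwardness of embedding it in YAML.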

* Enable option to specify custom AMI IDs for EKS nodes as discussed in [#2604](https://github.com/nebari-dev/nebari/issues/2604) and also proposed in [PR#2621](https://github.com/nebari-dev/nebari/pull/2621)
amazon_web_services:
  node_groups:
    general:
      instance: m5.2xlarge
      custom_ami: ami-0xzxzxzxzxzxzx
      min_nodes: 1
      max_nodes: 1
      gpu: false
      single_subnet: false

Slight nitpick, but rather than custom_ami, we could just call it node_ami. There is no reason it has to be custom; a user may want to just use a vendor-provided node AMI for some reason.

joneszc commented 2 months ago

@viniciusdc @dcmcand

The original intent of the Add aws_launch_template PR was to make it easy for the user to run commands and use customized AmazonLinux AMIs through the launch_template/user_data sections, with less crisscrossing of config variable options and more built-in logic to reduce the risk of blocking nodes from joining the cluster (e.g. missing or faulty user-defined bootstrap.sh commands).

Your replacement PR includes the same launch_template/user_data approach as originally proposed in PR and still includes the inline scripting option with the pre_bootstrap_command/node_prebootstrap_command var. The main differences I see between the PRs is that your rendition moves the pre-bootstrap command option to the node_groups level to cut out some previously proposed logic in the terraform, while also requiring more due diligence from the user in mapping additional variables.

When specifying an AMI image_id, your PR will require the user to manually set the bootstrap.sh command (PR#2621 triggers a pre-set bootstrap.sh command as necessary and EKS otherwise sets the bootstrap.sh command automatically when image_id is not specified). You are proposing to call this variable, for the second part of the user_data block, user_data, which seems misleading since the overarching user_data section might already include pre-bootstrap commands and could require a bootstrap.sh command. If you are setting the onus on the user to provide the bootstrap.sh command, please rename the variable to something like override_bootstrap_command to clue the user in on providing the /etc/eks/bootstrap.sh command. Additionally, your PR adds updates to relocate some previously existing logic, from terraform to python, for setting ami_type but falls short of checking to ensure that the ami_type is always CUSTOM when a user specifies an AMI image_id. EKS will fail if ami_type is set to anything other than CUSTOM when setting image_id.
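The missing check described above can be sketched roughly as follows. The `NodeGroup` dataclass, its field names, and `resolve_ami_type` are hypothetical stand-ins rather than the code in either PR, and the AL2 defaults are only an assumption for illustration. The one rule taken from the discussion is that `ami_type` must be CUSTOM whenever an AMI ID is specified.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeGroup:
    """Hypothetical node group settings; field names are illustrative only."""

    instance: str
    gpu: bool = False
    ami_id: Optional[str] = None
    ami_type: Optional[str] = None


def resolve_ami_type(group: NodeGroup) -> str:
    """Return the ami_type to pass to EKS, enforcing CUSTOM when an AMI ID is set."""
    if group.ami_id is not None:
        if group.ami_type not in (None, "CUSTOM"):
            raise ValueError(
                f"ami_type must be CUSTOM when ami_id is set, got {group.ami_type!r}"
            )
        return "CUSTOM"
    if group.ami_type is not None:
        return group.ami_type
    # Assumed defaults for illustration: the stock AL2 AMI types.
    return "AL2_x86_64_GPU" if group.gpu else "AL2_x86_64"


custom = resolve_ami_type(NodeGroup(instance="m5.2xlarge", ami_id="ami-0123456789abcdef0"))
default = resolve_ami_type(NodeGroup(instance="m5.2xlarge"))
```

Failing at config-validation time surfaces the mistake before Terraform runs, rather than as an EKS error during deployment.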

Our original PR#2621 was not engaging users to specify ami_type; rather, the intent was to enable users to customize the default AmazonLinux AMIs for security purposes (e.g. apply STIGs). I'm glad to see you aren't giving the user free rein to set the ami_type, which would expand the scope of this PR to accommodate user_data configuration schema updates for AmazonLinux2023, which transitions from Content-Type: text/x-shellscript; charset="us-ascii" to Content-Type: application/node.eks.aws (YAML). We are anticipating addressing Nebari's AWS migration from AL2 to AL2023 in a separate security Issue/PR--in due course of the upcoming deprecation of AL2.
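To make the content-type difference concrete, here is a minimal sketch that wraps a user data body in a multipart MIME document using Python's standard `email` package. The function `build_user_data` and its `al2023` flag are hypothetical, and the NodeConfig body in the usage lines is a placeholder rather than a complete configuration. Only the two Content-Type values come from the comment above.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.nonmultipart import MIMENonMultipart
from email.mime.text import MIMEText


def build_user_data(body: str, al2023: bool = False) -> str:
    """Wrap `body` in a multipart MIME document with the matching Content-Type."""
    message = MIMEMultipart("mixed")
    if al2023:
        # AL2023: YAML body with Content-Type application/node.eks.aws
        part = MIMENonMultipart("application", "node.eks.aws")
        part.set_payload(body)
    else:
        # AL2: shell script with Content-Type text/x-shellscript; charset="us-ascii"
        part = MIMEText(body, "x-shellscript", "us-ascii")
    message.attach(part)
    return message.as_string()


al2_user_data = build_user_data("#!/bin/bash\necho 'pre-bootstrap step'\n")
al2023_user_data = build_user_data(
    "apiVersion: node.eks.aws/v1alpha1\nkind: NodeConfig\n",  # placeholder body
    al2023=True,
)
```

The two branches differ only in the part type, which suggests the AL2 to AL2023 migration could be isolated to the template layer rather than the user-facing config.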

viniciusdc commented 2 months ago

Hi @joneszc, thanks for the valuable follow-up. Indeed, I took the liberty of expanding the PR to be a bit more generic in the sense of AMI customization. As the draft suggests, that was just a small passthrough to see how the config would be exposed to a user, in which I was already considering the scope of a "security" deployment option or set of settings that lead to that in the future.

> The main differences I see between the PRs is that your rendition moves the pre-bootstrap command option to the node_groups level to cut out some previously proposed logic in the terraform, while also requiring more due diligence from the user in mapping additional variables.

It's worth noting that I've kept your original option to set the launch_template as a global config for all node groups. So, both the node_groups and the aws provider field will have access to that variable, though a node_group.launch_template would still have priority over the global one when available.
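The precedence described here can be sketched in a few lines. The function name and the dictionary shapes are hypothetical, and the sketch assumes the node-group value replaces the global one wholesale rather than being merged with it key by key.

```python
from typing import Optional


def effective_launch_template(
    global_template: Optional[dict],
    node_group_template: Optional[dict],
) -> Optional[dict]:
    """Prefer the node-group launch_template; fall back to the global one."""
    if node_group_template is not None:
        return node_group_template
    return global_template


global_lt = {"pre_bootstrap_command": "echo global"}
group_lt = {"pre_bootstrap_command": "echo node-group"}

overridden = effective_launch_template(global_lt, group_lt)
inherited = effective_launch_template(global_lt, None)
```

Whether a partial node-group template should inherit unset keys from the global one is a separate design question that this sketch does not settle.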

> You are proposing to call this variable for the second part of the user_data block, user_data, which seems misleading since the overarching user_data section might already include pre-bootstrap commands and could require a bootstrap.sh command

I thought about that, and I agree the name needs to be more accurate, since it's not exactly what it says it is. At the same time, I will enforce that this can only be passed when the AMI type is set to CUSTOM. As you also mentioned, I haven't had time to add that to the PR yet, but it was the original goal.

As a counter-suggestion, what about override_user_data? While I see that override_bootstrap_command gives the user good direction, I think it's limiting compared with the broader flexibility of MIME. But I am not strongly opinionated on this.

> Our original https://github.com/nebari-dev/nebari/pull/2621 was not engaging users to specify ami_type; rather, the intent was to enable users to customize the default AmazonLinux AMIs for security purposes (e.g. apply STIGs).

That's also different from the PR's intention. The main goal was to clean up the handling logic only from within the Terraform resources; the reason it shows up to users right now is mainly due to a current narrow distinction between what should be passed down as a Terraform variable and what should be allowed in the nebari-config.yaml. They consume from the same model right now, but in theory, they should be separate entities, and that is something I plan to address on another occasion.

> AmazonLinux2023, which transitions from Content-Type: text/x-shellscript; charset="us-ascii" to Content-Type: application/node.eks.aws (YAML). We anticipate addressing Nebari's AWS migration from AL2 to AL2023 in a separate security Issue/PR--in due course of the upcoming deprecation of AL2.

You mentioned this exciting prospect. Right now, in any of the given PRs, we are "hardcoding" that as part of the template file, which, while not harmful, is not preferable. So maybe the best course of action would be to leverage the user_data as a path to a template file and only guarantee a set of variables to these templates, such as certificate, cluster_name, etc.
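One way to picture the template-file idea is the sketch below, which uses Python's `string.Template` purely for illustration; the PRs under discussion render their templates in Terraform, not Python. `GUARANTEED_VARIABLES` and `render_user_data` are hypothetical names, and the guaranteed set contains only the two variables named above.

```python
from string import Template

# Hypothetical contract: variables every user-supplied template may rely on.
GUARANTEED_VARIABLES = {"cluster_name", "certificate"}


def render_user_data(template_text: str, values: dict) -> str:
    """Fill a user-supplied template after checking the guaranteed variables."""
    missing = GUARANTEED_VARIABLES - set(values)
    if missing:
        raise KeyError(f"missing guaranteed variables: {sorted(missing)}")
    # safe_substitute leaves other $names (e.g. shell variables) untouched.
    return Template(template_text).safe_substitute(values)


template_text = (
    "#!/bin/bash\n"
    "/etc/eks/bootstrap.sh ${cluster_name} --b64-cluster-ca ${certificate}\n"
)
rendered = render_user_data(
    template_text,
    {"cluster_name": "demo", "certificate": "QUJD"},
)
```

With this split, the AL2 versus AL2023 format lives entirely in the user's template file, and the deployment tool only has to keep the variable contract stable.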

joneszc commented 2 months ago

@viniciusdc

I could see calling the variable "user_data" if you weren't including the pre_bootstrap_command var. My team's use cases, for which we originally requested these features, are predominantly in favor of the pre-bootstrap command option in conjunction with taking the burden off the Nebari user for setting the /etc/eks/bootstrap.sh command. Again, in our original PR, we included logic to trigger the bootstrap.sh command+args, as is necessary when using a CUSTOM AMI. We pondered adding a bootstrap_extra_args or bootstrap_args_override var but didn't see that as an imminent need of Nebari.

Since you are, in effect, requiring users to manually enter the bootstrap.sh command when setting ami_id--or else face the pitfall of nodes failing to join the cluster--while also including the pre_bootstrap_command, you are potentially dealing with three chronological parts to your user_data: pre-bootstrap-user-data, bootstrap-user-data, and post-bootstrap-user-data. You could wrap the entirety of the pre-bootstrap + bootstrap.sh-override + post-bootstrap options into a single variable and continue to call it "user_data", or else follow the models of either eksctl, which enables users to enter preBootstrapCommands and/or overrideBootstrapCommand (onus is on the user to ensure the bootstrap.sh command is manually entered when specifying an ami-id), or the AWS EKS user_data Terraform submodule, which enables both pre_bootstrap_user_data and post_bootstrap_user_data while also offering the option to set enable_bootstrap_user_data to true/false. The point is, if a Nebari user wants to run a custom AMI, and you don't include the /etc/eks/bootstrap.sh command in the user_data file for them, then they will need to know exactly under which variable to write the bootstrap.sh command+args.
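The three chronological parts can be sketched as a small composition function. All names here (`compose_user_data` and its parameters) are hypothetical, and the default bootstrap line passes only the cluster name, which is a simplification. The behavior modeled is the one argued for above: with a custom AMI, the bootstrap.sh call is inserted automatically unless the user overrides it, and without a custom AMI it is left to EKS.

```python
from typing import Optional


def compose_user_data(
    cluster_name: str,
    custom_ami: bool,
    pre_bootstrap: Optional[str] = None,
    override_bootstrap: Optional[str] = None,
    post_bootstrap: Optional[str] = None,
) -> str:
    """Join pre-bootstrap, bootstrap, and post-bootstrap parts in order."""
    parts = ["#!/bin/bash"]
    if pre_bootstrap:
        parts.append(pre_bootstrap.strip())
    if override_bootstrap:
        # The user takes responsibility for the bootstrap call.
        parts.append(override_bootstrap.strip())
    elif custom_ami:
        # Custom AMI and no override: supply a default so nodes can join.
        parts.append(f"/etc/eks/bootstrap.sh {cluster_name}")
    if post_bootstrap:
        parts.append(post_bootstrap.strip())
    return "\n".join(parts) + "\n"


with_custom_ami = compose_user_data("demo", custom_ami=True, pre_bootstrap="echo pre")
without_custom_ami = compose_user_data("demo", custom_ami=False, pre_bootstrap="echo pre")
```

Under this model, a user who only wants pre-bootstrap commands never has to know the bootstrap.sh syntax, while the override remains available for those who need it.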

tylergraff commented 1 month ago

@viniciusdc @dcmcand I believe this RFD has served its purpose and can be closed. The changes discussed here have been implemented and merged, and are slated for release 2024.9.1.