costrouc closed this issue 1 year ago.
Would it be possible to adopt this workflow with qhub-hpc and a Terragrunt deployment? Worth thinking about how this extension mechanism could support that.
Implementing an extension mechanism for Nebari seems like a very good solution for companies/people that want to easily build and distribute features on top of Nebari.
We, at NaasAI, would like to expose:
We are a company managing JupyterHub infrastructure on top of Kubernetes, and we have built features on top of JupyterLab to make it easier for users to schedule notebook executions, share generated assets, build data products, etc.
Integrating with Nebari seems like a good fit for us as we will be able to leverage what Nebari's team is doing while allowing us to focus on our core business.
We would like to distribute naas on top of Nebari.
For that, we must be able to customize Nebari's deployment, but also get information about Nebari's deployment to configure our resources accordingly. For example, we need to know:
In terms of infrastructure management, we at NaasAI think that Infrastructure as Code (IaC) is a must, for multiple reasons that I won't detail here (unless you feel it would make this message clearer). Many companies would rather have an IaC solution than a manual process with a how-to that an operator needs to follow (though that is not always the case, of course).
We want to be able to:
We chose to use Terragrunt to manage our new infrastructure deployment as it seems to be a good fit for most use cases.
We have 3 repositories to manage our infrastructure deployment:

- `infrastructure-modules`: This is the place where we create all the Terraform modules that will later be referenced/used to deploy infrastructure. (Terragrunt infrastructure-module example)
- `_envcommon`: This one holds default values for our Terraform modules and pins specific Terraform module versions. We have one configuration per Terraform module. (Terragrunt _envcommon example)
- `infrastructure-live`: This one references the configuration in `_envcommon` and organizes deployment dependencies and environments. We split by environment/cloud_provider/region/project. (Terragrunt infrastructure-live example)

To integrate with Nebari we tested multiple solutions, but to be able to use our Terragrunt infrastructure we chose to create a Terraform module allowing us to:
This works for us today but has several drawbacks: for example, we have to specify the remote state settings (`S3 bucket`, `S3 prefix`, etc.) ourselves. Even though this is deterministic, it adds a step.

A big strength of the Nebari CLI that we think is worth keeping, no matter what is done next, is that it is very easy to deploy a Nebari infrastructure. You just need to run `pip install nebari` and follow the two-step deployment: configure your deployment, then actually deploy the infrastructure.
This will fit a lot of user needs.
On the other hand, there is a need to help users/companies that want to deploy Nebari and customize it using IaC.
Given that, as of today, the Nebari CLI is a wrapper around Terraform that orders and splits deployments into multiple stages, we could definitely add a parameter so that the CLI does not actually deploy the Nebari infrastructure, but instead templates/renders Terragrunt configuration, for example (Terragrunt being only one of many templating outputs we could support: Terraform, Pulumi, AWS CDK, and so on).
To give an example, let's say I want to deploy Nebari in our Naas Terragrunt infrastructure, in a `dev` environment, on `aws`, in `us-west-1`.
I would like to be able to do:
```shell
# Create the needed folder structure to match our way of splitting deployments.
mkdir -p infrastructure-live/dev/aws/us-west-1/nebari

# Go to the Nebari folder for the dev/aws/us-west-1 deployment.
cd infrastructure-live/dev/aws/us-west-1/nebari

# Install the Nebari CLI.
conda install nebari -c conda-forge

# Configure the deployment.
nebari init --guided-init

# Generate Terragrunt configuration (usually this is when you would run "nebari deploy -c nebari-config.yaml").
nebari generate -c nebari-config.yaml --template-to=Terragrunt
```
Then, if I were to execute the `tree` command in the `infrastructure-live` directory, I would get the following output:
```
.
└── dev
    └── aws
        └── us-west-1
            └── nebari
                ├── infrastructure
                │   └── terragrunt.hcl
                ├── kubernetes-ingress
                │   └── terragrunt.hcl
                ├── kubernetes-initialize
                │   └── terragrunt.hcl
                ├── kubernetes-keycloak
                │   └── terragrunt.hcl
                ├── kubernetes-keycloak-configuration
                │   └── terragrunt.hcl
                ├── kubernetes-services
                │   └── terragrunt.hcl
                ├── nebari-config.yaml
                └── nebari-tf-extensions
                    └── terragrunt.hcl

12 directories, 8 files
```
Now it would be up to me to deploy the infrastructure by running for example:
```shell
cd dev/aws/us-west-1/nebari
aws-vault exec <myprofile> -- terragrunt run-all apply
```
So now what happens with pluggy extensions?
This would allow us to build our own `nebari-naas` pluggy extension and use it like so:
```shell
cd dev/aws/us-west-1/nebari
pip install nebari-naas
nebari extensions nebari-naas generate -c nebari-config.yaml --template-to=Terragrunt
```
This would then generate a new directory with its own Terragrunt configuration, which could declare the previous Nebari stages as dependencies.
Example:
```
.
└── dev
    └── aws
        └── us-west-1
            └── nebari
                ├── infrastructure
                │   └── terragrunt.hcl
                ├── kubernetes-ingress
                │   └── terragrunt.hcl
                ├── kubernetes-initialize
                │   └── terragrunt.hcl
                ├── kubernetes-keycloak
                │   └── terragrunt.hcl
                ├── kubernetes-keycloak-configuration
                │   └── terragrunt.hcl
                ├── kubernetes-services
                │   └── terragrunt.hcl
                ├── nebari-config.yaml
                ├── nebari-naas            <-- new
                │   └── terragrunt.hcl     <-- new
                └── nebari-tf-extensions
                    └── terragrunt.hcl

13 directories, 9 files
```
Then, if we want to have multiple naas extensions as well, we could distribute:

```shell
pip install nebari-naas
pip install nebari-naas-extension-aaa
pip install nebari-naas-extension-bbb
pip install nebari-naas-extension-ccc
```
Here I mainly talked about generating Terragrunt configuration, but I think this should just be one additional way of deploying Nebari and extensions. Nebari and extensions should also be deployable solely using the Nebari CLI.
I tried to give as much information as possible, but please, if something is not clear enough or you feel that it needs more explanation, tell me and I will try to make it clearer.
I am looking forward to having the possibility to deploy Nebari in a very modular way while still complying with most users/companies needs in terms of deployment strategy. I think that this extension mechanism can be a very strategic move for the adoption of Nebari and the growth of its ecosystem.
This RFD is accepted, unanimously. :)
## Title

Extension Mechanism for Nebari
## Summary

Over the past 3 years we have consistently run into the issue that extending and customizing Nebari is a hard task. Several approaches have been added:

- `terraform_overrides` and `helm_overrides` keywords, to allow for arbitrary overrides of Terraform and Helm values
- `helm_extensions` in stage 8, which allow the addition of arbitrary Helm charts
- `tf_extensions`, which integrate OAuth2 and ingress to deploy a single Docker image

Despite these features we still have user needs that we are not addressing. Additionally, when we want to add a new service, it typically has to be added directly to the core of Nebari. We want to solve this by making extensions first class in Nebari.
## User benefit

I see quite a few benefits from this proposal:
## Design Proposal

Overall I propose we adopt pluggy. Pluggy has been adopted by many major projects, including datasette, conda, (TODO list more). Pluggy would allow us to expose a plugin interface and "install" extensions via setuptools entry points, making extension installation as easy as `pip install ...`
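To make this concrete, a minimal sketch of the pluggy wiring could look like the following. The hook name `nebari_stage` and both classes are purely illustrative, not an existing Nebari interface:

```python
import pluggy

# Markers scoped to a "nebari" plugin project.
hookspec = pluggy.HookspecMarker("nebari")
hookimpl = pluggy.HookimplMarker("nebari")

class NebariSpecs:
    """Hook specifications that extensions can implement."""

    @hookspec
    def nebari_stage(self):
        """Return the name of an extra deployment stage contributed by a plugin."""

class NaasPlugin:
    """Example third-party extension (in practice, shipped in its own package)."""

    @hookimpl
    def nebari_stage(self):
        return "nebari-naas"

pm = pluggy.PluginManager("nebari")
pm.add_hookspecs(NebariSpecs)
pm.register(NaasPlugin())
# In a real setup, plugins would instead be discovered from installed packages:
# pm.load_setuptools_entrypoints("nebari")

print(pm.hook.nebari_stage())  # -> ['nebari-naas']
```

A hook call collects one result per registered implementation, so core Nebari could fold extension-contributed stages into its own ordered stage list.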
### Usage from a high-level user standpoint

Once a user installs the extensions, we can view the installed extensions via:
### Plugin Interfaces

Within Nebari we will expose several plugins:

#### Subcommands

A plugin interface for arbitrary additional `typer` commands. All commands will be passed the Nebari config along with all command-line arguments specified by the user. Conda has a similar typer-based approach for their system.

#### Stages

Nebari will use pluggy within its core and separate each stage into a pluggy `Stage`. Each stage will keep its original name.

## Alternatives or approaches considered (if any)
As far as plugin/extension systems go, I am only aware of two major ones within the Python ecosystem:
## Best practices

This will encourage the practice of extending Nebari via extensions instead of direct PRs to the core.
## User impact

It is possible to make this transition seamless to the user without changing behavior.
## Unresolved questions

I feel confident in the approach, since I have seen other projects use pluggy successfully for similar work.