interuss / dss

InterUSS Platform's implementation of the ASTM DSS concept for RID and flight coordination.

Simplify onboarding new participants, including for more major cloud providers #874

Open barroco opened 2 years ago

barroco commented 2 years ago

The instructions for bringing up a DSS instance are currently quite actionable (complete and clear), but they are very long and require a fair amount of engineering expertise. We have a tool under development called deployment_manager which should simplify this process substantially and therefore make deployment of a DSS instance easier.

Deployment instructions: https://github.com/interuss/dss/tree/master/build
Deployment tool: https://github.com/interuss/dss/tree/master/monitoring/deployment_manager

barroco commented 2 years ago

Following discussions with @BenjaminPelletier, @BradNicolle, and @marcadamsge, here is the plan to update the DSS deployment approach to support other cloud providers while keeping it manageable for InterUSS over time.

Background

The deployment of the DSS is currently documented mostly in a README. Kubernetes (K8s) deployment instructions only cover GKE. Tanka is used to generate and configure Kubernetes resources. In addition, the DSS codebase is being refactored to require only one container instead of the current two. Most of the complexity lies in getting a Kubernetes cluster with CockroachDB running and ready to be pooled, and in the pooling steps themselves. We have undertaken the process of extracting self-contained modules to separate repositories. Finally, we are starting the work to support other cloud providers.

Default DSS infrastructure

The DSS is composed of two services to run (we treat the http-gateway and core-service applications as one, since refactoring to merge them is under way): the DSS API and the CockroachDB database. In addition, the current configuration proposes the following supporting services:

The requirements for the InterUSS standard deployment of the DSS in terms of infrastructure are:

Objectives and change plan overview

1. Infrastructure as code

Conceptually, the deployment will be broken down into three main categories:

Infrastructure: This stage is responsible for the cloud resources required to run the DSS services. It includes the Kubernetes cluster creation, cluster nodes, load balancer and associated fixed IPs, etc. This stage is cloud provider specific; the objective is to support Amazon Web Services (EKS), Azure (AKS), and Google (GKE). To manage multi-cloud resources, we propose to use Terraform providers [C.1]. Using Terraform providers will offer the following benefits of infrastructure as code:

- Limit the number of untested command line steps in the READMEs.
- Allow users to keep track of the infrastructure lifecycle and run simple upgrades.
- Practice multi-cloud deployment as part of the CI/CD.
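
As a sketch of the intended user experience for this stage (assuming a provider-specific Terraform configuration for GKE; module and variable names are not defined yet and are illustrative only):

    # Sketch only: configuration layout and variable names are hypothetical.
    # Initialize the provider-specific configuration (downloads the required providers).
    terraform init

    # Review the cloud resources (cluster, node pool, static IP, ...) before creating them.
    terraform plan -var-file=dss-gke.tfvars

    # Create or update the infrastructure; the Terraform state tracks its lifecycle,
    # which enables simple upgrades and repeatable CI/CD runs.
    terraform apply -var-file=dss-gke.tfvars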

Services: The ambition is for the services stage to be cloud provider agnostic. It will be responsible for managing Kubernetes resources. We will distinguish core services, which are the minimal set of services required by the DSS, from supporting services, which may be of interest to users wishing to operate the DSS out of the box.

Currently, services are deployed using Tanka, which provides a templating mechanism for K8s manifests. The second main change proposal is to replace Tanka with Helm [C.2]. In addition to templating, Helm offers the benefit of packaging and publishing charts so that more advanced users can reuse them for their own deployments. Helm is especially well suited for GitOps deployments: charts are versioned and can be used to automate the upgrade lifecycle, they can be published to cloud providers' container registries, and Helm supports hooks and tests to allow sequences of operations during upgrades and validation steps.
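
As a sketch of what the services stage could look like with Helm (the chart path, release name, and values file are hypothetical until a chart actually exists):

    # Render the manifests locally to inspect them, similar to what Tanka does today.
    helm template dss deploy/services/helm -f services.yaml

    # Install or upgrade the DSS release into the cluster created by the infrastructure stage.
    helm upgrade --install dss deploy/services/helm -f services.yaml

    # Run any tests defined by the chart, e.g. validation steps after an upgrade.
    helm test dss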

Operations: Diagnostic and utility operations such as certificate management may be simplified using the deployment_manager CLI tool / pod.

To keep the learning curve and maintenance burden low, new users should be able to deploy the DSS with knowledge of Terraform only. Advanced users running their own infrastructure should be able to deploy the DSS using the Helm chart directly.

2. New repository structure

This is an opportunity to reorganize the repository structure incrementally to split build and deployment [C.3]. All assets are currently located in the build folder and expect users to work by default in an ignored folder build/workspace/. A new folder at the root of the repository may be created with the following structure:

/deploy
    infrastructure (terraform)
        aks *
        eks *
        gke *
    common (common modules, if needed)
    services
        tanka 
        helm *
    operations
        README (how to use the deployment manager)
        scripts
            make_certs
            apply_certs
    workspace (environment definitions, custom to each user)
        example
            example.tfvars (User variables for the infrastructure deployment)
            services.yaml (User values for the helm chart)
            main.tf (terraform deployment specification)
            certs/ (Generated certificates - this should move to the secret manager store, see [C.7])
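
To illustrate how the pieces in such a workspace would fit together, here is a hedged end-to-end sketch (all file and folder names follow the hypothetical layout above):

    # Infrastructure stage: main.tf wires the provider-specific Terraform modules,
    # configured by the user variables in example.tfvars.
    cd deploy/workspace/example
    terraform init
    terraform apply -var-file=example.tfvars

    # Services stage: deploy the DSS with the Helm chart, configured by services.yaml.
    helm upgrade --install dss ../../services/helm -f services.yaml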

3. Extract deployment example to a new repository

Terraform modules, Helm charts and the deployment_manager CLI can be packaged and published [C.4]. Once those components can be installed from a publicly available registry, an example repository could be created to support users working outside the main dss repository on their own deployments [C.5].
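
As an illustration of what consuming a published chart could look like (the registry location and version are placeholders; nothing is published there today):

    # Helm (>= 3.8) can pull and install charts directly from an OCI registry.
    helm pull oci://ghcr.io/interuss/charts/dss --version 0.1.0
    helm upgrade --install dss oci://ghcr.io/interuss/charts/dss --version 0.1.0 -f services.yaml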

4. Use secret manager to store the generated certificates

Currently, certificates are generated in the repository in an ignored folder. They should instead be stored in a secret manager [C.7]. The following services are available:

- Google: https://cloud.google.com/secret-manager
- AWS: https://aws.amazon.com/fr/secrets-manager/
- Azure: https://azure.microsoft.com/en-us/products/key-vault/

The secret manager will be provisioned by the infrastructure stage and filled and updated by the deployment manager. Secrets will be exposed as a K8s resource in the cluster or via the CLI for local usage.
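
A hedged sketch of the certificate flow, using Google Secret Manager as an example (secret and file names are illustrative; the other providers offer equivalent CLIs):

    # Store a generated certificate in Secret Manager instead of an ignored certs/ folder.
    gcloud secrets create dss-cockroachdb-ca --replication-policy=automatic
    gcloud secrets versions add dss-cockroachdb-ca --data-file=certs/ca.crt

    # Expose it to the cluster as a K8s secret, or retrieve it locally via the CLI.
    gcloud secrets versions access latest --secret=dss-cockroachdb-ca > ca.crt
    kubectl create secret generic cockroachdb-ca --from-file=ca.crt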

5. Automatically test the deployment

Once the infrastructure and the services can be deployed using infrastructure as code, the pooling procedure of a DSS Region deployment with multi-cloud DSS instances can be added to the CI/CD [C.6]. The pooling procedure will be orchestrated by the deployment_manager. This will help committers and contributors gain confidence in contributions and changes to the deployment procedure, and detect unnoticed changes from cloud providers.
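
A rough sketch of what such a CI job could run; this is heavily hedged since the pooling orchestration by the deployment_manager is not defined yet and all paths, variable files, and contexts below are placeholders:

    # Bring up DSS instances on two different clouds using the Terraform stages.
    terraform -chdir=deploy/infrastructure/gke apply -auto-approve -var-file=ci.tfvars
    terraform -chdir=deploy/infrastructure/eks apply -auto-approve -var-file=ci.tfvars

    # Deploy the services on both clusters with the Helm chart.
    helm upgrade --install dss deploy/services/helm -f ci-gke.yaml --kube-context gke-ci
    helm upgrade --install dss deploy/services/helm -f ci-eks.yaml --kube-context eks-ci

    # Pool the two instances into one DSS Region and verify synchronization.
    # (Placeholder: the actual deployment_manager command does not exist yet.)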

Changes summary

Priority 1

C.1: Introduction of Terraform to manage the infrastructure stage for each cloud provider.
C.3: Reorganize the dss repository to make a separation between build and deployment.
C.2: Replacement of Tanka with Helm charts.
C.4: Publish the DSS Helm chart and Terraform modules for each cloud provider to simplify usage outside of the repository.
C.5: Example repository using published artifacts.

Priority 2

C.7: Use secret manager to store the certificates.

Priority 3

C.6: Test in the CI/CD the deployment of a DSS Region with multi-cloud DSS instances, including the pooling procedure.