populationgenomics / cpg-infrastructure

This repository is used to manage the infrastructure at the CPG
MIT License

CPG Infrastructure

The CPG manages its cloud infrastructure through Pulumi. Specifically, we have developed an abstraction over the GCP and Azure clouds that allows us, with a single driver.py, to spin up infrastructure on both GCP and Azure. We manage all infrastructure for all datasets in one stack.

This repository contains all the driving code to build our Pulumi stack, but none of the actual configuration. In your own environment, once you have a CPGInfrastructureConfig and each dataset's CPGDatasetConfig, you can instantiate a CPGInfrastructure object and call main() within a Pulumi context:

# inside a pulumi context
config = CPGInfrastructureConfig.from_dict(...)
datasets = [CPGDatasetConfig.from_dict(...) for d in _datasets]
infra = CPGInfrastructure(config, datasets)
infra.main()

This declares the Pulumi resources, which Pulumi then creates (or updates) when the stack is deployed.

Overview

There are 3 levels to our infrastructure, which are represented by the 3 driver classes below:

Note that we often refer to a common dataset, which is where we place most CPG-wide infrastructure; by default, all datasets have access to resources within this common dataset.

Configuration

The core of the configuration is the CPGInfrastructureConfig and CPGDatasetConfig classes. These validate on construction, so you'll know whether you have a valid config before running any further code.

The CPGInfrastructureConfig provides config information about the whole infrastructure. This still contains references to resources that were created manually - and is structured into sections. See the CPGInfrastructureConfig class for more information (it's fairly well documented).

The CPGDatasetConfig contains configuration information for each dataset to deploy. Note that you MUST supply a config for the common dataset.
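The validate-on-construction behaviour can be sketched with a plain dataclass; the field names below are illustrative, not the real CPGInfrastructureConfig/CPGDatasetConfig schema:

```python
# Minimal sketch of fail-fast config validation: a bad config raises
# before any infrastructure code runs. Field names are made up.
from dataclasses import dataclass


@dataclass
class DatasetConfig:
    dataset: str
    gcp_project: str

    def __post_init__(self):
        # Validate on construction, mirroring CPGDatasetConfig's behaviour.
        if not self.dataset:
            raise ValueError('dataset name must be non-empty')
        if not self.gcp_project:
            raise ValueError(f'{self.dataset}: gcp_project is required')

    @classmethod
    def from_dict(cls, d: dict) -> 'DatasetConfig':
        return cls(**d)


configs = [DatasetConfig.from_dict({'dataset': 'common', 'gcp_project': 'cpg-common'})]
# A config for the 'common' dataset must always be present:
assert any(c.dataset == 'common' for c in configs)
```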

Driver

Driver classes:

For the most part, each tangible resource (bucket, artifact registry, secret) is a cached property. This lets us reference it multiple times without creating multiple copies, and if the property is never accessed, the resource is never created. A further benefit is that we don't have to fully initialise a dataset before other drivers can use it.

@cached_property
def main_bucket(self):
    return self.infra.create_bucket(
        'main',
        lifecycle_rules=[self.infra.bucket_rule_undelete()],
        autoclass=self.dataset_config.autoclass,
    )
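The laziness and memoisation this relies on is just standard functools.cached_property behaviour, which this self-contained sketch demonstrates (the counter stands in for real resource creation):

```python
# Demonstrates why cached_property suits resource declaration: the resource
# is created at most once, and only if something actually asks for it.
from functools import cached_property


class Dataset:
    def __init__(self):
        self.created = 0

    @cached_property
    def main_bucket(self):
        # Stand-in for resource creation; the real driver declares a bucket here.
        self.created += 1
        return f'bucket-{self.created}'


d = Dataset()
assert d.created == 0                  # never accessed -> never created
assert d.main_bucket == d.main_bucket  # accessed twice...
assert d.created == 1                  # ...created once
```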

When the CPGInfrastructure creates each CPGDatasetInfrastructure (which in turn creates each CPGDatasetCloudInfrastructure), it passes a reference to itself, so that a dataset can access org-wide resources.
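A minimal sketch of that parent-reference pattern, with illustrative class and attribute names (the real drivers are more involved):

```python
# The org-level driver passes itself down, so each dataset driver can reach
# org-wide resources without duplicating them. Names here are illustrative.
class OrgInfraSketch:
    def __init__(self, dataset_names):
        self.shared_registry = 'org-wide-artifact-registry'
        self.datasets = [DatasetInfraSketch(self, n) for n in dataset_names]


class DatasetInfraSketch:
    def __init__(self, root, name):
        self.root = root  # reference back to the org-level driver
        self.name = name

    def registry(self):
        # Org-wide resource, shared rather than re-created per dataset.
        return self.root.shared_registry


org = OrgInfraSketch(['common', 'dataset-a'])
assert org.datasets[1].registry() == 'org-wide-artifact-registry'
```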

Group memberships

We manage groups and memberships related to datasets in Pulumi. This allows us to have:

There are 4 places where group memberships are stored:

Note that in our implementation, we create placeholder groups throughout the majority of the code, and at the end we call CPGInfrastructure.finalize_groups to create outputs to the aforementioned 4 places.
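A hedged sketch of that placeholder-then-finalize pattern: memberships accumulate throughout the run, and one finalize step emits the results (the real finalize_groups writes to the 4 membership stores; the names below are made up):

```python
# Placeholder groups: a group handle is usable before its membership is
# complete; finalize collects everything at the end of the run.
class GroupRegistry:
    def __init__(self):
        self._members: dict = {}

    def group(self, name: str) -> str:
        # Placeholder: register the group now, members come later.
        self._members.setdefault(name, set())
        return name

    def add_member(self, group: str, member: str):
        self._members.setdefault(group, set()).add(member)

    def finalize_groups(self) -> dict:
        # In the real driver, this creates outputs in the 4 membership stores.
        return {g: sorted(m) for g, m in self._members.items()}


reg = GroupRegistry()
g = reg.group('dataset-analysis')
reg.add_member(g, 'alice@example.org')
reg.add_member(g, 'bob@example.org')
assert reg.finalize_groups() == {
    'dataset-analysis': ['alice@example.org', 'bob@example.org']
}
```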

Abstraction

We want to effectively mirror our infrastructure across GCP and Azure. To reduce code duplication, we have a cloud abstraction, which provides an interface that each cloud implements to achieve the desired functionality.

This abstraction and our infra model were created with GCP in mind first, with Azure partially implemented later. There may be concepts in the abstraction that don't exist in one cloud, or that aren't reasonable to ask of this interface.

For this reason, and because we still primarily use GCP, there are places in each driver where we only create infrastructure on GCP, or where we have written cloud-specific implementations.
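The shape of the abstraction can be sketched as an abstract base class that each cloud implements; the method and class names below are illustrative, not the repository's actual interface:

```python
# Sketch of a cloud abstraction: one interface, one implementation per cloud,
# so the driver code is written once. Names are illustrative.
from abc import ABC, abstractmethod


class CloudInfraBase(ABC):
    """Interface each cloud implements to achieve the desired functionality."""

    @abstractmethod
    def create_bucket(self, name: str, **kwargs) -> str:
        ...


class GcpInfra(CloudInfraBase):
    def create_bucket(self, name: str, **kwargs) -> str:
        return f'gs://{name}'


class AzureInfra(CloudInfraBase):
    def create_bucket(self, name: str, **kwargs) -> str:
        # Partially implemented later: some GCP concepts may have no clean
        # Azure equivalent, so methods here can diverge or be stubbed.
        return f'az://{name}'


assert GcpInfra().create_bucket('main') == 'gs://main'
assert AzureInfra().create_bucket('main') == 'az://main'
```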

Plugins

There is sometimes behaviour that we want to make optional, or not declare in this repository. For that use case, we have CpgInfrastructurePlugin classes, which are exposed through Python entrypoints using the key "cpginfra.plugins".

Currently, BillingAggregator and MetamistInfra are the two plugins used in our deploy.

Internal datasets

This concept was designed to make it easier to add developers to internal Hail Batch projects and dataproc logs, to facilitate debugging.

To do this:

Infrastructure

Dataset infrastructure

See Reproducible analyses and Storage policies for more information.

Each dataset consists of a number of resources; some resources, like buckets and machine accounts, are grouped into the different namespaces:

Members have different roles within a dataset; those roles include:

Setup

In our production environment, we have a:

And some code that takes this format and transforms it into the required classes. We structure it this way to allow for easier code reviews and CODEOWNERS.

CPG setup

You can't deploy (up) the stack locally, as you won't have permissions, but you will be able to preview.

# install pulumi
brew install pulumi

# use our GCS bucket for state
pulumi login gs://cpg-pulumi-state/

# inside the cpg-infrastructure directory
virtualenv cpg-infra
source cpg-infra/bin/activate
pip install -e .

# our pulumi stack is fairly large, so we'll run in a non-interactive view
PULUMI_EXPERIMENTAL=true PULUMI_SKIP_CHECKPOINTS=true pulumi preview \
  --non-interactive --diff -p 20

Third party setup

Context

Date: August, 2022

The CPG’s current infrastructure has been in place for 2 years. With the addition of Azure alongside GCP, now is a good time to reconsider how we build certain infrastructure components, to future-proof them.

To manage infrastructure across GCP and Azure, as suggested by Greg Smith (Microsoft), we should write an abstraction on top of Pulumi for spinning up infrastructure in GCP and Azure without having to duplicate the “infrastructure design”.

Structure:

To develop, you can run the driver file directly; given a config TOML, it will print the infrastructure to the console.

Motivations

This abstraction is still trying to address a number of difficult problems:

Problems still to solve: