pangeo-forge / pangeo-forge-recipes

Python library for building Pangeo Forge recipes.
https://pangeo-forge.readthedocs.io/
Apache License 2.0

Handling Secrets #79

Open TomAugspurger opened 3 years ago

TomAugspurger commented 3 years ago

Just capturing a thought on how to handle secrets.

There are two main kinds:

  1. Bakery secrets: Things like "credentials to write to the {azure / aws / google} bucket"
  2. Recipe secrets: Things like "The password to access this password-protected HTTP server for this data" (https://github.com/pangeo-forge/pangeo-forge/issues/53)

The bakery secrets should never leave that bakery. In particular, things shouldn't be stored in GitHub and provided to the GitHub action kicking things off. That ensures that the maintainer of an AWS bakery doesn't have access to the Azure bakery's secrets.

Recipe-specific secrets will need to be handled separately. Since a recipe may run on multiple bakeries, these should probably be stored in GitHub and provided to the flow through the GitHub action (possibly set as an environment variable that's passed along to the worker executing the Prefect flow, and then looked up in the recipe). Since they're recipe-specific, they would ideally be stored in the repository's secrets (i.e. not the organization). The staged-recipes stuff slightly complicates this, since we need to ensure that the secret is copied over when the repository is created. I've never used it, but the GitHub API apparently supports working with GitHub secrets: https://docs.github.com/en/rest/reference/actions#secrets
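
A minimal sketch of the environment-variable lookup described above, assuming the GitHub action exports the secret to the worker under a hypothetical name such as `RECIPE_HTTP_PASSWORD`. Neither the variable name nor the helper below is part of pangeo-forge-recipes; they only illustrate the "set as an environment variable, then looked up in the recipe" idea.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a recipe secret was not passed along to the worker."""


def get_recipe_secret(name, environ=None):
    """Look up a recipe secret by environment variable name.

    `name` is a hypothetical variable name chosen by the recipe author.
    `environ` defaults to os.environ; a plain dict can be passed for testing.
    """
    env = os.environ if environ is None else environ
    value = env.get(name)
    if not value:
        # Fail early, and never echo the secret value in the error message.
        raise MissingSecretError(f"secret {name!r} is not set on this worker")
    return value
```

A recipe could then call something like `get_recipe_secret("RECIPE_HTTP_PASSWORD")` at the point where it opens the protected source, so the value never has to be written into the recipe file itself.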

rabernat commented 3 years ago

Another option could be to store the secrets directly in the recipe repo but encrypted with a pangeo-forge public key, with the private key stored as an org-level secret. This still requires the recipe maintainer to trust the org with their secrets, but it might make things easier to manage.

scottyhq commented 3 years ago

Sharing some notes on a recent foray into managing both types of secrets in GitHub: 1) cloud-account keys as GitHub secrets and 2) configuration secrets in encrypted text files with Mozilla SOPS.

I'm sure there are a million different ways to configure this stuff, but I've been impressed with using Mozilla SOPS to keep encrypted data on GitHub. What's nice about it is that you can still see the structure of the config, rather than just a binary file: https://github.com/uwhackweek/jupyterhub-deploy/blob/main/hub/secrets.yaml

We recently set up a template repository that we want to use for quickly setting up JupyterHub (but this could be other things running on k8s). Step 1 is setting up infrastructure on the cloud, and for that we use Terraform to:

  1. create a machine user on the cloud account with CLI access keys
  2. give that machine user only the permission to assume a role
  3. limit that role to certain actions

So in the worst-case scenario, if the access keys leak, it is easy to revoke and rotate them without much disruption. Any GitHub Actions CI operations run with that machine user assuming the role: https://github.com/uwhackweek/jupyterhub-deploy/blob/f6c2fc17cf16aeccf9e959bdfe3eedf8ed363129/.github/workflows/Terraform.yml#L27-L34

Part 2: SOPS needs a key to encrypt things, so we use AWS KMS for that, and our machine user role has permission to access the key from any machine to decrypt config files on GitHub. This is especially simple to implement for Helm values.yaml files with the helm-secrets plugin, but should work for any JSON or YAML config: https://github.com/uwhackweek/jupyterhub-deploy/blob/f6c2fc17cf16aeccf9e959bdfe3eedf8ed363129/.github/workflows/Helm.yml#L42-L65

pangeo-forge has the added complications of multiple clouds and permissions boundaries, but hopefully these notes and examples are helpful or give ideas for alternative solutions... or maybe someone will recognize gaping problems in what we're doing and let us know, because we're by no means security experts ;)

cisaacstern commented 2 years ago

Since @yuvipanda has been thinking about this, here are some of the current staged-recipes PRs blocked for lack of a recipe-level credentials feature: https://github.com/pangeo-forge/staged-recipes/labels/blocked%3Acredentials

yuvipanda commented 2 years ago

Thanks for the pointers, @cisaacstern and @sharkinsspatial!

Thinking about EarthData specifically, here are a few questions I have:

  1. What is the stated purpose of these credentials? I had assumed they are primarily for accounting and rate limiting, so one user doesn't eat up all capacity. Does that seem right?
  2. What happens when (not if) these credentials leak? This isn't necessarily because of a failure on pangeo-forge's part, but just a fact of life now (https://haveibeenpwned.com/). How would they be replaced / regenerated?
  3. What are the consequences of, in many cases, effectively sharing with an arbitrary number of people a password to a service whose specific terms you had personally agreed to?

IMO, these make me feel that the thing we are authenticating is the bakery, and not the user contributing the recipe. This also helps with what I think of as the core technical reason (rate limiting), as it identifies which entity is making the request.

So my suggestion is:

  1. We develop methods for recipes to declare what kind of credentials they need
  2. We provide ways for bakeries to declare what kind of credentials they have provisioned
  3. We mix and match these, so appropriate recipes get sent to appropriate places.
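
The three steps above can be sketched as a simple set comparison. This is only an illustration under stated assumptions: the credential kinds (`earthdata`, `osn`) and bakery names are hypothetical, and nothing here reflects an existing pangeo-forge API.

```python
def eligible_bakeries(recipe_needs, bakery_provisions):
    """Return, sorted by name, the bakeries holding every credential a recipe needs."""
    needed = set(recipe_needs)
    return sorted(
        name
        for name, provisioned in bakery_provisions.items()
        if needed <= set(provisioned)  # subset test: all needs are provisioned
    )


# Hypothetical bakery names and credential kinds, for illustration only.
bakeries = {
    "aws-bakery": {"earthdata"},
    "azure-bakery": set(),
    "gcp-bakery": {"earthdata", "osn"},
}

print(eligible_bakeries({"earthdata"}, bakeries))  # ['aws-bakery', 'gcp-bakery']
print(eligible_bakeries(set(), bakeries))  # all three bakeries qualify
```

A recipe that declares no credentials matches every bakery, while one with a specialized need is routed only to bakeries that have provisioned it.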

If some recipes need more specialized access, their authors would need to work with the appropriate bakery that has it. This would give us a systematic way of making sure we can rotate credentials, deal with password leaks, and have an actual contact point who can help debug issues if needed. We already do this for a lot of other credentials - dataflow accounts, OSN keys, etc. If our OSN keys somehow get compromised, there is a process for getting that resolved. IMO Earthdata (and similar) should be handled similarly.

What do folks think of this? I can prototype how this might look in meta.yaml maybe.

cisaacstern commented 2 years ago

  1. We develop methods for recipes to declare what kind of credentials they need
  2. We provide ways for bakeries to declare what kind of credentials they have provisioned
  3. we mix / match this, so appropriate recipes get sent to appropriate places.

Agree 100%

yuvipanda commented 2 years ago

Not at all sure exactly how this would look, but here are some possible ways it could look for recipe authors: https://github.com/yuvipanda/staged-recipes/pull/1

cisaacstern commented 2 years ago

That looks great. As I just noted to you offline, this makes me think we really need a JSON Schema for meta.yaml. (Or some method for tracking what fields are valid there.) AFAICT, we don't have an open issue for meta.yaml schema but I might be forgetting something.
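
To make the idea of tracking valid fields concrete, here is a rough sketch of a schema fragment written as a Python dict in JSON Schema style. The field names are hypothetical and are not confirmed meta.yaml fields. To stay dependency-free, the checker handles only the few keywords used here (`required`, `properties`, `type`, `additionalProperties`); a real setup would presumably use a full JSON Schema validator instead.

```python
# Hypothetical schema fragment; these field names are assumptions for illustration.
META_SCHEMA = {
    "type": "object",
    "required": ["title", "recipes"],
    "properties": {
        "title": {"type": "string"},
        "recipes": {"type": "array"},
        "credentials": {"type": "array"},
    },
    "additionalProperties": False,
}

# Maps the JSON Schema type names used above to Python types.
_JSON_TYPES = {"string": str, "array": list, "object": dict}


def check_meta(meta, schema=META_SCHEMA):
    """Return a list of problems found in `meta` (an already-parsed mapping)."""
    problems = []
    for key in schema.get("required", []):
        if key not in meta:
            problems.append(f"missing required field: {key}")
    properties = schema.get("properties", {})
    for key, value in meta.items():
        if key not in properties:
            # Unknown keys are only an error when the schema forbids them.
            if schema.get("additionalProperties", True) is False:
                problems.append(f"unknown field: {key}")
            continue
        type_name = properties[key]["type"]
        if not isinstance(value, _JSON_TYPES[type_name]):
            problems.append(f"field {key} should be of type {type_name}")
    return problems
```

With `additionalProperties` set to false, a misspelled key is reported as an unknown field rather than silently ignored, which is the main benefit of tracking valid fields in one place.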

yuvipanda commented 2 years ago

@cisaacstern yes 100% on need for JSON Schema - I think JSON Schema is the right way to go :)