Ardiea commented 9 months ago

Branch https://github.com/mitodl/ol-infrastructure/tree/md/ecs_init

- [x] Finalize traefik ingress for the cluster
  - [x] Traefik starts on every node and remains running.
  - [x] Any loadbalancer address -> traefik.
- [x] Finalize a vault-consul-template sidecar for traefik to load *.odl.mit.edu certificate.
  - [x] Probably needed to get traefik to start and remain running.
  - [x] Loads the certificate+key into a shared volume on each instance of traefik.
  - [x] Needs to use AWS IAM permissions that are pinned to the task execution role in vault. See: https://developer.hashicorp.com/vault/tutorials/vault-agent/agent-aws-ecs
- [ ] Figure out waypoint or some other platform for managing application deployments into ecs.
  - [ ] There are more subtasks here
- [ ] Deploy an application into ECS using whatever platform.

Ardiea commented 8 months ago

Gotchas with ECS / Traefik / Vault

The Vault image comes with an env var, VAULT_LOCAL_CONFIG. If you populate it with JSON, the docker-entrypoint.sh script will dump it into $VAULT_CONFIG_DIR/local.json for you. Easy peasy.
Traefik default certificates via the file provider (which is the only way to get them) are tricky. What we do is make a fake template in the vault configuration that creates the tls.yaml file defining the dynamic config needed. This fake template, which contains no actual secrets, is rendered at the same time as the two real templates for the key+certificate.
- Ultimately, this doesn't matter too much but it is a good proof-of-concept showing how to get a secret from vault and render it to a file. All TLS termination from the user perspective will happen at the LB.
Traefik ECS provider is neat but by default looks at ALL ECS clusters, which is probably not ideal. We let it do that with --providers.ecs.autoDiscoverClusters=True but we don't want it to do any discovery within those clusters besides the one it is running in. Restrict that with --providers.ecs.clusters={cluster_name} where {cluster_name} is the name of the current cluster.
Traefik ECS provider also expects any / all containers in the cluster to be things that you want to route to, which, again, is not the case, so set --providers.ecs.exposedByDefault=False. Now containers we route to will require special labels/annotations in order for traefik to discover them and set up configurations.
File permissions are weird and I still don't entirely understand them. I was able to make it work by creating a shared volume between vault + traefik and then, in the vault container, mounting that shared volume at /vault/file, which is different (see https://github.com/hashicorp/vault/blob/main/Dockerfile#L139). Have vault render its secrets into this directory. Mount the same shared volume in traefik at /etc/traefik/tls and you can load the certificates and the dynamic config needed.
Docker healthchecks seem to be helpful in ECS. Generally those are done with shell commands inside the executing containers themselves. Vault is easy and we use what Hashicorp suggests: vault status. Traefik requires --ping to be specified as a command line argument in BOTH the running container's command list and in the healthcheck, but in the healthcheck only AFTER the word healthcheck, so: traefik healthcheck --ping. Put it before and it errors in a most unhelpful way.
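
A rough sketch of how the Traefik flags and the two healthchecks above might look as ECS container definitions, written as plain Python dicts. The function names, container names, the exec-style "CMD" prefix, and the interval/timeout/retries values are assumptions for illustration, not what the branch actually contains. The flags mirror the ones described above; note that Traefik's ECS provider documentation describes the clusters option as ignored when autoDiscoverClusters is true, so that combination is worth verifying.

```python
def health_check(argv: list[str]) -> dict:
    """Build an ECS healthCheck block that runs argv inside the container."""
    # Timing values are placeholders, not tuned numbers.
    return {"command": ["CMD", *argv], "interval": 30, "timeout": 5, "retries": 3}


def traefik_container(cluster_name: str) -> dict:
    """Hypothetical Traefik container definition using the flags from the thread."""
    return {
        "name": "traefik",
        "image": "traefik:v2.10.4",
        "command": [
            # --ping must be enabled on the running process...
            "--ping",
            "--providers.ecs.autoDiscoverClusters=true",
            f"--providers.ecs.clusters={cluster_name}",
            "--providers.ecs.exposedByDefault=false",
        ],
        # ...and repeated in the healthcheck, AFTER the word "healthcheck".
        "healthCheck": health_check(["traefik", "healthcheck", "--ping"]),
    }


def vault_agent_container() -> dict:
    """Hypothetical Vault sidecar definition with the suggested healthcheck."""
    return {
        "name": "traefik-vault-agent",
        "image": "hashicorp/vault:latest",
        "healthCheck": health_check(["vault", "status"]),
    }
```

Keeping the healthcheck argv in one helper makes the ordering of healthcheck and --ping hard to get wrong by accident.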

root@ip-172-17-1-56:/etc/docker# docker ps
CONTAINER ID   IMAGE                            COMMAND                  CREATED             STATUS                       PORTS     NAMES
fed5626c3518   traefik:v2.10.4                  "/entrypoint.sh --ap…"   About an hour ago   Up About an hour (healthy)             ecs-data-ci-traefik-46-data-ci-traefik-c48599c2e2becdaf6300
4295b1aca17a   hashicorp/vault:latest           "docker-entrypoint.s…"   About an hour ago   Up About an hour (healthy)             ecs-data-ci-traefik-46-traefik-vault-agent-f4e89caaa8fa95e6cf01
27c6ff18d366   amazon/amazon-ecs-agent:latest   "/agent"                 3 weeks ago         Up 3 weeks (healthy)                   ecs-agent
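
For reference, the "fake template" trick and the two mount points could be sketched roughly as below: a Python helper that builds the JSON for VAULT_LOCAL_CONFIG with three templates. The secret path, the field names (cert, key), the output file names, and the function names are all hypothetical, the auth configuration is left out entirely, and the result has not been validated against a real Vault agent. The point is only that tls.yaml is written under /vault/file (the vault-side mount) while the paths inside it refer to /etc/traefik/tls (the traefik-side mount of the same volume).

```python
import json

VAULT_DIR = "/vault/file"          # where the vault container mounts the shared volume
TRAEFIK_DIR = "/etc/traefik/tls"   # where traefik mounts the same volume


def tls_yaml() -> str:
    """Dynamic config for Traefik's file provider; contains no secrets."""
    return (
        "tls:\n"
        "  stores:\n"
        "    default:\n"
        "      defaultCertificate:\n"
        f"        certFile: {TRAEFIK_DIR}/cert.pem\n"
        f"        keyFile: {TRAEFIK_DIR}/key.pem\n"
    )


def secret_field(secret_path: str, field: str) -> str:
    """Template body that renders one field of a secret (field names assumed)."""
    return '{{ with secret "%s" }}{{ .Data.%s }}{{ end }}' % (secret_path, field)


def vault_local_config(secret_path: str) -> str:
    """Return a JSON string suitable for the VAULT_LOCAL_CONFIG env var."""
    config = {
        "template": [
            # The "fake" template: plain text, rendered alongside the real ones.
            {"contents": tls_yaml(), "destination": f"{VAULT_DIR}/tls.yaml"},
            {"contents": secret_field(secret_path, "cert"),
             "destination": f"{VAULT_DIR}/cert.pem"},
            {"contents": secret_field(secret_path, "key"),
             "destination": f"{VAULT_DIR}/key.pem"},
        ]
    }
    return json.dumps(config)
```

The returned string would then be set as the value of VAULT_LOCAL_CONFIG in the sidecar's environment.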

Ardiea commented 8 months ago

Outstanding Issue - Environment Variables

So, there is one outstanding issue at the moment where I'm struggling to find the best approach, and that is environment variables. ECS offers two ways to do env vars documented here. There is an extension/exception to that for secrets using SecretsManager but it isn't that interesting because we don't use that.

So, from the two provided methods we have the following.

List keys + values out individually for each env var inside the task.
- Pro: Pretty straight-forward.
- Con: Locked into static secrets at pulumi-run-time. Lose a lot of flexibility that comes with vault + consul for populating a lot of the more interesting bits of this config.
- Con: Makes the task definition big and ungainly.
Populate a .env file and stuff it some place safe in S3.
- Pro: Pretty straight-forward.
- Con: Comes with a bunch of IAM foolishness to keep it secret (keep it secret, keep it safe).
- Con: Locked into static secrets at pulumi-run-time. Lose a lot of flexibility that comes with vault + consul for populating a lot of the more interesting bits of this config.
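
To make the two options concrete, here is a minimal sketch of the container-definition fragments each one produces, as plain Python dicts. The helper names, bucket, and key are made up for illustration; the environment and environmentFiles shapes follow the ECS container definition format, where the S3 object is referenced by ARN and must have a .env extension.

```python
def inline_environment(values: dict[str, str]) -> dict:
    """Option 1: every key/value listed directly in the task definition."""
    return {
        "environment": [
            {"name": name, "value": value} for name, value in sorted(values.items())
        ]
    }


def s3_environment_file(bucket: str, key: str) -> dict:
    """Option 2: point the container at a .env object stored in S3."""
    if not key.endswith(".env"):
        raise ValueError("ECS expects the environment file to end in .env")
    return {
        "environmentFiles": [
            {"type": "s3", "value": f"arn:aws:s3:::{bucket}/{key}"}
        ]
    }
```

Either way the values are fixed when the task definition or the S3 object is written, which is the static-secrets limitation listed above.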

Notably absent from that list is just a .env file on the local system. Probably because the underlying EC2 instances are supposed to be livestock, not pets. And livestock doesn't have any local files.

So I'm thinking something a little more flexible but probably more janky.

Follow the already defined and explored pattern of vault/consul-template sidecar to render a file which is essentially our existing .env file for docker compose. This file is rendered into a shared volume.
In the actual application containers, we add an entrypoint.sh that opens that file, loops through it and exports every key into the environment and then launches the app.
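
The entrypoint idea above could be sketched like this in Python (a shell script would work just as well). The function names and the very simple KEY=VALUE parsing rules are assumptions; a real docker compose .env file may need more careful handling of quoting and escapes.

```python
import os


def parse_env_file(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks, comments, and junk."""
    env: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        env[key.strip()] = value
    return env


def launch(env_path: str, argv: list[str]) -> None:
    """Load the rendered file, then replace this process with the app."""
    with open(env_path, encoding="utf-8") as handle:
        rendered = parse_env_file(handle.read())
    # Rendered values win over anything already set in the container.
    os.execvpe(argv[0], argv, {**os.environ, **rendered})
```

An entrypoint would call something like launch("/shared/.env", ["my-app", "--serve"]), where both the path and the command are placeholders. Using exec means the app becomes the container's main process and receives signals directly.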

Doesn't envconsul do this already? Yeah, probably, but it is very particular about the keynames in consul and vault and re-organizing / cleaning those superfund sites up is outside the scope of this exploration.

blarghmatey commented 8 months ago

consul-template itself can also be used for spawning the process after rendering the config. It might make sense to use that as the entrypoint? https://github.com/hashicorp/consul-template/blob/main/docs/modes.md#exec-mode
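
As a sketch of what that might look like, the container command could be built along these lines. The helper name and the example paths are hypothetical; -template (with a source:destination pair) and -exec are consul-template's documented flags for rendering a file and then spawning a child process.

```python
def consul_template_argv(template: str, destination: str, app_command: str) -> list[str]:
    """Build a consul-template command line that renders once, then runs the app."""
    return [
        "consul-template",
        "-template", f"{template}:{destination}",
        # exec mode: consul-template starts and supervises this command.
        "-exec", app_command,
    ]
```

For example, consul_template_argv("/templates/env.ctmpl", "/shared/.env", "my-app --serve") gives the argv to use as the container's command, with the app still responsible for reading the rendered file.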

Ardiea commented 8 months ago

Configuration Challenges

There is nothing analogous to a k8s configMap or a docker config in ECS which is presenting some issues. This SO comment covers basically the only options for getting files into containers with ECS: https://stackoverflow.com/a/71704130

Consider the following volume mount list for the nginx sidecar in OVS:

https://github.com/mitodl/ol-infrastructure/blob/main/src/bilder/images/odl_video_service/files/docker-compose.yaml.tmpl

Some of these files are static and unchanging, others require interpolation from vault, and some are rendered entirely from vault. Each of these situations requires a slightly different approach in order to get the configuration where it needs to be in the container. And nearly all of those approaches are going to be complicated and janky. Ultimately this is going to lead to an increase in complexity and boilerplate, which is not what we're looking for at this time.

mitodl / ol-infrastructure

Outstanding ECS Cluster items #1809
