mozilla-services / Dockerflow

Cloud Services Dockerflow specification

Where should the container build and test happen? #8

Closed mostlygeek closed 8 years ago

mostlygeek commented 8 years ago

The current recommendation is to build and test the container in Circle CI. The motivation for this is:

However, some questions were brought up around:

@oremj @ckolos @jvehent please add anything I've missed

mostlygeek commented 8 years ago

ability to re-build a container without dev support

When using CircleCI, we can trigger a rebuild with a single button click. I recently did this for tokenserver. The base container (python:2.7.11) had been rebuilt, likely due to security patches. Clicking rebuild built an updated container and replaced the tagged version on Dockerhub.

Since we keep a mirror of Docker images, we would still need to trigger a new stack deploy. The story around this isn't good for containers or RPMs at this point.

jvehent commented 8 years ago

My main concern is vulnerability management: upgrading all of our stacks is currently costly and painful, but also needs to be done every quarter on average (when a major vuln comes out). I'd like to make sure we don't make it more complex than it currently is, and even make it easier if possible.

The workflow, as I understand it so far, is that developers build containers in a CI they control, upload them to some Dockerhub, and we mirror that hub for stage and production deploys. If a container needs to be updated, we ask the dev team to trigger a new build of the container, which we can then mirror. My concerns with this flow are:

We can prevent both issues by continuously rebuilding and redeploying containers, like we do with AMIs. Every month, all application containers should be rebuilt on the latest base and redeployed. This guarantees containers are always up to date on security patches, and can always be rebuilt. On active projects, this process should not need to be triggered, because regular builds & deploys happen as part of ongoing feature releases. But on stale projects, no longer driven by feature changes, I think ops should control regular builds & deploys.

What I propose is:

  1. developers must provide a working container (let's call it Cdev). Cdev is stored in some developer-controlled Dockerhub.
  2. Cdev is deployed in the dev environment. ops are not involved in that process.
  3. ops use the same scripts used to build Cdev in the ops CI environment (Jenkins) to build a container called Cops. ops only build containers and don't run complex integration tests (maybe unit tests). ops assume that the content of the master branch of a project has passed sufficient testing to be built. Cops is stored in cloud-ops' private Dockerhub.
  4. Cops is deployed in the stage and prod environments.
  5. If Cops is ever older than 30 days, an automatic rebuild is triggered in ops CI, stage is redeployed and QA'd, then prod is redeployed. This process triggers even if no feature change was pushed to the upstream project.
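
To make step 5 concrete, here is a minimal sketch of the age check an ops cron job could run before triggering a rebuild. The repository name is hypothetical, and it assumes Docker Hub's public v2 tags endpoint and its last_updated field; adapt it to wherever Cops actually lives.

```python
# Sketch of the step-5 age check (hypothetical repo name; assumes Docker
# Hub's public v2 "tags" endpoint, which exposes a last_updated timestamp).
from datetime import datetime, timedelta, timezone

import requests

REPO = "cloudops/someservice"   # hypothetical Cops repository
TAG = "latest"
MAX_AGE = timedelta(days=30)


def image_age(repo, tag):
    """Return how long ago the given Docker Hub tag was last pushed."""
    url = "https://hub.docker.com/v2/repositories/{}/tags/{}/".format(repo, tag)
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    # last_updated looks like "2016-05-01T12:34:56.789012Z"
    pushed = datetime.strptime(resp.json()["last_updated"],
                               "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)
    return datetime.now(timezone.utc) - pushed


if __name__ == "__main__":
    age = image_age(REPO, TAG)
    if age > MAX_AGE:
        print("{}:{} is {} days old -> trigger an ops CI rebuild".format(REPO, TAG, age.days))
    else:
        print("{}:{} is {} days old -> nothing to do".format(REPO, TAG, age.days))
```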
oremj commented 8 years ago

developers must provide a working container (let's call that Cdev). Cdev is stored in some developer controlled dockerhub. Cdev is deployed in the dev environment. ops are not involved in that process.

We are involved in deploying the rest of the "dev" pipeline, so I'd argue that, if we go with your proposal, we should build the "dev" containers as well. "dev" in my case means the environment that is kept up to date with the primary branch, usually origin/master.

jvehent commented 8 years ago

That makes sense. The point of asking devs to build Cdev is to require a repeatable container build process as part of the application. We don't need to use Cdev; we just need to make sure it exists, as proof that we can build containers ourselves.

mostlygeek commented 8 years ago

We can prevent both issues by continuously rebuilding and redeploying containers, like we do with AMIs.

I don't think that's right because we don't have a 'no access' problem and it doesn't solve the 'not maintained software' issue.

The ability to build/push updated containers is not a problem with CircleCI. Their permission model is: if you can push to a GitHub repo, you can manage its CircleCI project. Yesterday I triggered a rebuild on tokenserver because its base container changed. No developer was required. I just pushed a button and waited.

Also, yesterday I security patched browserid-verifier's container. Rebuilding the same container, RPM, or AMI would not have fixed anything; a code change was required. Once the PR was merged I just waited until CircleCI was done.

Fundamentally, fixing vulnerabilities begins with fixing code. Fixing services means deploying that fixed code. Deployment is the last step. Dockerflow is meant to make part of that last step easier; everything before it is out of Dockerflow's scope.

I do think it is important to put together a requirements list for evaluating appropriate CI solutions. The top of the list should be:

jvehent commented 8 years ago

We currently don't automatically redeploy services every 30 days, but we should. Doesn't the dependency on CircleCI augment the complexity of redeploying, as opposed to having everything in a single Jenkins build/deploy/promote pipeline?

jvehent commented 8 years ago

To answer my own question: no, CircleCI doesn't augment deployment complexity, provided we figure out a way to trigger a CircleCI job from Jenkins through its API. That also means devs should only use CircleCI. I don't know if that was a requirement previously...
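
For example, the Jenkins side of that trigger could be as small as the sketch below. The project name is hypothetical, and it assumes CircleCI's v1.1 "new build on a branch" endpoint; check the current API docs before relying on it.

```python
# Sketch of a Jenkins-triggered CircleCI rebuild (hypothetical project name;
# assumes the CircleCI v1.1 "new build on a branch" endpoint).
import os

import requests

CIRCLE_TOKEN = os.environ["CIRCLE_TOKEN"]    # stored as a Jenkins credential
PROJECT = "mozilla-services/someservice"     # hypothetical GitHub project
BRANCH = "master"


def trigger_rebuild(project, branch):
    """Ask CircleCI to run a new build of the given branch and return its URL."""
    url = "https://circleci.com/api/v1.1/project/github/{}/tree/{}".format(project, branch)
    resp = requests.post(url, params={"circle-token": CIRCLE_TOKEN}, timeout=10)
    resp.raise_for_status()
    return resp.json().get("build_url", "(no build_url in response)")


if __name__ == "__main__":
    print("triggered:", trigger_rebuild(PROJECT, BRANCH))
```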

oremj commented 8 years ago

Using CircleCI is okay, if it reduces complexity. Initially I heard, "devs give us a container and we deploy". If the flow goes through CircleCI, that means we can trigger rebuilds there, and the build instructions will be clearly documented in the circle config.

Concerns:

mostlygeek commented 8 years ago

That also means devs should only use CircleCI. I don't know if that was a requirement previously...

That's not a requirement; it's not even a recommendation. I was hoping the choice of CI wouldn't matter, but as we add tooling and define CI requirements, standardizing on Circle does make sense. I'm gathering feedback on this currently and will see how it turns out.

To address some of @oremj's concerns:

we wouldn't be able to deploy if Circle were down

Deployment wouldn't be affected by Circle being down, but build, test, and push to Dockerhub would be. We would likely weigh the risk of waiting vs. doing the build chain manually.
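
For what "doing the build chain manually" might look like, here is a rough sketch that mirrors what a typical circle.yml would do: build, test, push. The image name and test entrypoint are hypothetical; the real steps live in each project's CI config.

```python
# Manual fallback for when CircleCI is down: run the same build/test/push
# chain by hand. Image name and test entrypoint are hypothetical.
import subprocess

IMAGE = "mozilla/someservice:latest"   # hypothetical image tag


def run(cmd):
    print("+ " + " ".join(cmd))
    subprocess.check_call(cmd)


if __name__ == "__main__":
    run(["docker", "build", "-t", IMAGE, "."])     # build the container
    run(["docker", "run", "--rm", IMAGE, "test"])  # hypothetical test entrypoint
    run(["docker", "push", IMAGE])                 # push to Docker Hub
```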

we have no control over build performance or queuing

It is $50/mo for each additional container. Each container gives us an extra parallel build. Open source projects get 4 for free. We don't have any control over how fast a container is though.

mostlygeek commented 8 years ago

To address @jvehent's question about augmenting complexity:

We have dozens of projects and they all have different requirements. A pipeline that they all fit through will be complex. I think the best way to keep complexity manageable is to use loosely coupled components.

Dockerflow is a loosely coupled component. Its inputs are the source and a CI control script (circle.yml). Its output is a container on Dockerhub.

Deciding and triggering a rebuild+test of a container should be a feature and the responsibility of something else. That something else shouldn't even be the pipeline tool.

jvehent commented 8 years ago

I think there is a transfer of responsibility happening in the Dockerflow: devs are responsible for building, not only web applications, but system containers that we run in production. The transfer implies they are responsible for keeping those containers up-to-date and their build pipeline operational, both of which may be new to them.

How do we make sure devs are ready to take this on? The loosely coupled approach is good for flexibility, but maybe a list of minimal requirements, aka "you must be this tall to ride", is needed here.

mostlygeek commented 8 years ago

So I've been thinking and experimenting a bit. To close this issue I will write a requirements document for automated build+test+push CI.

The Balrog project's Dockerflow implementation uses TaskCluster, since that is where all Release Engineering builds happen. They've run into a snag, but they're happy to fix it themselves.

So it seems that a starting point for the requirements list for an automated build+test+push would be: