Implement production canary releases

nkinkade commented 5 years ago

We currently only have sandbox -> staging -> production. For now, staging only gets 0.001 of production traffic, which isn't really enough to compare a staging release against an existing production node and be sure that a deployment is working as expected. We need a true production canary release cycle in which changes made to staging are deployed to some subset of production machines for a period.

nkinkade commented 5 years ago

It turns out that implementing this with versioned DaemonSets is going to require a good deal more thought. Implementing versioned DaemonSets is pretty easy, as PR #276 demonstrates, but that PR does not take deployment details into account. Specifically, these questions need to be answered, and perhaps others as well:

How do we properly version any experiment DaemonSet that we want versioned? Relying on a GitHub repository tag to version an NDT DaemonSet, for example, does not take into account that the repository may change and be tagged for all sorts of code changes that have nothing to do with the NDT DaemonSet. Yet the NDT DaemonSet would get a new version anyway, and would require a redeployment.
How do controlled deployments happen when deploying becomes a matter of labeling nodes rather than relying on the built-in k8s RollingUpdate mechanism? It would likely require either some hackish script that an operator would run, or some service we deploy to the cluster that manages node labels to do a controlled, incremental rollout. The former is not what we want, and the latter is yet to be written.
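
To make the second question concrete, here is a minimal sketch of the planning step that either a script or an in-cluster service would need: splitting the nodes into incremental batches and producing the label change for each one. Nothing here comes from PR #276. The label key `example.org/ndt-version`, the value `v2`, and the 5% / 25% / 100% stages are all hypothetical, and the sketch only prints `kubectl label` commands rather than talking to the cluster.

```python
def rollout_batches(nodes, percents=(5, 25, 100)):
    """Split nodes into incremental batches.

    Each entry in `percents` is the cumulative percentage of nodes that
    should carry the new version label once that stage is complete.
    """
    ordered = sorted(nodes)
    batches = []
    done = 0
    for pct in percents:
        # Integer ceiling of len(ordered) * pct / 100, capped at the node count.
        target = min(len(ordered), -(-len(ordered) * pct // 100))
        if target > done:
            batches.append(ordered[done:target])
            done = target
    return batches


def label_commands(batch, key="example.org/ndt-version", value="v2"):
    """Build one `kubectl label` command per node (key and value are hypothetical)."""
    return [f"kubectl label node {name} {key}={value} --overwrite" for name in batch]


if __name__ == "__main__":
    nodes = [f"node{i:02d}" for i in range(20)]
    for stage, batch in enumerate(rollout_batches(nodes), start=1):
        print(f"# stage {stage}: {len(batch)} node(s)")
        for command in label_commands(batch):
            print(command)
```

What this leaves out is exactly the hard part raised above: something still has to pause between stages, check the health of the newly labeled nodes, and decide whether to continue or roll back.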

nkinkade commented 5 years ago

The thing to do in the short term may be to implement something along the lines of our original idea of just having two DaemonSets for NDT, a production one and a canary one. But even this needs more consideration to avoid some of the same pitfalls as above.
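
As a rough illustration of the two-DaemonSet idea, the sketch below builds a production and a canary manifest as plain Python dicts that differ only in name, image, and node selector. It assumes nodes are split by a hypothetical label `example.org/ndt-track`; the image names and the `workload` / `track` pod labels are placeholders too, not what we actually deploy.

```python
import json


def ndt_daemonset(track, image):
    """Return a minimal DaemonSet manifest for one track ("production" or "canary")."""
    pod_labels = {"workload": "ndt", "track": track}
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": f"ndt-{track}"},
        "spec": {
            "selector": {"matchLabels": pod_labels},
            "template": {
                "metadata": {"labels": pod_labels},
                "spec": {
                    # Pods are scheduled only on nodes carrying the matching
                    # track label (hypothetical label key).
                    "nodeSelector": {"example.org/ndt-track": track},
                    "containers": [{"name": "ndt", "image": image}],
                },
            },
        },
    }


if __name__ == "__main__":
    # Placeholder image references, for illustration only.
    production = ndt_daemonset("production", "example/ndt:stable")
    canary = ndt_daemonset("canary", "example/ndt:candidate")
    print(json.dumps([production, canary], indent=2))
```

Because each node carries exactly one value for the track label, a node runs either the production or the canary pod, never both. Moving a node between tracks is still a manual relabel, so the rollout-control question above remains open.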

nkinkade commented 5 years ago

For now, it has been decided to implement this by directing more production traffic to staging: after a merge to master, we manually raise the ReverseProxyProbability in mlab-ns. Once we are confident that the release is fine, we will reset ReverseProxyProbability and then release to production.
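
To show why raising the probability matters, here is a small sketch of probability-based routing. It is not the mlab-ns code. It assumes ReverseProxyProbability behaves as a value between 0 and 1 that is applied independently to each request, and the function names are made up for illustration.

```python
import random


def pick_backend(probability, rng=random.random):
    """Send a request to staging with the given probability, else to production."""
    return "staging" if rng() < probability else "production"


def expected_staging_requests(total_requests, probability):
    """Expected number of requests that reach staging under this model."""
    return total_requests * probability


if __name__ == "__main__":
    for probability in (0.001, 0.05):
        expected = expected_staging_requests(1_000_000, probability)
        print(f"probability={probability}: about {expected:.0f} of 1,000,000 requests hit staging")
```

Under this model, the current 0.001 setting sends roughly one request in a thousand to staging, which is why a temporary increase is needed before a release can be compared against production in a reasonable amount of time.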

m-lab / k8s-support

Implement production canary releases #268