Recreate-all-VMs during apply-changes

aegershman commented 5 years ago

In order to repave all the VMs associated to a deployment (tile) without having to configure the bosh director separately, om should allow a recreate param to be passed on apply-changes.

background

We want to repave the VMs in our production environment every 7 days, regardless of if there is a new stemcell release or not.

However, you currently have to toggle the "recreate all VMs" flag within the bosh director tile in order to turn this feature on. And once the next apply-changes has run, the "recreate all VMs" toggle gets turned off, and that reset is itself registered as a "pending change" for the next apply.

The ability to recreate VMs would be better on a deployment-by-deployment (tile-by-tile) basis.

"shouldn't this be configured in the BOSH director tile before applying those changes?"

It's true that there is a "recreate all VMs" toggle in the director tile, but you shouldn't have to configure the director as part of the pipeline before every apply-changes for each product.

There's an analogy to the bosh-cli, where when deploying a single deployment, you can pass --recreate to tell the director to recreate all the VMs within that deployment.

How could this be implemented using the current opsmgr API?

Without any changes to the upstream opsmgr API, this would probably require om to do a configure-product on the bosh director tile under the covers to set the recreate all VMs flag? And we'd still be left in a situation where, after the deployment has finished, there would be pending changes to have the bosh director unset "recreate all VMs", since that happens automatically. Not sure if this would be problematic or not. I think it would require an opsmgr API change, but in general, the decision to recreate-all-VMs should likely be configurable on a deployment-by-deployment basis.

Thoughts?

kcboyle commented 5 years ago

This seems interesting! (and something we could maybe bring up with the opsman team).

Could I rephrase and say that you would like a recreate-all-vms to be configurable like opsmanager allows you to do selective deploys?

Digging in a little deeper, what is the use case for recreating the vms every week? What problem are we solving with the recreate a single time (in between cves)?

aegershman commented 5 years ago

I think that's a fair rephrasing. Thoughts @xyloman ?

The use case for recreating VMs every week is twofold:

VMs should be rotated to maintain an expected cadence and comfort. When we do take new stemcells/products which require VM rotation, it shouldn't be anything "special" or something that incites fear. VMs coming up and down should be the expectation. We tell platform consumers that their apps must be cloud-native and resilient to VM rotation, and I want this to be taken seriously. But more importantly--
Recreating VMs is an important security practice. I'm going to defer to this article by Justin Smith called "The Three Rs of Enterprise Security: Rotate, Repave, and Repair" for more information on the security rationale.

xyloman commented 5 years ago

@aegershman this is exactly my thought.

@kcboyle it could be a separate task; however, the more I think about it, the more I would like to have it in something like the env.yml. If we could put the flag there, it:

keeps the pipeline stateless.
avoids the need for a new om command.
avoids the need for a new task, which would complicate forking logic, something pipelines are not good at.
allows setting an environment-wide posture on whether the VMs will be recreated.

kcboyle commented 5 years ago

@xyloman @aegershman is there a way, right now, to do this through only the ops manager API? If so, om could certainly support doing so.

My thought as to how this would work currently is that, if you passed the flag, you would have to:

1) take the current ops manager config that is deployed
2) automatically enable the recreate option
3) deploy director-only changes
4) run a normal apply changes to recreate the vms you specified
5) automatically disable the recreate option
6) deploy director-only changes

IMO, this seems like a lot of responsibilities for a single flag to handle (though I do agree with the reasoning you provided @aegershman ! creating confidence in upgrades is the way to make the world a better place!)

Am I correct in assuming that this is the required workflow, or is there a better way to do so?

aegershman commented 5 years ago

Interestingly the recreate option, when configured, only sticks around for one "apply". If you tick the "recreate" box, then apply changes, right afterwards it will automatically flip back to disabling recreate. So I think you could take out steps 3 and 5 since they happen automatically. Whether they should happen automatically is a discussion for the opsmgr team to figure out, I suppose 😄

It's true, it does seem like extra responsibility for a single flag to handle. This seems like something where the recreate flag should be able to be applied on a product-by-product basis, and not just as a single param in the BOSH director tile.

As such this is probably an API parameter the opsmgr team should make available, and then om would simply make recreate: true a param that's passed? Implementing multi-step business logic to do this doesn't feel like the sole responsibility of om. Just thinking out loud.

But yes in general this would be the workflow. Thanks a ton for your time, by the way.

kcboyle commented 5 years ago

Of course! It seems like maybe this feature isn't so great as a singular om flag or command, but maybe could be good as a runnable script (or as a concourse task in Platform Automation). We can see if we can have a chat with the Ops Manager folks to see how feasible it would be to get a "recreate" box or option for apply-changes.

kcboyle commented 5 years ago

Following up with you @aegershman! We collaborated with a variety of folks and found a solution that might be able to work for this use case. A way to accomplish this with a minimal number of calls could be to do a single property PUT to the /api/v0/staged/director/properties endpoint with the bosh_recreate_on_next_deploy key under director_configuration to ensure VMs are recreated without modifying any of your other director config. After completing the PUT, the apply-changes could run as normal.
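A minimal sketch of what that single-property PUT could look like, assuming the request is sent through `om curl` and that authentication comes from the usual om environment settings. The endpoint path and the two key names come from the comment above; the function names and the choice of `om curl` as the transport are illustrative assumptions, not a description of how om implements anything.

```python
import json
from typing import List

# Endpoint and key names are taken from the comment above; everything else
# here (function names, going through `om curl`) is an assumption.
DIRECTOR_PROPERTIES_PATH = "/api/v0/staged/director/properties"


def recreate_payload() -> str:
    """Return a JSON body that sets only the recreate-on-next-deploy flag."""
    body = {"director_configuration": {"bosh_recreate_on_next_deploy": True}}
    return json.dumps(body, sort_keys=True)


def om_curl_args(payload: str) -> List[str]:
    """Build (but do not run) an `om curl` invocation for the PUT."""
    return [
        "om", "curl",
        "--path", DIRECTOR_PROPERTIES_PATH,
        "--request", "PUT",
        "--data", payload,
    ]


if __name__ == "__main__":
    # Print the command instead of executing it, so the sketch is safe to run.
    print(" ".join(om_curl_args(recreate_payload())))
```

Because the body carries only the one key, the rest of the staged director configuration is left alone, which is the point of the approach described above.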

I was approaching this problem without considering that the configure endpoint was a PATCH. Therefore, with a --recreate-all-vms flag enabled, apply-changes could run a simple single-property update to the director and then apply-changes as normal.

I'd feel comfortable accepting a PR that accomplished this minimal goal, if anyone in the community is open to submitting a PR.

xyloman commented 5 years ago

After the apply changes run would apply changes have to be run again to clear any pending changes to the director?

kcboyle commented 5 years ago

Nope @xyloman. I double checked with the Ops Manager team, and @aegershman was correct when he mentioned this above. The checkbox, once checked, only runs once for the next apply-changes, and then is removed.

xyloman commented 5 years ago

Correct, however I have noticed this puts the director tile into a state with pending changes which has to be applied to clear the state.

aegershman commented 5 years ago

Right, as @xyloman said, it automatically "removes" the recreate-all-VMs config flag on the director after it's run once, so you don't need to make an API call to disable it. But before the removal of the recreate-all-vms is officially "set", you have to run apply-changes on the director again, because it's considered a "pending change". So you have to run apply-changes twice.

kcboyle commented 5 years ago

Hmmmm. That's tricky @aegershman @xyloman . Let me follow up again and see what's up.

kcboyle commented 5 years ago

@xyloman @aegershman , Can confirm. Perhaps this is O.K. behavior. The Ops Manager director is completely independent of the rest of the foundation, and can have apply-changes --skip-deploy-products run to update only the settings for the director itself without affecting any deployments (thus, no observable downtime). It leaves the apply-changes command the potential to leave a "dirty" deployment, but this is easily mitigated by the other command.

From your perspective, if the workflow for this feature was as described above (setting the director properties with the single recreate property before apply-changes) and this was then paired with apply-changes --skip-deploy-products, would that be an acceptable (if not optimal) workflow?
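The workflow in question can be sketched as three commands run in order: set the single recreate property, run apply-changes for the selected products, then run apply-changes --skip-deploy-products to clear the director's pending change. This is only a sketch under assumptions: `recreate_plan` and `run_plan` are hypothetical helper names, the property is set through `om curl`, and om is assumed to pick up its target and credentials from the environment.

```python
import subprocess
from typing import Callable, List

# Body for the single-property update; key names come from the thread above.
RECREATE_BODY = '{"director_configuration": {"bosh_recreate_on_next_deploy": true}}'


def recreate_plan(products: List[str]) -> List[List[str]]:
    """Return the three commands of the proposed workflow, in order."""
    set_flag = [
        "om", "curl",
        "--path", "/api/v0/staged/director/properties",
        "--request", "PUT",
        "--data", RECREATE_BODY,
    ]
    deploy = ["om", "apply-changes"]
    for name in products:
        deploy += ["--product-name", name]
    # Second apply, director only, to clear the self-reset pending change.
    clear_pending = ["om", "apply-changes", "--skip-deploy-products"]
    return [set_flag, deploy, clear_pending]


def run_plan(plan: List[List[str]], runner: Callable = subprocess.run) -> None:
    """Run each command in order; check=True stops at the first failure."""
    for command in plan:
        runner(command, check=True)


if __name__ == "__main__":
    # Dry run: show the plan rather than touching a real Ops Manager.
    for command in recreate_plan(["pas"]):
        print(" ".join(command))
```

Passing a fake `runner` makes the sequence easy to check in a test without an Ops Manager available.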

The Ops Manager team hopes to add this to apply-changes someday, but not anytime in the near future.

aegershman commented 5 years ago

I could see this being a configuration flag part of the om apply-changes; om apply-changes --products=pas <other-params> --recreate && toggleable as an env var (BOSH_RECREATE?)

I could see this being done as part of one om command; e.g., since this requires two apply calls to opsmgr (one to apply-changes, then another to do apply-changes --skip-deploy-products for the director), it would be handled by om in a single command like om apply-changes --products=pas --recreate (or something to that effect) && it would only return as success/failure after those two calls succeeded.

Thoughts? /cc @xyloman @dashaun

kcboyle commented 5 years ago

@aegershman how would we feel about doing what you suggested above, but rather than keeping it as that single flag with 3 subcommands, having a new command added to om called recreate-all-vms or something similar. Then we're not overloading apply-changes, and we can be more explicit about what the command is actually doing.

We'd want to keep the --products flag to keep parity. It just feels like having one flag handle all of this additional functionality is too many responsibilities, and could be hard to maintain. (apply-changes does a lot already)

xyloman commented 5 years ago

I think a new command makes a lot more sense. +1 to keeping the products flag and anything else that would be associated to select product deployments.

aegershman commented 5 years ago

That's fine. The reason I was thinking it would be part of the apply command was so toggling the flag could be done as an env variable and reduce branching logic within a Concourse task. There are some situations where you might want to run apply-changes without recreate (e.g. updating a config property of the foundation which wouldn't normally cause a rotation), and some where you would want to. I think I'm misunderstanding our own ideal/intended workflow though; ignore me. 👍

EDIT: actually let me think about this more, unignore me

kcboyle commented 5 years ago

I could imagine this working well in a concourse setting if you had a job with the single recreate-vms task, that ran on a weekly trigger. this could offer more visibility into what the pipeline is doing, though it would add more yaml.

Going to re-add the PR welcome tag, assuming the work is done for a separate command, maintaining functionality of the select product deployments that apply-changes requires/supports.

shanman190 commented 5 years ago

@kcboyle another use case that I have is around recreating all VMs after things such as CA certificate rotations. That process needs a couple of apply changes with recreates to rotate and then clean up, followed by an apply changes to clear the pending director changes.

shanman190 commented 5 years ago

So a question comes to mind, if the om command were recreate-all-vms would it also recreate service instance VMs by enabling the recreate-all-service-instances errand for any deployed products?

jtarchie commented 5 years ago

@shanman190: If we were to approach errands, too, we'd need to know the errand name for each service. Are those named consistently? If not, how would we identify them?

shanman190 commented 5 years ago

@jtarchie, yeah, I don't know if they are named the same or not. My guess is that since each bosh release can have its own errands with their own names and there doesn't seem to be an opinionated way of naming the errands, it probably isn't likely. With that in mind, it's probably just going to be easier to keep a similar API surface area by allowing --config to be passed in for the errands configuration just like it was done for the apply-changes task.

anEXPer commented 4 years ago

We ended up doing this as a --recreate-vms flag on apply-changes because it was way easier, and because it's what we'd like to nudge the API towards on the Ops Manager side (being an option on the apply changes API rather than a self-resetting bit in the Director config). It's on master now. We discovered this flag doesn't actually work for the Director VM, FYI, but Ops Manager says they're going to fix that in an upcoming version, backported to 2.5 plus.
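With the flag on apply-changes, a pipeline task could decide per run whether to recreate, which addresses the earlier concern about branching logic. A minimal sketch, assuming the task is driven from Python: `RECREATE_VMS` is a hypothetical environment variable name invented for this example, and `apply_changes_args` is an illustrative helper, not part of om.

```python
import os
from typing import List, Mapping


def apply_changes_args(products: List[str],
                       env: Mapping[str, str] = os.environ) -> List[str]:
    """Build an `om apply-changes` command, adding --recreate-vms on demand."""
    args = ["om", "apply-changes"]
    for name in products:
        args += ["--product-name", name]
    # RECREATE_VMS is a made-up variable name for this sketch only.
    if env.get("RECREATE_VMS", "").lower() == "true":
        args.append("--recreate-vms")
    return args


if __name__ == "__main__":
    # Print the command that would be run for the current environment.
    print(" ".join(apply_changes_args(["pas"])))
```

The same task definition can then serve both the weekly repave job and ordinary configuration changes, with only the variable differing.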

aegershman commented 4 years ago

apologies, was looking over the github commits for it. just checking my understanding:

If you pass --recreate-vms, is it going to try to recreate the director as part of applying changes for other deployments? That is to say, if someone does om apply-changes --product-name pas --recreate-vms, then I'm guessing it recreates the VMs associated to pas and not try to recreate the director. If the user wanted to recreate the director I'd assume it'd be a specific command of om apply-changes --skip-deploy-products --recreate-vms. Is that the gist?

thanks again for all you do 👌

xyloman commented 4 years ago

+1 to a way to selectively recreate the director VM without incurring the cost of recreating the director VM when selectively recreating a given product.

anEXPer commented 4 years ago

Aaron, unfortunately, once Ops Manager fixes their bug, the director will not be possible to exclude from recreation. Until they fix their bug, it will always be excluded from recreation.

@dsboulder would be the person to see about the things y'all are wanting.

dsboulder commented 4 years ago

@xyloman If you're really just trying to recreate a deployment (and not trying to get out of a deploy failure), how about using bosh recreate on the deployment directly? OpsManager has never tried to implement all the operational CLI commands that BOSH has, such as bosh start and bosh stop, and we expect folks might want to use these more advanced things from the CLI directly.

aegershman commented 4 years ago

@dsboulder we experimented doing that a few months ago but felt we were subverting opsmgr's control as a source of truth during applies/changes.

I don't think the expectation is for opsmgr to implement all operational CLI commands like start/stop/restart, let alone everything else; though at the end of the day opsmgr is delegating to bosh-cli calls to perform create-env, upload, deploy, etc., so if bosh follows the convention of bosh -d mydeployment deploy mymanifest.yml <...> --recreate, why not follow the same pattern within opsmgr rather than treating --recreate as a global flag applied to the director (rather than on the deployment) and as something which also performs the commands of create-env?

xyloman commented 4 years ago

@aegershman very well stated; I agree with this sentiment as well. There is also the risk of a deployment getting locked and the ops manager not reflecting that state. Then if someone attempted to apply changes to that product, it would fail because someone subverted ops manager state mechanics. Personally we have discussed this numerous times, and we would actually have a serial group in our concourse pipeline which would ensure that the pipelines would not have that issue. But in other environments you might not be able to depend on that.

pivotal-cf / om