Closed aegershman closed 4 years ago
This seems interesting! (and something we could maybe bring up with the opsman team).
Could I rephrase and say that you would like `recreate-all-vms` to be configurable, like Ops Manager allows you to do selective deploys?
Digging in a little deeper, what is the use case for recreating the vms every week? What problem are we solving with the recreate a single time (in between cves)?
I think that's a fair rephrasing. Thoughts @xyloman ?
The use case for recreating VMs every week is twofold:
VMs should be rotated to maintain an expected cadence and comfort. When we do take new stemcells/products which require VM rotation, it shouldn't be anything "special" or something that incites fear. VMs coming up and down should be the expectation. We tell platform consumers that their apps must be cloud-native and resilient to VM rotation, and I want this to be taken seriously. But more importantly:
Recreating VMs is an important security practice. I'm going to defer to this article by Justin Smith called The Three Rs of Enterprise Security: Rotate, Repave, and Repair for more information on the security rationale.
@aegershman this is exactly my thought.
@kcboyle it could be a separate task however the more I think about it the more I would like to have it in something like the env.yml. if we could put the flag there it:
@xyloman @aegershman is there a way, right now, to do this through only the Ops Manager API? If so, `om` could certainly support doing so.
My thought as to how this would work currently is that you would have to (if you passed the flag):
1) take the current Ops Manager config that is deployed
2) automatically enable the `recreate` option
3) deploy director-only changes
4) run a normal apply changes to recreate the VMs you specified
5) automatically disable the `recreate` option
6) deploy director-only changes
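The six steps above can be sketched as a small orchestration function. This is only an illustration of the ordering: the three callables (`set_recreate`, `apply_director_only`, `apply_changes`) are hypothetical stand-ins, not functions that `om` or Ops Manager provide.

```python
from typing import Callable, List


def recreate_workflow(
    set_recreate: Callable[[bool], None],
    apply_director_only: Callable[[], None],
    apply_changes: Callable[[], None],
) -> None:
    """Run the proposed sequence using injected (hypothetical) callables."""
    set_recreate(True)       # steps 1-2: start from the deployed config, enable recreate
    apply_director_only()    # step 3: deploy director-only changes
    apply_changes()          # step 4: normal apply changes, recreating the VMs
    set_recreate(False)      # step 5: disable recreate again
    apply_director_only()    # step 6: deploy director-only changes


# Usage with recording stubs, so the ordering can be inspected without a foundation.
log: List[str] = []
recreate_workflow(
    set_recreate=lambda enabled: log.append(f"recreate={enabled}"),
    apply_director_only=lambda: log.append("director-only apply"),
    apply_changes=lambda: log.append("apply-changes"),
)
print(log)
```

Passing the operations in as callables keeps the sketch independent of how each step would actually be performed against Ops Manager.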
IMO, this seems like a lot of responsibilities for a single flag to handle (though I do agree with the reasoning you provided @aegershman ! creating confidence in upgrades is the way to make the world a better place!)
Am I correct in assuming that this is the required workflow, or is there a better way to do so?
Interestingly, the `recreate` option, when configured, only sticks around for one "apply". If you tick the "recreate" box and then apply changes, right afterwards it will automatically flip back to disabling `recreate`. So I think you could take out steps 3 and 5 since they happen automatically. Whether they should happen automatically is a discussion for the opsmgr team to figure out, I suppose.
It's true, it does seem like extra responsibility for a single flag to handle. This seems like something where the `recreate` flag should be able to be applied on a product-by-product basis, and not just as a single param in the BOSH director tile.
As such, this is probably an API parameter the opsmgr team should make available, and then `om` would simply make `recreate: true` a param that's passed? Implementing multi-step business logic to do this doesn't feel like the sole responsibility of `om`. Just thinking out loud.
But yes in general this would be the workflow. Thanks a ton for your time, by the way.
But yes in general this would be the workflow. Thanks a ton for your time, by the way.
Of course! It seems like maybe this feature isn't so great as a singular `om` flag or command, but maybe could be good as a runnable script (or as a Concourse task in Platform Automation). We can see if we can have a chat with the Ops Manager folks to see how feasible it would be to get a "recreate" box or option for apply-changes.
Following up with you @aegershman! We collaborated with a variety of folks and found a solution that might be able to work for this use case. A way to accomplish this with a minimal number of calls could be to do a single-property `PUT` to the `/api/v0/staged/director/properties` endpoint with the `bosh_recreate_on_next_deploy` key under `director_configuration` to ensure VMs are recreated without modifying any of your other director config. After completing the `PUT`, the `apply-changes` could run as normal.
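As a rough sketch of that single-property `PUT`, the snippet below only builds the request and does not send it. The endpoint path and the `bosh_recreate_on_next_deploy` key under `director_configuration` come from the comment above; the host name, the token placeholder, and the bearer-token `Authorization` header are assumptions for illustration.

```python
import json
import urllib.request


def build_recreate_request(opsman_url: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a PUT that stages only the recreate property."""
    body = {"director_configuration": {"bosh_recreate_on_next_deploy": True}}
    return urllib.request.Request(
        url=opsman_url.rstrip("/") + "/api/v0/staged/director/properties",
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={
            "Content-Type": "application/json",
            # Assumption: the API accepts a bearer token in this header.
            "Authorization": f"Bearer {token}",
        },
    )


# Hypothetical host and token, used only to show the shape of the request.
req = build_recreate_request("https://opsman.example.com", "PLACEHOLDER_TOKEN")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Because the body carries only the one key, none of the other director configuration is touched by the request itself.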
I was approaching this problem without considering that the `configure` endpoint was a `PATCH`. Therefore, with a `--recreate-all-vms` flag enabled, `apply-changes` could run a simple single-property update to the director and then `apply-changes` as normal.
I'd feel comfortable accepting a PR that accomplished this minimal goal, if anyone in the community is open to submitting a PR.
After the apply changes run would apply changes have to be run again to clear any pending changes to the director?
Nope @xyloman. I double checked with the Ops Manager team, and @aegershman was correct when he mentioned this above. The checkbox, once checked, only runs once for the next `apply-changes`, and then is removed.
Correct, however I have noticed this puts the director tile into a state with pending changes which has to be applied to clear the state.
Right, as @xyloman said, it automatically "removes" the `recreate-all-VMs` config flag on the director after it's run once, so you don't need to make an API call to disable it. But before the removal of `recreate-all-vms` is officially "set", you have to run `apply-changes` on the director again, because it's considered a "pending change". So you have to run `apply-changes` twice.
Hmmmm. That's tricky @aegershman @xyloman . Let me follow up again and see what's up.
@xyloman @aegershman, can confirm. Perhaps this is O.K. behavior. The Ops Manager director is completely independent of the rest of the foundation, and can have `apply-changes --skip-deploy-products` run to update only the settings for the director itself without affecting any deployments (thus, no observable downtime). This leaves the `apply-changes` command with the potential to leave a "dirty" deployment, but this is easily mitigated by the other command.
From your perspective, if the workflow for this feature were as described above (setting the director properties with the single recreate property before `apply-changes`) and this were then paired with `apply-changes --skip-deploy-products`, would that be an acceptable (if not optimal) workflow?
The Ops Manager team hopes to add this to apply-changes someday, but not anytime in the near future.
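The two-pass workflow described above might look like the following sketch. It assumes the recreate property has already been staged on the director, uses only flags mentioned in this thread (`--product-name`, `--skip-deploy-products`), and leaves out the connection and authentication options `om` would also need. The `runner` callable is a hypothetical hook; a real one could wrap `subprocess.run`.

```python
from typing import Callable, List, Optional


def two_pass_plan(product: Optional[str] = None) -> List[List[str]]:
    """Return the two om invocations, in order, as argv lists."""
    first = ["om", "apply-changes"]
    if product:
        first += ["--product-name", product]
    # Second pass clears the pending director change left by the self-resetting flag.
    second = ["om", "apply-changes", "--skip-deploy-products"]
    return [first, second]


def run_plan(plan: List[List[str]], runner: Callable[[List[str]], int]) -> bool:
    """Run each command in order; stop and report failure on a non-zero exit code."""
    for argv in plan:
        if runner(argv) != 0:
            return False
    return True


# Usage with a fake runner that records commands instead of executing them.
calls: List[List[str]] = []


def fake_runner(argv: List[str]) -> int:
    calls.append(argv)
    return 0


ok = run_plan(two_pass_plan("pas"), fake_runner)
print(ok, calls)
```

Stopping at the first failure means the director-only pass is skipped if the main apply fails, which may or may not be the behavior you want in a pipeline.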
I could see this being a configuration flag on `om apply-changes`, e.g. `om apply-changes --products=pas <other-params> --recreate`, and toggleable as an env var (`BOSH_RECREATE`?).
I could see this being done as part of one `om` command; e.g., since this requires two `apply` calls to opsmgr (one to `apply-changes`, then another to do `apply-changes --skip-deploy-products` for the director), it would be handled by `om` in a single command like `om apply-changes --products=pas --recreate` (or something to that effect), and it would only return success/failure after those two calls completed.
Thoughts? /cc @xyloman @dashaun
@aegershman how would we feel about doing what you suggested above, but rather than keeping it as that single flag with 3 subcommands, having a new command added to `om` called `recreate-all-vms` or something similar? Then we're not overloading apply-changes, and we can be more explicit about what the command is actually doing.
We'd want to keep the `--products` flag to keep parity. It just feels like having one flag handle all of this additional functionality is too many responsibilities, and could be hard to maintain. (`apply-changes` does a lot already.)
I think a new command makes a lot more sense. +1 to keeping the products flag and anything else that would be associated to select product deployments.
That's fine, the reason I was thinking it should be a part of the `apply` command was so toggling the flag could be done as an `env` variable and reduce branching logic within a Concourse task. There are some situations where you might want to run `apply-changes` without recreate (e.g. updating a config property of the foundation which wouldn't normally cause a rotation), and some where you would want to. I think I'm misunderstanding our own ideal/intended workflow though; ignore me.
EDIT: actually let me think about this more, unignore me
I could imagine this working well in a concourse setting if you had a job with the single recreate-vms task, that ran on a weekly trigger. this could offer more visibility into what the pipeline is doing, though it would add more yaml.
Going to re-add the PR welcome tag, assuming the work is done for a separate command, maintaining functionality of the select product deployments that `apply-changes` requires/supports.
@kcboyle another use case that I have is around recreating all VMs after things such as CA certificate rotations. There are a couple of apply changes with recreates necessary in that process to rotate and then cleanup followed up with the apply changes to clear the pending director changes.
So a question comes to mind: if the `om` command were `recreate-all-vms`, would it also recreate service instance VMs by enabling the `recreate-all-service-instances` errand for any deployed products?
@shanman190: If we were to approach errands too, we'd need to know the errand name for each service. Are those named consistently? If not, how would we identify them?
@jtarchie, yeah, I don't know if they are named the same or not. My guess is that since each bosh release can have its own errands with their own names and there doesn't seem to be an opinionated way of naming the errands, it probably isn't likely. With that in mind, it's probably just going to be easier to keep a similar API surface area by allowing `--config` to be passed in for the errands configuration just like it was done for the `apply-changes` task.
We ended up doing this as a `--recreate-vms` flag on `apply-changes` because it was way easier, and because it's what we'd like to nudge the API towards on the Ops Manager side (being an option on the apply changes API rather than a self-resetting bit in the Director config). It's on `master` now. We discovered this flag doesn't actually work for the Director VM, FYI, but Ops Manager says they're going to fix that in an upcoming version, backported to 2.5 plus.
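With the flag living on `apply-changes`, the env-var toggle discussed earlier in the thread could be handled by a thin wrapper in a pipeline task. The sketch below is one way to do that; the `RECREATE_VMS` variable name and the set of accepted truthy values are hypothetical, while `--product-name` and `--recreate-vms` are the flags named in this thread.

```python
from typing import List, Mapping, Optional


def apply_changes_args(env: Mapping[str, str], product: Optional[str] = None) -> List[str]:
    """Build the om argv, adding --recreate-vms only when the env toggle is set."""
    args = ["om", "apply-changes"]
    if product:
        args += ["--product-name", product]
    # Hypothetical toggle: RECREATE_VMS is not an om setting, just a task-level convention.
    if env.get("RECREATE_VMS", "").strip().lower() in ("1", "true", "yes"):
        args.append("--recreate-vms")
    return args


# Usage: in a real task you would pass os.environ instead of a literal dict.
print(apply_changes_args({"RECREATE_VMS": "true"}, "pas"))
print(apply_changes_args({}, "pas"))
```

This keeps a single task definition for both the weekly repave and ordinary config-only applies, with the difference expressed as one variable.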
apologies, was looking over the github commits for it. just checking my understanding:
If you pass `--recreate-vms`, is it going to try to recreate the director as part of applying changes for other deployments? That is to say, if someone does `om apply-changes --product-name pas --recreate-vms`, then I'm guessing it recreates the VMs associated to `pas` and does not try to recreate the director. If the user wanted to recreate the director I'd assume it'd be a specific command of `om apply-changes --skip-deploy-products --recreate-vms`. Is that the gist?
thanks again for all you do
+1 to a way to selectively recreate the director VM, without paying the cost of recreating the director VM when selectively recreating a given product.
Aaron, unfortunately, once Ops Manager fixes their bug, the director will not be possible to exclude from recreation. Until they fix their bug, it will always be excluded from recreation.
@dsboulder would be the person to see about the things y'all are wanting.
@xyloman If you're really just trying to recreate a deployment (and not trying to get out of a deploy failure), how about using `bosh recreate` on the deployment directly? OpsManager has never tried to implement all the operational CLI commands that BOSH has, such as `bosh start` and `bosh stop`, and we expect folks might want to use these more advanced things from the CLI directly.
@dsboulder we experimented doing that a few months ago but felt we were subverting opsmgr's control as a source of truth during applies/changes.
I don't think the expectation is for opsmgr to implement all operational CLI commands like start/stop/restart, let alone everything else; though at the end of the day opsmgr is delegating to bosh-cli calls to perform `create-env`, `upload`, `deploy`, etc., so if `bosh` follows the convention of `bosh -d mydeployment deploy mymanifest.yml <...> --recreate`, why not follow the same pattern within opsmgr rather than treating `--recreate` as a global flag applied to the director (rather than to the `deployment`) and as something which also performs the commands of `create-env`?
@aegershman very well stated I agree with this sentiment as well. There also runs the risk of a deployment getting locked and the ops manager not reflecting that state. Then if someone attempted to apply changes to that product it would fail because someone subverted ops manager state mechanics. Personally we have discussed this numerous times and we would actually have a serial group in our concourse pipeline which would ensure that the pipelines would not have that issue. But in other environments you might not be able to depend on that.
In order to repave all the VMs associated to a deployment (tile) without having to configure the bosh director separately, `om` should allow a `recreate` param to be passed on `apply-changes`.
background
We want to repave the VMs in our production environment every 7 days, regardless of if there is a new stemcell release or not.
However, you currently have to toggle the "recreate all VMs" flag within the bosh director tile in order to turn this feature on. And once the next apply-changes has been set, the "recreate all VMs" toggle gets turned off and is set as a "pending change" for the next apply.
The ability to recreate VMs would be better on a deployment-by-deployment (tile-by-tile) basis.
"shouldn't this be configured in the BOSH director tile before applying those changes?"
It's true that there is a "recreate all VMs" toggle in the director tile, but you shouldn't have to configure the director as part the pipeline before every apply-changes for each product.
There's an analogy to the bosh-cli, where when deploying a single deployment, you can pass `--recreate` to tell the director to recreate all the VMs within that deployment.
How could this be implemented using the current opsmgr API?
Without any changes to the upstream opsmgr API, this would probably require `om` to, under the covers, do a `configure-product` on the bosh director tile to set the `recreate all VMs` flag? And we'd still be left in a situation where after the deployment has finished, there would be pending changes to have the bosh director unset "recreate all VMs" since that happens automatically. Not sure if this would be problematic or not. I think it would require an opsmgr API change, but in general, the decision to recreate-all-VMs should likely be configurable on a deployment-by-deployment basis.
Thoughts?