vatesfr / terraform-provider-xenorchestra

Xen Orchestra provider for Terraform
MIT License
150 stars 33 forks source link

Enhance `xenorchestra_vm` to provide a method for graceful termination #222

Open ddelnano opened 1 year ago

ddelnano commented 1 year ago

This is a continuation of #212.

I discovered that #220 needed to be addressed to avoid a race condition that existed when the XO client inside the provide tried to shutdown Vms. In addition to this, I decided that this was better to provide as an opt in rather than the default behavior.

Testing

$ TEST=TestAccXenorchestraVm_gracefulTermination make testacc

=== RUN TestAccXenorchestraVm_gracefulTermination === PAUSE TestAccXenorchestraVm_gracefulTermination === RUN TestAccXenorchestraVm_gracefulTerminationForShutdownVm === PAUSE TestAccXenorchestraVm_gracefulTerminationForShutdownVm === CONT TestAccXenorchestraVm_gracefulTermination === CONT TestAccXenorchestraVm_gracefulTerminationForShutdownVm === CONT TestAccXenorchestraVm_gracefulTermination resource_xenorchestra_vm_test.go:966: Step 2/2 error: Error running destroy: exit status 1

    Error: mock client did not receive a stopped Vm. Graceful termination was bypassed!

testing_new.go:63: Error running post-test destroy, there may be dangling resources: exit status 1

    Error: mock client did not receive a stopped Vm. Graceful termination was bypassed!

--- FAIL: TestAccXenorchestraVm_gracefulTermination (42.79s) --- PASS: TestAccXenorchestraVm_gracefulTerminationForShutdownVm (109.80s) FAIL

ddelnano commented 1 year ago

@4censord please review this to verify it will work for your use cause. The only functional difference should be that this functionality must be opted into with a new xenorchestra_vm resource argument -- use_graceful_termination.

After discovering #220, setting this to the default has the potential to significantly slow down the tests and the general case. From running the test suite a few times against #212, the default 2 min timeout for validating PV drivers are present causes more flakiness in the build and longer test times.

I think it should be expected that if terraform is to destroy resources that it may not be graceful. Therefore, I think making this opt in is the best solution.

4censord commented 1 year ago

This would work for me.

I would still argue that it should be the default behavior.

I think it should be expected that if terraform is to destroy resources that it may not be graceful. Therefore, I think making this opt in is the best solution.

Most other providers seem to work this way.

But i understand that it is not practical to do so right now

When attempting to shut down a vm without the management agent installed, this now times out. But it does not clearly log the problem:

xenorchestra_vm.vm: Still destroying... [id=361d045e-c9cf-661d-4863-1d9f5d38681e, 2m0s elapsed]
╷
│ Error: failed to gracefully halt the vm with id: 361d045e-c9cf-661d-4863-1d9f5d38681e and error: timeout while waiting for state to become 'true' (last state: 'false', timeout: 2m0s)
│ 
│ 
╵
4censord commented 1 year ago

Changing the use_graceful_termination attribute takes almost 30s:

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  ~ update in-place

Terraform will perform the following actions:

  # xenorchestra_vm.vm will be updated in-place
  ~ resource "xenorchestra_vm" "vm" {
        id                       = "361d045e-c9cf-661d-4863-1d9f5d38681e"
        tags                     = [
            "dev",
        ]
      ~ use_graceful_termination = true -> false
        # (20 unchanged attributes hidden)

        # (3 unchanged blocks hidden)
    }

Plan: 0 to add, 1 to change, 0 to destroy.
xenorchestra_vm.vm: Modifying... [id=361d045e-c9cf-661d-4863-1d9f5d38681e]
xenorchestra_vm.vm: Still modifying... [id=361d045e-c9cf-661d-4863-1d9f5d38681e, 10s elapsed]
xenorchestra_vm.vm: Still modifying... [id=361d045e-c9cf-661d-4863-1d9f5d38681e, 20s elapsed]
xenorchestra_vm.vm: Modifications complete after 26s [id=361d045e-c9cf-661d-4863-1d9f5d38681e]

Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
ddelnano commented 1 year ago

Most other providers seem to work this way.

It seems vsphere does this, but uses a 3 min timeout by default. What other providers have you seen?

When attempting to shut down a vm without the management agent installed, this now times out.

220 may need a slightly different heuristic. I'll see if the XO api exposes the use of PV drivers rather than relying on the management agent presence. I think we should handle the case were there aren't PV drivers potentially. Allowing people to opt into a forceful stop and using that if the graceful method fails or timeouts could satisfy that, but I wanted to see how XO handles things.

Changing the use_graceful_termination attribute takes almost 30s

Modifying Vms unfortunately relies on a 25 second sleep (source). That's another area where there is a race condition and not being able to detect the value has persisted causes problems for the test suite.

I don't think I can fix that in this PR, but I can look into that early next year.

4censord commented 1 year ago

It seems vsphere does this, but uses a 3 min timeout by default. What other providers have you seen?

Does or attempts graceful termination:

Does not do graceful termination:

Changing the use_graceful_termination attribute takes almost 30s

Modifying Vms unfortunately relies on a 25 second sleep (source).

Oh, i see.

ddelnano commented 1 year ago

223 will address the regression with requiring the management agent and will properly detect PV drivers

ddelnano commented 1 year ago

I think I'd like to take another pass at this and implement things closer to how hyperv works:

Having forceful shutdown should prevent the test suite flakiness that I experienced when setting the graceful shutdown as the default. So hopefully that should be more in line what you initially were advocating for.