TestKitchen refuses to destroy VMs that failed to finish the create step

jayhendren commented 6 years ago

Description

If something fails after a VM is created but still during the kitchen create step, TestKitchen thinks the VM hasn't been created and will refuse to destroy it.

Kitchen Version

[birdsnest ~/Work/scratchpad/cub_auth](master|✔)[I]% kitchen --version
Test Kitchen version 1.22.0

ChefDK Version

[birdsnest ~/Work/scratchpad/cub_auth](master|✔)[I]% chef --version
Chef Development Kit Version: 3.1.0
chef-client version: 14.2.0
delivery version: master (6862f27aba89109a9630f0b6c6798efec56b4efe)
berks version: 7.0.4
kitchen version: 1.22.0
inspec version: 2.1.72

Platform Version

Arch Linux

Replication Case

Start with no running VMs:

[birdsnest ~/Work/scratchpad/cub_auth](master|✔)[I]% kitchen list
Instance         Driver   Provisioner  Verifier   Transport  Last Action    Last Error
default-cub-rh7  Vagrant  ChefZero     CubBusser  Sftp       <Not Created>  <None>
[birdsnest ~/Work/scratchpad/cub_auth](master|✔)[I]% vboxmanage list vms | grep rh7
[birdsnest ~/Work/scratchpad/cub_auth](master|✔)[I]%

Then during an attempt to create a VM via TestKitchen, the create step fails after the VM comes up:

[birdsnest ~/Work/scratchpad/cub_auth](master|✔)[I]% kitchen create
-----> Starting Kitchen (v1.22.0)
-----> Creating <default-cub-rh7>...
       ==> default: Checking for updates to 'cub_rh7'
           default: Latest installed version: 7.5-13
           default: Version constraints: 
           default: Provider: virtualbox
       ==> default: Box 'cub_rh7' (v7.5-13) is running the latest version.
       Bringing machine 'default' up with 'virtualbox' provider...
       ==> default: Importing base box 'cub_rh7'...
==> default: Matching MAC address for NAT networking...
       ==> default: Checking if box 'cub_rh7' is up to date...
       ==> default: Setting the name of the VM: default-cub-rh7_default_1537467634370_16029
       ==> default: Fixed port collision for 22 => 2222. Now on port 2200.
       ==> default: Clearing any previously set network interfaces...
       ==> default: Preparing network interfaces based on configuration...
           default: Adapter 1: nat
       ==> default: Forwarding ports...
           default: 22 (guest) => 2200 (host) (adapter 1)
       ==> default: Running 'pre-boot' VM customizations...
       ==> default: Booting VM...
       ==> default: Waiting for machine to boot. This may take a few minutes...
           default: SSH address: 127.0.0.1:2200
           default: SSH username: vagrant
           default: SSH auth method: private key
       ==> default: Machine booted and ready!
       ==> default: Registering box with vagrant-registration...
       Organization bogus does not exist.
>>>>>> ------Exception-------
>>>>>> Class: Kitchen::ActionFailed
>>>>>> Message: 1 actions failed.
>>>>>>     Failed to complete #create action: [Expected process to exit with [0], but received '1'
---- Begin output of vagrant up --provider virtualbox ----
STDOUT: Bringing machine 'default' up with 'virtualbox' provider...
==> default: Importing base box 'cub_rh7'...
==> default: Matching MAC address for NAT networking...
==> default: Checking if box 'cub_rh7' is up to date...
==> default: Setting the name of the VM: default-cub-rh7_default_1537467634370_16029
==> default: Fixed port collision for 22 => 2222. Now on port 2200.
==> default: Clearing any previously set network interfaces...
==> default: Preparing network interfaces based on configuration...
    default: Adapter 1: nat
==> default: Forwarding ports...
    default: 22 (guest) => 2200 (host) (adapter 1)
==> default: Running 'pre-boot' VM customizations...
==> default: Booting VM...
==> default: Waiting for machine to boot. This may take a few minutes...
    default: SSH address: 127.0.0.1:2200
    default: SSH username: vagrant
    default: SSH auth method: private key
==> default: Machine booted and ready!
==> default: Registering box with vagrant-registration...
STDERR: Organization bogus does not exist.
---- End output of vagrant up --provider virtualbox ----
Ran vagrant up --provider virtualbox returned 1] on default-cub-rh7
>>>>>> ----------------------
>>>>>> Please see .kitchen/logs/kitchen.log for more details
>>>>>> Also try running `kitchen diagnose --all` for configuration

kitchen create  4.09s user 1.74s system 7% cpu 1:13.90 total
[birdsnest ~/Work/scratchpad/cub_auth](master|✔)[I]%

Consequently, the VM exists according to VirtualBox, but does not exist according to TestKitchen:

[birdsnest ~/Work/scratchpad/cub_auth](master|✔)[I]% vboxmanage list vms | grep rh7
"default-cub-rh7_default_1537467634370_16029" {5ac5f267-f698-4dcf-89ec-6838d53d1d44}
[birdsnest ~/Work/scratchpad/cub_auth](master|✔)[I]% kitchen list
Instance         Driver   Provisioner  Verifier   Transport  Last Action    Last Error
default-cub-rh7  Vagrant  ChefZero     CubBusser  Sftp       <Not Created>  Kitchen::ShellOut::ShellCommandFailed

And even after a kitchen destroy, the VM still exists:

[birdsnest ~/Work/scratchpad/cub_auth](master|✔)[I]% kitchen destroy
-----> Starting Kitchen (v1.22.0)
-----> Destroying <default-cub-rh7>...
       Finished destroying <default-cub-rh7> (0m0.00s).
-----> Kitchen is finished. (0m0.19s)
[birdsnest ~/Work/scratchpad/cub_auth](master|✔)[I]% kitchen list
Instance         Driver   Provisioner  Verifier   Transport  Last Action    Last Error
default-cub-rh7  Vagrant  ChefZero     CubBusser  Sftp       <Not Created>  <None>
[birdsnest ~/Work/scratchpad/cub_auth](master|✔)[I]% vboxmanage list vms | grep rh7
"default-cub-rh7_default_1537467634370_16029" {5ac5f267-f698-4dcf-89ec-6838d53d1d44}

I see the same behavior if create fails during kitchen test --destroy=always as well.

I expect that kitchen create; kitchen destroy or kitchen test --destroy=always will always result in no running VMs leftover, but this behavior does not match that expectation.

cheeseplus commented 6 years ago

The first step is making sure destroy double checks for orphans. Then I can look at perhaps creating another state as Vagrant still creates a VM despite failure - for the most part other drivers like cloud ones destroy the instance if anything bad happens during create. It's less that it refuses to destroy and more it doesn't know it has been created because it receives the error code back from Vagrant.

This might require work in core as well but that's after making destroy more robust for Vagrant specifically.

Rudikza commented 4 years ago

It looks like this issue still exists. @cheeseplus are you still working on this or could I perhaps give it a try?

cheeseplus commented 4 years ago

I am not working on it and I've been mostly out of the ecosystem for over a year so go nuts!

jayhendren commented 4 years ago

It's less that it refuses to destroy and more it doesn't know it has been created because it receives the error code back from Vagrant.

Maybe it's just that I have misplaced expectations, but I would expect an explicit kitchen destroy to at least attempt a destroy regardless. What it does right now is not make an attempt because of a false assumption about the state of the VM.

cheeseplus commented 4 years ago

It's a reasonable user facing expectation, I was speaking to the problem of an abstraction depending on another abstraction - at some point if you're spending more time trying to catch exceptions of an underlying abstraction then perhaps the model was inherently a poor choice.

The false assumption is due to the nature of how Vagrant fails and because kitchen has a limited window into introspecting information during that initial create window. Aside from catching and parsing the error from Vagrant, an alternative would be some relative path checking alongside vagrant global-status parsing. This works but only limitedly - kitchen doesn't track the state of it's own config across multiple runs so moves, additions, or removals also fall through in terms of what kitchen knows to cleanup - mentioning this here for some illustration and illumination of how convoluted it can get when trying to do the "most reasonable" thing. Cloud integrations like ec2 are a bit easier to reason with as we're talking to an API and thus ask questions in a common language as opposed to shelling out which is also a notable part of the problem with the existing implementation.

This can be mitigated but at the cost of bolstering each plugin (vagrant in this case) with more discrete updates to it's state tracking. This is doable but as one who has chased a bunch of these types of issues in the ecosystem it requires either a) adding edgecases that are already mirrored in the underlying abstraction ad infinitum in each plugin or b) rethinking how TK manages state holistically.

Elaborating here because these are contributing factors as to why there has been little movement on issues like this - it's solvable (for some value of solvable) but it's diminishing returns and rabbit holes, especially in the case of the kitchen-vagrant abstraction.

marcparadise commented 3 years ago

Confirmed that this is still occurring. We wil take it into the backlog though we're also interested in community contributions to resolve this, as fixing this in a clean way is a larger undertaking.

test-kitchen / kitchen-vagrant