hashicorp / vagrant

Vagrant is a tool for building and distributing development environments.
https://www.vagrantup.com

Hyper-V Windows guest machine entered an invalid state while waiting for it to boot #12256

Open · rgl opened this issue 3 years ago

rgl commented 3 years ago

Vagrant version

Vagrant 2.2.14

Host operating system

Does not matter.

Guest operating system

Windows.

Vagrantfile

Does not matter which Vagrantfile we use.

Debug output

Not really debug output, but it's enough to show the problem:

==> client: Starting the machine...
==> client: Waiting for the machine to report its IP address...
    client: Timeout: 120 seconds
    client: IP: 192.168.56.10
==> client: Waiting for machine to boot. This may take a few minutes...
    client: WinRM address: 192.168.56.10:5985
    client: WinRM username: Administrator
    client: WinRM execution_time_limit: PT2H
    client: WinRM transport: negotiate
The guest machine entered an invalid state while waiting for it
to boot. Valid states are 'running'. The machine is in the
'stopping' state. Please verify everything is configured
properly and try again.

Expected behavior

Vagrant should have waited until the WinRM communicator is available (regardless of the VM state) or timeout.

Actual behavior

Vagrant gave up too soon, as shown in the debug output above.

This happens because Windows restarts several times before Vagrant can successfully connect to it. While it does so, Hyper-V reports the stopping state (and possibly others whose exact names I do not know, such as a hypothetical starting state).

This error seems to come from https://github.com/hashicorp/vagrant/blob/d70ff086af57af77f57369b696c80bcdfe73c6f5/plugins/providers/hyperv/action.rb#L157 which only allows the VM to be in the running state.

I would like to stop Vagrant from checking the current state of the VM by removing the ", [:running]" part of that line.

What do you think? If you agree, I will submit a PR.

Please note that there is no straightforward way to enumerate or know all the possible state names of a Hyper-V VM, and those might change in the future, which would unnecessarily break Vagrant.
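To make the trade-off concrete, here is a minimal Ruby sketch of a boot wait that optionally checks the VM state on every poll. It is a simplified, hypothetical model, not Vagrant's actual code: the names wait_for_boot and BootBadState and the simulated state sequence are assumptions for illustration. Passing [:running] mimics the current check; passing nil mimics removing it.

```ruby
# Hypothetical stand-in for the "invalid state while waiting for boot" error.
class BootBadState < StandardError; end

# Simplified model of a boot wait; NOT Vagrant's implementation.
# observed:     VM state seen on each poll (simulated)
# ready_at:     poll index at which WinRM first answers (simulated)
# valid_states: array of allowed states, or nil to skip the state check
# Running out of polls stands in for hitting the boot timeout.
def wait_for_boot(observed, ready_at, valid_states)
  observed.each_with_index do |state, i|
    if valid_states && !valid_states.include?(state)
      raise BootBadState, "valid: #{valid_states.inspect}, got: #{state.inspect}"
    end
    return :ready if i >= ready_at
  end
  :timeout
end

# Windows reboots during setup, so Hyper-V briefly reports :stopping.
observed = [:running, :stopping, :starting, :running, :running]

begin
  wait_for_boot(observed, 4, [:running])
rescue BootBadState => e
  puts "with [:running] check: #{e.message}"
end
puts "without state check: #{wait_for_boot(observed, 4, nil)}"
puts "without check, WinRM never answers: #{wait_for_boot(observed, 99, nil)}"
```

In this model, nil never aborts early and relies entirely on the timeout, while [:running] fails fast on the transient :stopping state, which is exactly the behavior reported above.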

PS: I'm trying to use Vagrant to create an empty VM, configure it to PXE boot, and use Windows Deployment Services (WDS) to drive the Windows installation. All of this almost works, except the part where Vagrant tries to connect to the VM, which sometimes fails.

Steps to reproduce

  1. vagrant up

soapy1 commented 3 years ago

Hey there, thanks for opening up an issue!

TL;DR: I think the [:running] bit, i.e. checking the VM state, is important. It was added to resolve some issues. I think this issue can be resolved by adding a few more acceptable states, [:running, :starting, :resuming], and by increasing the vm_boot_timeout.

It looks like this part about waiting for a running state was added to fix a bug associated with restoring snapshots, and it's still a useful thing to have here. Additionally, I think Vagrant should not continue if the VM is reachable by the communicator but has entered the 'Stopping' state. In that case, the VM will soon become unreachable and Vagrant will error out anyway.
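The sketch below checks the suggested allowlist against two made-up state sequences: a plain boot, and a boot interrupted by a Windows reboot. first_invalid_state is an illustrative helper, not part of Vagrant, and the sequences are simulated. It shows the consequence of the reasoning above: the extra states absorb the start-up transition, but a reboot that passes through :stopping would still abort the wait.

```ruby
# Allowlist suggested in this thread.
PROPOSED_VALID_STATES = [:running, :starting, :resuming].freeze

# Hypothetical helper: first observed state that would abort the wait, or nil.
def first_invalid_state(observed, valid_states)
  observed.find { |state| !valid_states.include?(state) }
end

plain_boot  = [:starting, :running, :running]
with_reboot = [:running, :stopping, :starting, :running]

p first_invalid_state(plain_boot, PROPOSED_VALID_STATES)   # => nil
p first_invalid_state(with_reboot, PROPOSED_VALID_STATES)  # => :stopping
```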

Looking into Hyper-V states, I was able to find all the possible states:

> Import-Module Hyper-V
> [enum]::GetNames([Microsoft.HyperV.Powershell.VMState])
Other
Running
Off
Stopping
Saved
Paused
Starting
Reset
Saving
Pausing
Resuming
FastSaved
FastSaving
ForceShutdown
ForceReboot
Hibernated
ComponentServicing
RunningCritical
OffCritical
StoppingCritical
SavedCritical
PausedCritical
StartingCritical
ResetCritical
SavingCritical
PausingCritical
ResumingCritical
FastSavedCritical
FastSavingCritical

I think it could make sense to add some items to the list of valid states, probably Starting and Resuming.
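The enum names above are CamelCase, while the error message reports 'stopping'. The sketch below assumes (this is not confirmed in the thread) that each Hyper-V state name reaches Vagrant as a downcased symbol, and sorts a few of the listed names against the proposed allowlist. to_vagrant_state is a hypothetical helper. Under that assumption, the *Critical variants such as RunningCritical would also fall outside the list.

```ruby
# A few names from the [enum]::GetNames(...VMState) output above.
HYPERV_STATE_NAMES = %w[Running Off Stopping Starting Resuming Reset RunningCritical].freeze
VALID_STATES = [:running, :starting, :resuming].freeze

# Assumption: the provider turns each Hyper-V name into a lowercase symbol,
# consistent with the 'stopping' state shown in the error message.
def to_vagrant_state(name)
  name.downcase.to_sym
end

states = HYPERV_STATE_NAMES.map { |name| to_vagrant_state(name) }
accepted, rejected = states.partition { |state| VALID_STATES.include?(state) }

p accepted  # => [:running, :starting, :resuming]
p rejected  # => [:off, :stopping, :reset, :runningcritical]
```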

> ... and those might change in the future, which would unnecessarily break vagrant.

While I think this is a very valid concern, I think the benefits of checking the VM state outweigh the cost of being vulnerable to an API change in Hyper-V. That is, checking the VM state ensures that Vagrant will not bomb out later on for seemingly unknown reasons.

> Vagrant should have waited until the WinRM communicator is available (regardless of the VM state) or timeout.

I think having a timeout here is important; I wouldn't want Vagrant to run indefinitely, say on a CI server. You might also want to set the vm_boot_timeout.
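The timeout point can be sketched with a generic polling helper. wait_until is a hypothetical name, not a Vagrant API: it polls a block until the block returns a truthy value or a deadline on a monotonic clock passes, so a guest that never becomes reachable ends the wait instead of hanging a CI job. In a Vagrantfile, the boot timeout itself is set with config.vm.boot_timeout (in seconds).

```ruby
# Hypothetical helper: poll until the block returns truthy or the deadline passes.
# A monotonic clock keeps wall-clock adjustments from stretching the timeout.
def wait_until(timeout:, interval: 0.1)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    return true if yield
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    sleep interval
  end
end

# Simulated communicator that starts answering on the third poll.
polls = 0
ready = wait_until(timeout: 2, interval: 0.01) { (polls += 1) >= 3 }
puts "ready=#{ready} after #{polls} polls"

# A communicator that never answers: the wait gives up instead of hanging.
puts "never ready=#{wait_until(timeout: 0.05, interval: 0.01) { false }}"
```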

pauby commented 2 years ago

If you watch the VM being created in whatever GUI you're using (I'm using Hyper-V), you'll see that Vagrant gives up before the VM has actually been properly provisioned. It times out waiting for the VM to start. I solved this by changing the following in the Vagrantfile:

cfg.windows.halt_timeout = 60

to:

cfg.windows.halt_timeout = 120

You may want to increase the value depending on the VM, your host machine etc. etc.

Note that the default value is 60, so you may experience this even if your Vagrantfile does not set cfg.windows.halt_timeout. In that case, the solution would be to add cfg.windows.halt_timeout = 120.

marcprodan commented 3 months ago

Hi @soapy1, is there any update on this topic? I am running into exactly the same issue as @rgl and find Vagrant barely usable with Hyper-V due to this unreliability. Adding additional acceptable states sounds like a reasonable solution. But are you sure that adding [:running, :starting, :resuming] would suffice, since :stopping is the state that raised the error according to the error log?

@pauby Adding cfg.windows.halt_timeout = 120 did not resolve this problem for me. As far as I can see, this option does not exist (anymore) in the current Vagrant release; see windows/config.rb in the source code. Have you found an alternative fix by any chance?

pauby commented 3 months ago

@marcprodan I stopped using Vagrant with Hyper-V some time ago as it was just too slow and not fit for (my) purpose. I used it with VMware, which was a lot better, and have since transitioned away from Windows entirely, and from Vagrant too, as I found it had too many issues for me to keep fighting.

My advice: if you have any alternative to Hyper-V with Vagrant, use it. I appreciate that's unlikely, but I wanted to highlight the limitations and pain you're going to keep encountering. There is AutomatedLab, which is built for Hyper-V (and Azure), and while I had a much better time with it, I was working in a cross-platform environment and couldn't use something Windows-only.

I'm sorry that isn't of much help to you, but I didn't want to ignore your comment.

marcprodan commented 3 months ago

@pauby Thanks for your feedback, I appreciate it a lot. Unfortunately, you are exactly right and I am stuck with Hyper-V. I will definitely have a look at AutomatedLab; thanks for the suggestion. However, I have already invested a lot of time into Vagrant (and Packer, to create custom Vagrant boxes), and this issue is the last missing puzzle piece for a reliably working solution, so changing to an alternative tool would be quite painful as well. (Sunk cost fallacy says hello.)