hashicorp / vagrant

Vagrant is a tool for building and distributing development environments.
https://www.vagrantup.com

Flaky timeout during vagrant up - Timed out while waiting for the machine to boot #13062

Open nstng opened 1 year ago

nstng commented 1 year ago

Hi, I found a lot of issues regarding "Timed out while waiting for the machine to boot" - but none that really fit our current problem. As our setup is perhaps a little special, I mostly hope for hints on how to debug this.

We are using Vagrant 2.3.4 with VirtualBox 6.1.38r153438 on macOS 12.6.2 in GitHub CI (i.e., specs are from here).

In our workflow we bring up multiple VMs. The guests are based on modified Ubuntu focal base images that we load from Vagrant Cloud. In roughly one out of ten runs we get the following (i.e., the issue is flaky):

my_vm: SSH auth method: private key
Timed out while waiting for the machine to boot. This means that
Vagrant was unable to communicate with the guest machine within
the configured ("config.vm.boot_timeout" value) time period.

The timeout happens after the default 5 minutes.
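
For completeness, the timeout itself is configurable via config.vm.boot_timeout. A sketch of how it could be raised in the Vagrantfile (box name and VM name are placeholders, not our real config) - even though, as noted below, a longer timeout alone is unlikely to help:

```ruby
# Sketch only: names are placeholders, not our real setup.
Vagrant.configure("2") do |config|
  config.vm.define "my_vm" do |node|
    node.vm.box = "example/modified-focal"  # hypothetical box name
    # Default is 300 seconds; raising it only rules out a merely slow boot.
    node.vm.boot_timeout = 600
  end
end
```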

In runs without this issue we see that the SSH auth (boot) succeeds in under a minute, i.e., just waiting longer is unlikely to be the solution here. Example:

Fri, 13 Jan 2023 16:43:36 GMT     my_vm: SSH auth method: private key
Fri, 13 Jan 2023 16:44:08 GMT ==> my_vm: Machine booted and ready!

For testing/debugging we implemented a retry mechanism: try up to three times to build the VM; on failure, run vagrant destroy and try again. When the problem occurs, we hit the timeout all three times - which makes me think that something in the environment is causing this.
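
The retry wrapper is nothing fancy; a minimal sketch in Ruby (the command strings are placeholders for our real CI steps):

```ruby
# Minimal sketch of the retry mechanism described above.
# `cmd` and `cleanup` are placeholder shell commands, not our real CI steps.
def with_retries(attempts, cmd, cleanup: nil)
  attempts.times do |i|
    # system returns true when the command exits 0.
    return true if system(cmd)            # e.g. "vagrant up my_vm"
    # Destroy the half-built VM before the next attempt.
    system(cleanup) if cleanup && i < attempts - 1
  end
  false
end

# Hypothetical usage:
#   with_retries(3, "vagrant up my_vm", cleanup: "vagrant destroy -f my_vm")
```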

Any hints for debugging this are appreciated. Let me know what additional information can be helpful.

Expected behavior

Booting the VMs successfully in all CI runs.

Actual behavior

~one out of ten runs fails as described above

Reproduction information

Vagrant version

Vagrant 2.3.4 with VirtualBox 6.1.38r153438

Host operating system

macOS 12.6.2 in GitHub CI

Guest operating system

Ubuntu focal (modified base image)

Steps to reproduce

See above

Vagrantfile

Will provide if answers suggest that this is relevant.

appst commented 1 year ago

I too have sporadic boot failures. I have written a bash script that loops through a minimal VM spec with "vagrant up", "vagrant halt", and "vagrant destroy -f", with time delays between all stages. Most of the time it works, but it always eventually ends in failure. A similar bash script using only VirtualBox commands to the same effect does not fail. The GUI is completely black, which I believe indicates that the VM is not even booting. Debug traces have not enlightened me. I believe there may be a bug lurking in there somewhere.

appst commented 1 year ago

PS: I have tested this on four systems, all with ample resources (e.g., AMD 5950X with 64 GB memory). Three of the systems use Windows 10; one uses Ubuntu 20.04.1 (KDE). All use VirtualBox 6.1.40r154048 and Vagrant 2.3.4, and all eventually fail when running Vagrant in a loop.

appst commented 1 year ago

PS: the guest OS I am using is Ubuntu 20.04, though that is probably not relevant. The important thing is that it works most of the time but eventually fails.

appst commented 1 year ago

BTW: this has plagued me for months, if not years. I have always "fixed" it by simply re-provisioning a failed attempt. Lately I have been trying to fully automate Vagrant, hence all this testing. This needs to be resolved!

phinze commented 1 year ago

Hi @nstng and @appst - sorry to hear about the sporadic failures you're seeing. Sporadic issues are always tough to debug, so more information will be helpful for us to narrow it down. Can either of you share a minimal vagrantfile that reproduces the timeouts for you and/or the debug output from one of the timeouts?

appst commented 1 year ago

HI Paul,

I am in the midst of doing further testing on this and will get back to you once I see where it all ends up.

Thanks! Rolande Kendal


nstng commented 1 year ago

Hi @appst, thank you - I would appreciate it if you take over providing a minimal example. As said, our setup is complex and it would take me some time to cut out the unnecessary parts and test them. Let me know if I should take over again.

appst commented 1 year ago

Hi Nils,

My setup is rather complex too, but I am working diligently toward narrowing down my problem right now. I am days into trying to simplify and isolate things, and I am still looking for the smoking gun. My testing is basically running hundreds of iterations of launching a Vagrant VM. What I have noticed so far, in my environment, is that generic boxes from HashiCorp or Canonical do not hang, while what I build with Packer eventually hangs 2, 10, or 50 iterations in. I don't know if that relates to the problems you are experiencing.

I will let you know where things lead.

Kendal


hholzgra commented 1 year ago

I have similar problems, but mostly only when starting multiple VMs at around the same time.

I am building base boxes for each minor release of our own software we ever did, using a GNU make infrastructure. With a non-parallel build this usually succeeds for all 350+ VMs, but obviously takes its time.

When utilizing all cores, e.g. with "make -j8" on my 8 core AMD Ryzen desktop machine, I'll only see one or two VM boxes failing to build on a good day, which then succeed in a second run. On a bad day it is more like one in ten or more failing, and it sometimes takes running "make" three or four times until all boxes have been rebuilt.

I've tried to add some random delay of 0 to 20 seconds at the start of the actual build script invoked by "make", so that "vagrant up" of multiple machines is less likely to happen at the very same second, and that seems to have improved the situation a little bit, but not fully ...
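
The jitter itself is trivial; illustrated here in Ruby (the launch command is a placeholder for the real build step invoked by "make"):

```ruby
# Illustration of the staggering approach described above: sleep a random
# 0-20 seconds before "vagrant up" so parallel launches are less likely
# to hit the very same second.
delay = rand(0..20)       # seconds of jitter
# sleep(delay)            # commented out so this sketch runs instantly
# system("vagrant up")    # placeholder for the real build step
```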

I can reproduce this on three different machines, one AMD Ryzen 9, one AMD Ryzen 7, and one Intel Core i7, all running latest Vagrant and VirtualBox by now, on either Ubuntu 20.04 or 22.04.

Guest OSes are mostly different Ubuntu LTS releases, plus a very small number of other Linux distributions (Debian, openSUSE, CentOS, AlmaLinux, and Rocky Linux).

appst commented 1 year ago

Hi Paul,

I have been diligently trying to dig to the bottom of this issue. My recent tests have run into the thousands of instances. Note that these are launched serially, with the previous instance fully shut down before the next begins.

I posted my findings thus far here: https://forums.virtualbox.org/viewtopic.php?f=7&t=108454

I'm not confident it is the fault of VirtualBox, however; perhaps it is how Vagrant uses VirtualBox. The reason I say that is that my tests include launching instances with VBoxManage alone, and those don't hang. It is when I launch the box, which incorporates the same vmdk file, through Vagrant that the hanging begins. Sometimes it hangs on the first or second instance; sometimes it won't happen until around the fiftieth.

As a pure shot-in-the-dark question: the log of a failed instance contains the following line...

00:00:05.923578 GIM: KVM: VCPU 0: Enabled system-time struct.

I just know that an instance with zero CPUs allotted to it would not boot, and that perhaps another line with "VCPU 1" should appear in the logs but does not.

I thank you for your interest in this, and let me know if I can help from my end.

kendal


phinze commented 1 year ago

Hi Kendal,

Thanks for the continued updates on your testing. You're asking the same question that was in my mind about the hangs: is this at the VirtualBox layer or the Vagrant layer? The difference in behavior between vanilla VBoxManage and Vagrant invocations of the same VMDK is interesting.

This is also quite interesting:

What i have noticed so far, in my environment, is that a generic box from hashicorp or canonical do not hang, while what i build with Packer eventually hangs 2, or 10, or 50, or so iterations in.

I'd love it if you could share a minimal Vagrantfile and VMDK (or Packer config) that reproduces the issue so we could try and reproduce the hang on our side.

As for your question about the KVM: VCPU 0 line: I'm pretty sure that VCPUs are zero-indexed, so I would expect that to be a line referencing the first VCPU. It's curious that the line does not show up in your successful runs though.

Thanks for all your work on this so far!

appst commented 1 year ago

Paul,

The different log clippings I pointed out are from running "vagrant up". There is nothing to report from using vanilla VBoxManage directly, since that does not hang.

My build process is currently riddled with tests and changes for this issue. All four of my test systems have now been running without a hang for the last hour. That has never happened before, so I may have isolated the issue. The most fundamental change stems from an epiphany I had about the box.ovf file.

My box files are built in these stages: 1) I make a box with Packer and add it to the local Vagrant install. 2) I create an instance from the Packer box and further provision it. 3) I export the box with "VBoxManage export". 4) I tar up the exported files into a box file.

I was under the assumption that the box.ovf file was static; I was not aware that VBoxManage exported an updated copy reflecting the provisioning VirtualBox recorded outside of Packer. I was thinking it was just the vmdk file that would change. In my environment, box.ovf was being updated to include KVM paravirtualization. I was then launching instances from that box on systems that have no KVM. If this is the real problem, then hooray! I can clean up and move on. However, if that is the case, it is curious how instantiations on other systems mostly worked rather than consistently failing due to a bad configuration. It is also curious how I was noticing the hangs on the same system the box was generated on, which has KVM. No cigars yet!

I am now using the original Packer box.ovf in my final box file. I have a lot of clean-up to do to know whether or not that change alone made the difference.
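
If it really is the paravirtualization setting drifting in the exported box.ovf, it can, as far as I know, also be pinned from the Vagrantfile side so a stale value in the box cannot win. A sketch (provider block only, untested in my setup):

```ruby
# Sketch: explicitly pin the paravirtualization interface on the VM that
# Vagrant creates, overriding whatever the box's OVF recorded. "default"
# lets VirtualBox pick the appropriate interface for the host.
Vagrant.configure("2") do |config|
  config.vm.provider "virtualbox" do |vb|
    vb.customize ["modifyvm", :id, "--paravirtprovider", "default"]
  end
end
```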

I greatly appreciate your willingness to help out with this. In a few days, after I clean things up, I will let you know my final outcome.

Perhaps my error with box.ovf may have aggravated things to cause an issue that you want to look into anyway - it seemed like the kvm handling could have been inconsistent in all iterations of my testing. If you wish to pursue that then let me know and I will send you whatever I can to help.

Cheerz! Kendal


nstng commented 1 year ago

Hi, the following is not really a minimal example, but some statistics on what we see in our setup.

setup

The following runs are based on a test workflow on my fork, where

runs

appst commented 1 year ago

Hi Paul,

The majority of my Vagrant provisioning tests have involved the following...

vagrant init mybox

while true; do
  vagrant up
  sleep 60
  vagrant halt -f
  sleep 30
  vagrant destroy -f
  sleep 30
done

What I have noticed is that boxes built with VirtualBox --firmware=bios occasionally fail to boot (say after 10, 20, or 50 tries), while boxes built with VirtualBox --firmware=efi have never failed yet, over the couple of thousand iterations I have done so far.

I am going to use the efi firmware from now on and see where that leads.
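
For anyone wanting to try the same comparison without rebuilding their boxes, the firmware can, I believe, also be switched per-VM from the Vagrantfile (sketch, untested):

```ruby
# Sketch: force EFI firmware on the VirtualBox VM that Vagrant creates,
# equivalent to "VBoxManage modifyvm <vm> --firmware efi".
Vagrant.configure("2") do |config|
  config.vm.provider "virtualbox" do |vb|
    vb.customize ["modifyvm", :id, "--firmware", "efi"]
  end
end
```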

Kendal


hicham1se commented 1 year ago

I have the same problem with the same box: ubuntu/focal64. I think if everyone specifies the box they're working with, we can understand more about this VM behavior.

appst commented 1 year ago

My impression is that it is not a box issue. My current feeling is that this is an issue with how Vagrant handles VirtualBox. The reason I say so is that my testing of pure VirtualBox with my.vmdk does not fail. Using the same my.vmdk, my testing with Vagrant using VirtualBox on Hyper-V does NOT fail, but my testing with Vagrant using VirtualBox without Hyper-V DOES eventually fail.

I reserve the right to change my mind later. ;-)


hicham1se commented 1 year ago

I'm still working on it. Right now I'm trying to force a manual password prompt directly on the VM, but I'm starting to think it's the VM's problem. Let's give it some time to become clear.

And I reserve the right to change my mind as well. B-)


hholzgra commented 1 year ago

I have been seeing the problem with all kinds of ubuntu/*64 base boxes, at least back to ubuntu/trusty64, and also with centos/6 and centos/7 base boxes, too.

appst commented 1 year ago

I think the box is a red herring. It is my belief that any box will eventually fail with Vagrant using VirtualBox and its native hypervisor.


meetAssassin commented 11 months ago

I have been seeing the problem with all kinds of ubuntu/*64 base boxes, at least back to ubuntu/trusty64, and also with centos/6 and centos/7 base boxes, too.

I have been having the same kind of issue with all ubuntu/*64 boxes. The only one working for me is ubuntu/jammy64. If you have found any kind of solution, please let me know.

appst commented 10 months ago

Unfortunately, I have never solved this issue. Currently, in development, I live with restarting the occasional crashed instance. For production, I know I have to migrate away from VirtualBox/Vagrant. If you have any better luck, I would be happy to hear about it!

cheerz
