hashicorp / vagrant

Vagrant is a tool for building and distributing development environments.
https://www.vagrantup.com
Other
26.24k stars 4.44k forks source link

Every few days vagrant can't connect to vmware rest API, vmrest process becomes a zombie #11955

Open stolowski opened 4 years ago

stolowski commented 4 years ago

Vagrant version

2.2.10

VMware Fusion 12

Host operating system

Mac OS 10.15.7 (19H2)

Guest operating system

Linux, Windows 10.

Vagrantfile

N/A

Debug output

N/A

Expected behavior

vagrant commands (up, ssh ...) work all the time.

Actual behavior

Every few days vagrant commands stop working and error out accessing vmware REST API. I can confirm vmrest API is not available by connecting with a web browser at the given http port (when it works, then I can browse these APIs with a web browser).

When this happens, vmrest process appears to be a zombie:

$ ps aux|grep vmrest                                                  
root             50024   0,0  0,0        0      0   ??  Z     6:11     0:00.00 (vmrest)

Normally (when it works, it looks like this:

$ ps aux|grep vmrest
root               344   0,0  0,1  5316588  34984   ??  S     6:23     0:00.33 /Applications/VMware Fusion.app/Contents/Library/vmrest

Steps to reproduce

No clear steps. Happens randomly every few days, I keep my Mac running for several days and only close the lid overnight. I don't think it is strictly related to waking it from sleep in the morning, it also happened close to end of my work day.

References

I've created a VMware support ticket as well, but it is at early stages yet.

I suspect the problem might be with VMware/vmrest; on the other hand it seems that vmrest is started on demand (triggered / spawned by vagrant commands), and if vmrest becomes a zombie, it may suggest it's parent process quit abruptly. I'd appreciate any thoughts or ideas on debugging this, or any suggestions about how to recover (e.g. if restarting any daemons would help), as at the moment, in order to recover and have vagrant operational again I have to restart my system whenever this happens. I think this issue started happening after upgrading to Fusion 12 (and upgrading vagrant for it).

chrisroberts commented 3 years ago

@stolowski I'll be looking to reproduce this behavior today and hopefully identify and resolve the issue fairly quickly.

stolowski commented 3 years ago

@stolowski I'll be looking to reproduce this behavior today and hopefully identify and resolve the issue fairly quickly.

Thank you! To be honest, I'm a little skeptical when it comes to reproducing it in a short time; it just happened here again after a few days, and after a few hours of working today without experiencing the problem. I'm attaching vagrant debug in case it shows anything that may give a clue...

If you have any suggestions where to look for more logs/debug, please let me know.

vagrant.log

stolowski commented 3 years ago

This just happened again, so 5 days from previous occurence. After restarting the system I took a look at the processes with pstree and under normal conditions I see this:

 |-+= 00318 root /opt/vagrant-vmware-desktop/bin/vagrant-vmware-utility api -port=9922 -config-file=/opt/vagrant-vmware-desktop/config/service.hcl
 | \--- 02980 root /Applications/VMware Fusion.app/Contents/Library/vmrest

which means vagrant-vmware-utility is in control of vmrest. Is there a service related to vagrant-vmware-utility that I could restart as a workaround when this happens?

stolowski commented 3 years ago

@chrisroberts I had a hunch and checked Mac's Console app and found two crash reports for vagrant-vmware-utility that match the times where vmrest wouldn't work anymore; attached below.

crash1.txt crash2.txt

stolowski commented 3 years ago

This just happened again. I was hoping sudo launchctl start com.vagrant.vagrant-vmware-utility would work around the problem, but it didn't help (the service would start, but vmrest does not respond). @chrisroberts Is there any workaround you can think of that I could try next time? Any logs or other information I should get when it happens again? This is a very frustrating problem...

stolowski commented 3 years ago

I know recent update to vagrant/vagrant-vmware-utility was aimed at fixing the issue of zombie vmrest process and I think it indeed fixed it, but I just experienced the problem again - this time there there was no vmrest zombie process, vagrant-vmware-utility was running, but it didn't not (re)start vmrest, so I was getting the usual "Get http://localhost:49743/api/vmnet: GET http://localhost:49743/api/vmnet giving up after 5 attempts" error from vagrant commands.

This time however I was able to recover with:

$ sudo launchctl stop com.vagrant.vagrant-vmware-utility
$ sudo launchctl start com.vagrant.vagrant-vmware-utility

which also started vmrest.

So I think this is going in the right direction but it would be great if com.vagrant.vagrant-vmware-utility service automatically restarted vmrest if needed.

chrisroberts commented 3 years ago

@stolowski had you put the machine to sleep between the time it was functioning correctly, and when it began to error with the vmrest process not being started?

Also, if you could toss in the logs from /Library/Application Support/vagrant-vmware-utility/service.log around when you experience this, that would be great. I'll work on trying to repro this again with the latest utility.

Thanks!

stolowski commented 3 years ago

@chrisroberts No, I didn't put the machine to sleep around the time the issue occurred. I woke it in the morning around 9:00, started a vm with vagrant up followed by vagrant ssh, then worked on it over ssh for a couple of hours (without issuing any vagrant commands for around 4 hours which seems to be confirmed by the log). The problem occurred around 13:09 when I tried to issue a vagrant command.

Attaching the log for that day, I think you should be looking for 2020/11/13 13:09. service.log.gz

(there are a couple of vagrant-vmware-utility restarts after that time which are my attempts to recover and some playing with automating the recovery).

uberjay commented 3 years ago

Ah, I've been running into this for quite some time now, so it's helpful to see what the cause is. I'd gotten so far as figuring out that the vmrest was in zombie state (a couple months ago), but assumed it was a VMware bug and didn't have any more time to spend tracking it down.

This time, when it failed, I was able to restore vagrant vmware rest api service by restarting com.vagrant.vagrant-vmware-utility with launchctl -- thanks for that!

If there's any additional debug info you might need, I could poke around as well.

stolowski commented 3 years ago

FWTW I haven't experienced this problem with vagrant 2.2.14 for good 3 weeks, but I've also upgraded to Big Sur on the day it was officially released (on a side note, I don't recommend switching to Big Sur without doing some research/reading first, i.e. there are serious issues with nested virtualization with Fusion 12 on it). All my previous problems related to that issue were from Catalina.

UPDATE: oh well, I spoke too soon, I just hit it again shortly after waking my MBP in the morning.... The workaround with restarting the service worked.

kernelsmith commented 1 year ago

This has been happening to me and those on my team for years, at least as far back as this issue report. This happens so often that we have a shell script that runs:

sudo launchctl unload -w /Library/LaunchDaemons/com.vagrant.vagrant-vmware-utility.plist
sudo launchctl load -w /Library/LaunchDaemons/com.vagrant.vagrant-vmware-utility.plist
uberjay commented 1 year ago

This has been happening to me and those on my team for years, at least as far back as this issue report. This happens so often that we have a shell script that runs:

Yep, me too. It's just the normal part of the routine for me. Eventually, routine will transition into tradition. It'll feel good and right and true, but we'll have long since forgotten why we're doing it!! 😁