Closed by x-yuri 9 years ago
You're using a custom inventory file (`provisioning/hosts.dev`), but Vagrant dynamically allocates the forwarded port for SSH connections.
Note that for simple use cases, I'd recommend relying on the auto-generated inventory by simply removing the `ansible.inventory_path` option from your `Vagrantfile`.
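A minimal `Vagrantfile` along those lines might look like this (a sketch only; the box name and playbook path are assumptions, not taken from the reporter's actual project):

```ruby
# -*- mode: ruby -*-
# vi: set ft=ruby :

Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/trusty64"                # assumed box

  config.vm.provision :ansible do |ansible|
    ansible.playbook = "provisioning/playbook.yml" # assumed path
    # No ansible.inventory_path here: Vagrant writes its own inventory
    # (under .vagrant/provisioners/ansible/) with the dynamically
    # forwarded SSH port already filled in.
  end
end
```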
If you still encounter problems, could you please gist your `Vagrantfile` and `provisioning/hosts.dev`, and maybe a playbook example that can reproduce the issue?
I'll try to set up a testing VM. Meanwhile, could you elaborate on the "but Vagrant dynamically allocates the forwarded port for SSH connections" part? Actually, I don't really understand what's happening. As far as I can tell, `vagrant` always sets up port forwarding like this: `2222 (host) => 22 (guest)`. At least for this VM. And `ansible` seems to connect using port `22`. What strikes me is: why does `ansible` even manage to connect? Shouldn't it be using port `2222`?
UPD Well, it must be because ansible connects over the host-only network. But I still don't understand how Vagrant "allocates the forwarded port for SSH connections."
> Well, it must be because ansible connects using host-only network
Yes indeed, my comment about port forwarding only referred to this host-only connection, which is Vagrant's default SSH path. In your case, you've defined an extra private (or public) network (192.168.88.10), where using the default port 22 is absolutely correct.
With a quick Internet search, I found that there are several SSH issues when running Mac OS X Mavericks (and maybe affecting Yosemite as well). I could not check them, as I only have an OS X 10.9 box, but please have a look at:
If you still encounter problems, could you please provide the following information:
Hint: for faster debugging, you can also execute the `ansible-playbook` command generated by Vagrant directly from the shell:

```
ANSIBLE_FORCE_COLOR=true ANSIBLE_HOST_KEY_CHECKING=false PYTHONUNBUFFERED=1 ansible-playbook --private-key=/home/yuri/.vagrant.d/insecure_private_key --user=vagrant --vault-password-file=provisioning/.vault_pass_dev --inventory-file=provisioning/hosts.dev -vvvv --limit='all' provisioning/playbook.yml
```

(You'll see this command when `ansible.verbose` is activated.)
@x-yuri I'm closing this until we receive more information from your part.
Sorry for the delay. So, with this setup it still fails occasionally:
```
$ cat ./Vagrantfile
# -*- mode: ruby -*-
# vi: set ft=ruby :

VAGRANTFILE_API_VERSION = "2"

Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|
  config.vm.box = "ubuntu/trusty64"
  config.vm.network "private_network", ip: "192.168.88.11"
  config.vm.provision :ansible do |ansible|
    ansible.playbook = "playbook.yml"
    ansible.inventory_path = "hosts"
    ansible.verbose = 'vvvv'
  end
end

$ cat ./hosts
[default]
h1 ansible_ssh_host=192.168.88.11 ansible_ssh_user=vagrant

$ cat ./playbook.yml
---
- hosts: default
  tasks:
    - name: test task
      shell: echo test
```
> Hint: for faster debugging, you also can execute the ansible-playbook generated by Vagrant directly from the shell:
> `ANSIBLE_FORCE_COLOR=true ANSIBLE_HOST_KEY_CHECKING=false PYTHONUNBUFFERED=1 ansible-playbook --private-key=/home/yuri/.vagrant.d/insecure_private_key --user=vagrant --vault-password-file=provisioning/.vault_pass_dev --inventory-file=provisioning/hosts.dev -vvvv --limit='all' provisioning/playbook.yml` (You'll see these command when ansible.verbose is activated)
I thought the problem had to do with `ansible` running too early. Was I wrong? Anyway, I tested it this way: `vagrant destroy -f && vagrant up`.
> You're using a custom inventory file (provisioning/hosts.dev), but Vagrant dynamically allocates the forwarded port for SSH connections.
Still, what exactly do you mean by dynamically allocating ports?
The issues at those links have nothing to do with mine. The first has an `ssh: connect to host 127.0.0.1 port 22: Connection refused` error, and the second `fatal: [127.0.0.1] => to use the 'ssh' connection type with passwords, you must install the sshpass program`.
UPD I just confirmed that it must have to do with `ansible` running too early. I ran `vagrant destroy -f && vagrant up` until I got the error. Then I ran `ANSIBLE_FORCE_COLOR=true ANSIBLE_HOST_KEY_CHECKING=false PYTHONUNBUFFERED=1 ansible-playbook --private-key=/home/yuri/.vagrant.d/insecure_private_key --user=vagrant --limit='default' --inventory-file=hosts -vvvv playbook.yml` and it succeeded:
```
$ vagrant destroy -f && vagrant up
==> default: Forcing shutdown of VM...
==> default: Destroying VM and associated drives...
==> default: Running cleanup tasks for 'ansible' provisioner...
Bringing machine 'default' up with 'virtualbox' provider...
==> default: Importing base box 'ubuntu/trusty64'...
==> default: Matching MAC address for NAT networking...
==> default: Checking if box 'ubuntu/trusty64' is up to date...
==> default: Setting the name of the VM: 5_default_1417385755633_17573
==> default: Clearing any previously set forwarded ports...
==> default: Clearing any previously set network interfaces...
==> default: Preparing network interfaces based on configuration...
    default: Adapter 1: nat
    default: Adapter 2: hostonly
==> default: Forwarding ports...
    default: 22 => 2222 (adapter 1)
==> default: Booting VM...
==> default: Waiting for machine to boot. This may take a few minutes...
    default: SSH address: 127.0.0.1:2222
    default: SSH username: vagrant
    default: SSH auth method: private key
    default: Warning: Connection timeout. Retrying...
    default: Warning: Remote connection disconnect. Retrying...
    default: Warning: Remote connection disconnect. Retrying...
==> default: Machine booted and ready!
==> default: Checking for guest additions in VM...
==> default: Configuring and enabling network interfaces...
==> default: Mounting shared folders...
    default: /vagrant => /home/yuri/_/5
==> default: Running provisioner: ansible...
ANSIBLE_FORCE_COLOR=true ANSIBLE_HOST_KEY_CHECKING=false PYTHONUNBUFFERED=1 ansible-playbook --private-key=/home/yuri/.vagrant.d/insecure_private_key --user=vagrant --limit='default' --inventory-file=hosts -vvvv playbook.yml

PLAY [default] ****************************************************************

GATHERING FACTS ***************************************************************
<192.168.88.11> ESTABLISH CONNECTION FOR USER: vagrant
<192.168.88.11> REMOTE_MODULE setup
<192.168.88.11> EXEC ['ssh', '-C', '-tt', '-vvv', '-o', 'ControlMaster=auto', '-o', 'ControlPersist=60s', '-o', 'ControlPath=/home/yuri/.ansible/cp/ansible-ssh-%h-%p-%r', '-o', 'StrictHostKeyChecking=no', '-o', 'Port=22', '-o', 'IdentityFile="/home/yuri/.vagrant.d/insecure_private_key"', '-o', 'KbdInteractiveAuthentication=no', '-o', 'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey', '-o', 'PasswordAuthentication=no', '-o', 'User=vagrant', '-o', 'ConnectTimeout=10', '192.168.88.11', "/bin/sh -c 'mkdir -p $HOME/.ansible/tmp/ansible-tmp-1417385811.38-111995198978464 && chmod a+rx $HOME/.ansible/tmp/ansible-tmp-1417385811.38-111995198978464 && echo $HOME/.ansible/tmp/ansible-tmp-1417385811.38-111995198978464'"]
fatal: [h1] => SSH encountered an unknown error. The output was:
OpenSSH_6.7p1, OpenSSL 1.0.1j 15 Oct 2014
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: auto-mux: Trying existing master
debug1: Control socket "/home/yuri/.ansible/cp/ansible-ssh-192.168.88.11-22-vagrant" does not exist
debug2: ssh_connect: needpriv 0
debug1: Connecting to 192.168.88.11 [192.168.88.11] port 22.
debug2: fd 3 setting O_NONBLOCK
debug1: connect to address 192.168.88.11 port 22: Connection timed out
ssh: connect to host 192.168.88.11 port 22: Connection timed out

TASK: [test task] *************************************************************
FATAL: no hosts matched or all hosts have already failed -- aborting

PLAY RECAP ********************************************************************
           to retry, use: --limit @/home/yuri/playbook.retry
h1                         : ok=0    changed=0    unreachable=1    failed=0

Ansible failed to complete successfully. Any error output should be
visible above. Please fix these errors and try again.

$ ANSIBLE_FORCE_COLOR=true ANSIBLE_HOST_KEY_CHECKING=false PYTHONUNBUFFERED=1 ansible-playbook --private-key=/home/yuri/.vagrant.d/insecure_private_key --user=vagrant --limit='default' --inventory-file=hosts -vvvv playbook.yml

PLAY [default] ****************************************************************

GATHERING FACTS ***************************************************************
<192.168.88.11> ESTABLISH CONNECTION FOR USER: vagrant
<192.168.88.11> REMOTE_MODULE setup
<192.168.88.11> EXEC ['ssh', '-C', '-tt', '-vvv', '-o', 'ControlMaster=auto', '-o', 'ControlPersist=60s', '-o', 'ControlPath=/home/yuri/.ansible/cp/ansible-ssh-%h-%p-%r', '-o', 'StrictHostKeyChecking=no', '-o', 'Port=22', '-o', 'IdentityFile="/home/yuri/.vagrant.d/insecure_private_key"', '-o', 'KbdInteractiveAuthentication=no', '-o', 'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey', '-o', 'PasswordAuthentication=no', '-o', 'User=vagrant', '-o', 'ConnectTimeout=10', '192.168.88.11', "/bin/sh -c 'mkdir -p $HOME/.ansible/tmp/ansible-tmp-1417386104.06-113950359141030 && chmod a+rx $HOME/.ansible/tmp/ansible-tmp-1417386104.06-113950359141030 && echo $HOME/.ansible/tmp/ansible-tmp-1417386104.06-113950359141030'"]
<192.168.88.11> PUT /tmp/tmpiwjrvt TO /home/vagrant/.ansible/tmp/ansible-tmp-1417386104.06-113950359141030/setup
<192.168.88.11> EXEC ['ssh', '-C', '-tt', '-vvv', '-o', 'ControlMaster=auto', '-o', 'ControlPersist=60s', '-o', 'ControlPath=/home/yuri/.ansible/cp/ansible-ssh-%h-%p-%r', '-o', 'StrictHostKeyChecking=no', '-o', 'Port=22', '-o', 'IdentityFile="/home/yuri/.vagrant.d/insecure_private_key"', '-o', 'KbdInteractiveAuthentication=no', '-o', 'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey', '-o', 'PasswordAuthentication=no', '-o', 'User=vagrant', '-o', 'ConnectTimeout=10', '192.168.88.11', u"/bin/sh -c 'LANG=C LC_CTYPE=C /usr/bin/python /home/vagrant/.ansible/tmp/ansible-tmp-1417386104.06-113950359141030/setup; rm -rf /home/vagrant/.ansible/tmp/ansible-tmp-1417386104.06-113950359141030/ >/dev/null 2>&1'"]
ok: [h1]

TASK: [test task] *************************************************************
<192.168.88.11> ESTABLISH CONNECTION FOR USER: vagrant
<192.168.88.11> REMOTE_MODULE command echo test #USE_SHELL
<192.168.88.11> EXEC ['ssh', '-C', '-tt', '-vvv', '-o', 'ControlMaster=auto', '-o', 'ControlPersist=60s', '-o', 'ControlPath=/home/yuri/.ansible/cp/ansible-ssh-%h-%p-%r', '-o', 'StrictHostKeyChecking=no', '-o', 'Port=22', '-o', 'IdentityFile="/home/yuri/.vagrant.d/insecure_private_key"', '-o', 'KbdInteractiveAuthentication=no', '-o', 'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey', '-o', 'PasswordAuthentication=no', '-o', 'User=vagrant', '-o', 'ConnectTimeout=10', '192.168.88.11', "/bin/sh -c 'mkdir -p $HOME/.ansible/tmp/ansible-tmp-1417386115.24-57373581652877 && chmod a+rx $HOME/.ansible/tmp/ansible-tmp-1417386115.24-57373581652877 && echo $HOME/.ansible/tmp/ansible-tmp-1417386115.24-57373581652877'"]
<192.168.88.11> PUT /tmp/tmpr5GMav TO /home/vagrant/.ansible/tmp/ansible-tmp-1417386115.24-57373581652877/command
<192.168.88.11> EXEC ['ssh', '-C', '-tt', '-vvv', '-o', 'ControlMaster=auto', '-o', 'ControlPersist=60s', '-o', 'ControlPath=/home/yuri/.ansible/cp/ansible-ssh-%h-%p-%r', '-o', 'StrictHostKeyChecking=no', '-o', 'Port=22', '-o', 'IdentityFile="/home/yuri/.vagrant.d/insecure_private_key"', '-o', 'KbdInteractiveAuthentication=no', '-o', 'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey', '-o', 'PasswordAuthentication=no', '-o', 'User=vagrant', '-o', 'ConnectTimeout=10', '192.168.88.11', u"/bin/sh -c 'LANG=C LC_CTYPE=C /usr/bin/python /home/vagrant/.ansible/tmp/ansible-tmp-1417386115.24-57373581652877/command; rm -rf /home/vagrant/.ansible/tmp/ansible-tmp-1417386115.24-57373581652877/ >/dev/null 2>&1'"]
changed: [h1] => {"changed": true, "cmd": "echo test", "delta": "0:00:00.023101", "end": "2014-11-30 22:22:28.782800", "rc": 0, "start": "2014-11-30 22:22:28.759699", "stderr": "", "stdout": "test"}

PLAY RECAP ********************************************************************
h1                         : ok=2    changed=1    unreachable=0    failed=0
```
UPD I'm running Arch Linux, `vagrant-1.6.5`, and `ansible-1.7.2`.
@x-yuri thanks for all the info. This is odd... What I see here is that with your Arch Linux you're using one of the most recent versions of OpenSSH (version 6.7 was released on October 6, 2014). Ubuntu 14.04 is using `6.6p1`, Homebrew for Mac provides `6.6.p1_1`.
What VM provider are you actually using? KVM? VirtualBox?
I'm reopening this issue and flagging it `hosts/arch` for now. I would recommend searching and asking in the Arch community as well (https://wiki.archlinux.org/index.php/Vagrant, https://bbs.archlinux.org/search.php).
@x-yuri I renamed the issue. Does the `vagrant ssh` command work?
I'm using `virtualbox`, and `vagrant ssh` works. It looks like `ansible` starts too early, while the `ssh` connection cannot yet be established.
Please, can you execute `VAGRANT_LOG=debug vagrant provision` and gist the full log output, and maybe also do the same for `vagrant ssh` (for comparison)?
@x-yuri the last example that you gisted does not fully correspond to the original problem report:
That said, I agree with your analysis that at least the "private network (192.168.88.11)" connectivity is not ready when the ansible provisioner is started.
Next, could you please:

1. Remove the `ansible.inventory_path` option (to use the auto-generated inventory), and try `vagrant destroy -f && VAGRANT_LOG=debug vagrant up`. I expect that everything will run fine, as the basic localhost network is also used by other vagrant internals to ssh into the VM during the bootstrap.
2. Re-configure back to `ansible.inventory_path = "hosts"` and again execute `vagrant destroy -f && VAGRANT_LOG=debug vagrant up`, to try to understand why or how the private network is slow to be enabled.

Note: I updated the issue title again (to help internet searches).
@x-yuri what are your exact versions for:
Well, if you expected me to gist `vagrant provision` while it fails, then first, I believe I couldn't possibly do that, because `vagrant provision` is run after `vagrant up`, and I didn't see it failing. However, I didn't test it thoroughly. Additionally, if the reason is `ansible` starting too early, we should inspect `vagrant up`, not `vagrant provision`.
> it now uses the auto-generated ansible inventory (--inventory-file=/home/yuri/_/5/.vagrant/provisioners/ansible/inventory)
Sorry, I was experimenting with commenting the corresponding line out and forgot to uncomment it.
> Re-configured back to ansible.inventory_path = "hosts" and again execute vagrant destroy -f && VAGRANT_LOG=debug vagrant up to try to understand why or how the private network is slow to be enabled.
To make things clear (I'm not sure you noticed the point), it's not just about re-executing `vagrant`: I've got to run it until it fails. I'm going to try this command this time: `for i in {1..10}; do vagrant destroy -f && VAGRANT_LOG=debug vagrant up || break; done`.
As for the versions, `virtualbox` is version `4.3.18`. Then, Arch Linux has no such thing as a version; the closest thing I can probably provide is the last time I ran `pacman -Syu`, which was on 2014-11-17 01:29. And the kernel is version `3.17.3`.
Also, I wanted to make it clear that I didn't do any tweaks to the kernel, and I made only minimal adjustments to the configuration of the software I'm using, just the things I needed. Also, I hope it's not only about Arch Linux :) But I've probably yet to prove that.
@x-yuri sorry if I was unclear, but no worries, we'll keep digging into it bravely :hospital:. Also, thank you for the software version clarifications; you're running only on the bleeding edge :rocket:, and that could be a source of "yet unknown" integration troubles, that's all.
> if the reason is ansible starting too early, we should inspect vagrant up, not vagrant provision
Correct, that was my mistake; I should have asked you to run `VAGRANT_LOG=debug vagrant up` instead, but I mixed myself up with copy-paste and so on. Sorry for that :crying_cat_face:.

Note that `vagrant up` automatically kicks off the provisioner(s) on the VM's initial bootstrap, hence the value of using `VAGRANT_LOG=debug vagrant up` (with a new VM to be created).
So, my questions are:

1. With your custom inventory `hosts`: does `vagrant destroy -f; vagrant up` systematically fail (ansible timeout) with your current host setup?
2. With the auto-generated inventory: does `vagrant destroy -f; vagrant up` succeed in running the ansible provisioner without a timeout error?

Just chiming in that I am experiencing this as well. I also have a custom inventory file, which contains the following: `vagrant ansible_ssh_host=192.168.33.40`. I'm doing this so I can set variables specific to development in `host_vars/vagrant`.
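For illustration, a setup like the one just described might look as follows (a sketch only; the box, playbook path, and inventory file name are assumptions, not taken from the commenter's actual project):

```ruby
# Vagrantfile (sketch)
Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/trusty64"                        # assumed box
  config.vm.network "private_network", ip: "192.168.33.40"

  config.vm.provision :ansible do |ansible|
    ansible.playbook = "playbook.yml"                      # assumed path
    # Custom inventory file (name assumed) containing:
    #   vagrant ansible_ssh_host=192.168.33.40
    # so that development-specific variables can live in host_vars/vagrant.
    ansible.inventory_path = "inventory"
  end
end
```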
The problem is intermittent and running vagrant provision manually works fine after the VM finishes booting.
@tkellen Which OS is it?
@gildegoma This is the output for failed and successful runs. It didn't occur to me that `less` would save the output with escape sequences, so I had to remove them with this script afterwards. As for making it fail with the auto-generated inventory, I couldn't reproduce that yet. I'll probably leave my computer trying to make it happen overnight. I hope I didn't miss any of your requests.
Multiple hosts. I've seen it on OS X and Ubuntu.
@tkellen Thanks for your feedback, which obviously broadens the scope of the problem (I'll update the issue title and labels again).
@x-yuri Awesome work fetching the logs, thanks a lot! I've started to check and compare the `vagrant-fail` and `vagrant-success` outputs. In both runs, the setup of the `192.168.88.x` private network succeeds and looks the same (VirtualBox commands and Ubuntu network reconfiguration), but maybe there is a random latency until `/sbin/ifup eth1` is really effective... and I don't think that at the moment Vagrant checks/waits until additional networks are operational before going further. This could explain why sometimes the ansible provisioner is fired too early, but I need to dig further into this (later, not today)...
But for the next investigation steps, I propose two kinds of possible "workarounds":

1. Increase the SSH connection timeout (e.g. `ansible.raw_arguments = ['--timeout=300']` via the `Vagrantfile`, or via a local `ansible.cfg` file if you prefer).

I also noticed that when `ansible-playbook` fails, we see this message (also reported in https://github.com/ansible/ansible/issues/7759):

```
debug1: auto-mux: Trying existing master
debug1: Control socket "/home/yuri/.ansible/cp/ansible-ssh-192.168.88.11-22-vagrant" does not exist
```

Usually, I don't think this should be a problem (SSH simply recreates the control socket file), but maybe it introduces some additional delay that possibly leads to the observed timeout error. So I think you also could try to:

2. Disable SSH connection persistence with `ansible.raw_ssh_args = ['ControlPersist=no']`.

It would be of great help if you could try these two custom configurations :sos:
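Put together in the thread's example `Vagrantfile`, the two workarounds might look like this (a sketch, not a verified fix; note that `raw_ssh_args` entries appear to be passed through as raw SSH arguments, so the option is written here with an `-o` prefix, which is an assumption about the correct form):

```ruby
Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/trusty64"
  config.vm.network "private_network", ip: "192.168.88.11"

  config.vm.provision :ansible do |ansible|
    ansible.playbook = "playbook.yml"
    ansible.inventory_path = "hosts"
    ansible.verbose = 'vvvv'
    # Workaround 1: raise ansible's SSH connection timeout (default: 10s)
    ansible.raw_arguments = ['--timeout=300']
    # Workaround 2: don't keep a persistent SSH control socket around
    ansible.raw_ssh_args = ['-o ControlPersist=no']
  end
end
```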
@tkellen To make it clear, was the error exactly "Connection timed out"?
> I've seen it on OS X and Ubuntu.
My coworkers had similar complaints, and that's exactly the OSes they use. It's just hard to make them confirm it.
As for `ansible.raw_arguments = ['--timeout=300']`, it succeeded `352` times. UPD Something might be off here. I need to recheck this.
The way it looks, I'll soon be an expert at testing `vagrant` :)
Just to have it stored somewhere, this command has proved to be easier to use:

```
for (( i = 1; 1; i++ )); do vagrant destroy -f && VAGRANT_LOG=debug vagrant up || break; echo $i > /dev/tty; done &> log
```
The thing is, this issue is hard to reproduce. I don't know how many times I should let it recreate the VM to be sure that the issue is gone. Supposedly the computer must be under heavy load for it to manifest itself. However, so far I haven't managed to come up with a reliable way to make it happen, let alone a fast one.
And I'm not so sure now, that the issue can't happen with auto-generated inventory.
So, do you have any suggestions for how to force it to happen? One of the options would be to do it from inside the VM. Also, could you point out where in the code `vagrant` waits for the network to be operational?
> The thing is, this issue is hard to reproduce.
@x-yuri You did more than pretty well, and I thank you again so much for your efforts. I don't think we need any more tests on your side :heart: (and sorry I asked for too many things; it is a bit hard to be complete in asynchronous communication).
> And I'm not so sure now, that the issue can't happen with auto-generated inventory.
- The auto-generated inventory is 100% safe, because it uses the "official" `ssh_info` returned by the VM provider. The same SSH connectivity is used by all vagrant core components (e.g. to configure your private network) during the VM bootstrap phase.
- When using an extra (private or public) network address in a custom Ansible inventory, it is possible to encounter some random timeout errors (the subject of this problem report). To avoid these errors, it is recommended to increase the SSH connection timeout of the `ansible-playbook` command with `ansible.raw_arguments = ['--timeout=30']` (the default value is 10 seconds, and this should be empirically adjusted to your environment's performance).

> could you point me out where in the code vagrant waits for network to be operational
Here are some parts that may interest you:
I'm closing this issue with the following possible enhancement ideas to be evaluated in the future:

- add an `ansible.timeout` option to the Ansible provisioner, with a higher default value (e.g. 30 seconds)

@mitchellh what do you think? Are you open to accepting pull requests implementing the above ideas?
Not sure if it's vagrant's fault:

This error happens occasionally (not every time). So, first `vagrant` says the machine is up and ready, but then `ansible` says that it can't connect. Can this be fixed in any way? Could you suggest anything to further investigate the issue?