mitchellh / vagrant-aws

Use Vagrant to manage your EC2 and VPC instances.
MIT License
2.61k stars 574 forks source link

rsync: Failed to connect to new control master: Broken pipe #495

Closed kenorb closed 7 years ago

kenorb commented 7 years ago

Problem

The rsync fails when running: vagrant up --no-provision --provider=aws with the following errors:

Output:

00:00:01.107 + vagrant up --no-provision --provider=aws
00:00:03.273 Bringing machine 'MT-aws-Test-VM' up with 'aws' provider...
00:00:03.382 ==> MT-aws-Test-VM: Launching an instance with the following settings...
00:00:03.382 ==> MT-aws-Test-VM:  -- Type: t2.small
00:00:03.383 ==> MT-aws-Test-VM:  -- AMI: ami-a1e5ddb6
00:00:03.383 ==> MT-aws-Test-VM:  -- Region: us-east-1
00:00:03.383 ==> MT-aws-Test-VM:  -- Keypair: Foo
00:00:03.383 ==> MT-aws-Test-VM:  -- Subnet ID: subnet-2997c75e
00:00:03.383 ==> MT-aws-Test-VM:  -- User Data: yes
00:00:03.384 ==> MT-aws-Test-VM:  -- User Data: sed -i'.bak' -e 's/^\(Defaults\s\+requiretty\)/# \1/' /etc/sudoers
00:00:03.384 ==> MT-aws-Test-VM:  -- Block Device Mapping: [{"DeviceName"=>"/dev/sda1", "Ebs.VolumeSize"=>"12"}]
00:00:03.384 ==> MT-aws-Test-VM:  -- Terminate On Shutdown: 1
00:00:03.384 ==> MT-aws-Test-VM:  -- Monitoring: false
00:00:03.384 ==> MT-aws-Test-VM:  -- EBS optimized: false
00:00:03.385 ==> MT-aws-Test-VM:  -- Source Destination check: 
00:00:03.385 ==> MT-aws-Test-VM:  -- Assigning a public IP address in a VPC: false
00:00:03.385 ==> MT-aws-Test-VM:  -- VPC tenancy specification: default
00:00:05.973 ==> MT-aws-Test-VM: Waiting for instance to become "ready"...
00:00:32.953 ==> MT-aws-Test-VM: Waiting for SSH to become available...
00:01:38.812 ==> MT-aws-Test-VM: Machine is booted and ready for use!
00:01:45.617 ==> MT-aws-Test-VM: Rsyncing folder: /var/lib/jenkins/workspace/Test-VM/ => /vagrant
00:02:51.620 There was an error when attempting to rsync a synced folder.
00:02:51.620 Please inspect the error message below for more info.
00:02:51.620 
00:02:51.620 Host path: /var/lib/jenkins/workspace/Test-VM/
00:02:51.620 Guest path: /vagrant
00:02:51.620 Command: rsync --verbose --archive --delete -z --copy-links --no-owner --no-group --rsync-path sudo rsync -e ssh -p 22 -o ControlMaster=auto -o ControlPath=/tmp/ssh.211 -o ControlPersist=10m -o StrictHostKeyChecking=no -o IdentitiesOnly=true -o UserKnownHostsFile=/dev/null -i '/var/lib/jenkins/secrets/Foo.pem' --exclude .vagrant/ /var/lib/jenkins/workspace/Test-VM/ ubuntu@54.91.100.45:/vagrant
00:02:51.620 Error: Warning: Permanently added '54.91.100.45' (ECDSA) to the list of known hosts.
00:02:51.620 Timeout, server 54.91.100.45 not responding.
00:02:51.620 mux_client_request_session: read from master failed: Broken pipe
00:02:51.620 Failed to connect to new control master
00:02:51.620 rsync: connection unexpectedly closed (0 bytes received so far) [sender]
00:02:51.620 rsync error: unexplained error (code 255) at io.c(226) [sender=3.1.1]

What I've tried

Extra info

Added VAGRANT_LOG=debug to see more details and rsync is exactly the same command between working and failing scenario (for two different AMIs). The rsync command is:

00:01:14.963  INFO subprocess: Starting process: ["/usr/bin/rsync", "--verbose", "--archive", "--delete", "-z", "--copy-links", "--no-owner", "--no-group", "--rsync-path", "sudo rsync", "-e", "ssh -p 22 -o ControlMaster=auto -o ControlPath=/tmp/ssh.673 -o ControlPersist=10m -o StrictHostKeyChecking=no -o IdentitiesOnly=true -o UserKnownHostsFile=/dev/null -i '/var/lib/jenkins/secrets/Foo.pem'", "--exclude", ".vagrant/", "/var/lib/jenkins/workspace/Test-VM/", "ubuntu@54.158.16.149:/vagrant"]

Extra comments

The mitchellh/vagrant/issues/6780 is similar, but I don't have 'sudo: sorry, you must have a tty to run sudo' message, so it's related to ControlMaster.

In the attachment please find two Jenkins logs running in exactly the same way (the only difference is different AMI). This issue is repeatable. I've compared these two files (using DiffMerge) and they're almost exactly the same, apart the control master error that it cannot connect to new control master.

I don't think it's a problem with remote or sudoerrs file, as the Vagrant cannot even connect to the host using control master file. Although it can connect before (as per ubuntu@ip-172-30-2-117:~$ prompt being received).

This is quiet weird issue and I don't know how to debug this further.

Check the attached log files: consoleText-11.txt consoleText-12.txt

Related: GH-340, mitchellh/vagrant/issues/6702 (but this happens on Ubuntu Linux machine).

The workaround could be to comment out this line:

control_options = "-o ControlMaster=auto -o ControlPath=#{controlpath} -o ControlPersist=10m "

from helper.rb file. Although in my 1.8.1 I've got 3 lines to comment out. But then it fails with:

00:02:54.134 Command: rsync --verbose --archive --delete -z --copy-links --no-owner --no-group --rsync-path sudo rsync -e ssh -p 22 -o StrictHostKeyChecking=no -o IdentitiesOnly=true -o UserKnownHostsFile=/dev/null -i '/var/lib/jenkins/secrets/Foo.pem' --exclude .vagrant/ /var/lib/jenkins/workspace/Foo/ ubuntu@54.144.49.114:/vagrant
00:02:54.134 Error: Warning: Permanently added '54.144.49.114' (ECDSA) to the list of known hosts.
00:02:54.134 Timeout, server 54.144.49.114 not responding.
00:02:54.134 rsync: connection unexpectedly closed (0 bytes received so far) [sender]
00:02:54.134 rsync error: unexplained error (code 255) at io.c(226) [sender=3.1.1]

auth.log file from remote

Here is auth.log file from that EC2 instance:

Dec  2 17:03:58 ip-172-30-2-45 sshd[1067]: Server listening on 0.0.0.0 port 22.
Dec  2 17:03:58 ip-172-30-2-45 sshd[1067]: Server listening on :: port 22.
Dec  2 17:04:02 ip-172-30-2-45 sshd[1168]: Accepted publickey for ubuntu from 138.201.172.213 port 54072 ssh2: RSA bb:b9:c9:e5:e1:xx:xx:xx:xx:6e:8a:57:78:cd:12:89
Dec  2 17:04:02 ip-172-30-2-45 sshd[1168]: pam_unix(sshd:session): session opened for user ubuntu by (uid=0)
Dec  2 17:04:02 ip-172-30-2-45 systemd-logind[993]: New session 1 of user ubuntu.
Dec  2 17:04:12 ip-172-30-2-45 sudo:   ubuntu : unable to resolve host ip-172-30-2-45
Dec  2 17:04:12 ip-172-30-2-45 sudo:   ubuntu : TTY=pts/0 ; PWD=/home/ubuntu ; USER=root ; COMMAND=/bin/bash -l
Dec  2 17:04:12 ip-172-30-2-45 sudo: pam_unix(sudo:session): session opened for user root by ubuntu(uid=0)
Dec  2 17:04:13 ip-172-30-2-45 sudo: pam_unix(sudo:session): session closed for user root
Dec  2 17:04:14 ip-172-30-2-45 sshd[1563]: Accepted publickey for ubuntu from 138.201.172.213 port 54078 ssh2: RSA bb:b9:c9:e5:e1:xx:xx:xx:xx:6e:8a:57:78:cd:12:89
Dec  2 17:04:14 ip-172-30-2-45 sshd[1563]: pam_unix(sshd:session): session opened for user ubuntu by (uid=0)
Dec  2 17:04:14 ip-172-30-2-45 systemd-logind[993]: New session 2 of user ubuntu.
Dec  2 17:05:14 ip-172-30-2-45 sshd[1168]: pam_unix(sshd:session): session closed for user ubuntu
Dec  2 17:05:56 ip-172-30-2-45 sshd[1563]: pam_unix(sshd:session): session closed for user ubuntu
Dec  2 17:05:56 ip-172-30-2-45 systemd-logind[993]: Removed session 1.
Dec  2 17:12:05 ip-172-30-2-45 sshd[1619]: Accepted publickey for ubuntu from 138.201.172.213 port 54082 ssh2: RSA bb:b9:c9:e5:e1:xx:xx:xx:xx:6e:8a:57:78:cd:12:89
Dec  2 17:12:05 ip-172-30-2-45 sshd[1619]: pam_unix(sshd:session): session opened for user ubuntu by (uid=0)
Dec  2 17:12:05 ip-172-30-2-45 systemd-logind[993]: Removed session 2.
Dec  2 17:12:05 ip-172-30-2-45 systemd-logind[993]: New session 3 of user ubuntu.
Dec  2 17:12:06 ip-172-30-2-45 sudo:   ubuntu : unable to resolve host ip-172-30-2-45
Dec  2 17:12:06 ip-172-30-2-45 sudo:   ubuntu : TTY=unknown ; PWD=/home/ubuntu ; USER=root ; COMMAND=/usr/bin/rsync --server -vlDtprze.iLsfx --delete . /vagrant
Dec  2 17:12:06 ip-172-30-2-45 sudo: pam_unix(sudo:session): session opened for user root by (uid=0)
Dec  2 17:12:07 ip-172-30-2-45 sudo: pam_unix(sudo:session): session closed for user root
Dec  2 17:12:07 ip-172-30-2-45 sshd[1673]: Received disconnect from 138.201.172.213: 11: disconnected by user
Dec  2 17:12:07 ip-172-30-2-45 sshd[1619]: pam_unix(sshd:session): session closed for user ubuntu

sudo: ubuntu : TTY=unknown ; PWD=/home/ubuntu ; USER=root ; COMMAND=/usr/bin/rsync --server -vlDtprze.iLsfx --delete . /vagrant

/etc/sudoers file from remote

#
# This file MUST be edited with the 'visudo' command as root.
#
# Please consider adding local content in /etc/sudoers.d/ instead of
# directly modifying this file.
#
# See the man page for details on how to write a sudoers file.
#
Defaults    env_reset
Defaults    mail_badpass
Defaults    secure_path="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

# Host alias specification

# User alias specification

# Cmnd alias specification

# User privilege specification
root    ALL=(ALL:ALL) ALL

# Members of the admin group may gain root privileges
%admin ALL=(ALL) ALL

# Allow members of group sudo to execute any command
%sudo   ALL=(ALL:ALL) ALL

# See sudoers(5) for more information on "#include" directives:

#includedir /etc/sudoers.d

But it's the same on both working and non-working instances.

SSH debug (-vvv)

Brief:

00:03:42.545  Command: rsync --verbose --verbose --verbose --archive --delete -z --copy-links --no-owner --no-group --rsync-path sudo rsync -e ssh -vvv -p 22 -o StrictHostKeyChecking=no -o IdentitiesOnly=true -o UserKnownHostsFile=/dev/null -i '/var/lib/jenkins/secrets/Foo.pem' --exclude .vagrant/ /var/lib/jenkins/workspace/Test/ ubuntu@184.72.209.209:/vagrant
00:03:42.545  Error: OpenSSH_7.2p2 Ubuntu-4ubuntu1, OpenSSL 1.0.2g-fips  1 Mar 2016
00:03:42.545  debug1: Reading configuration data /var/lib/jenkins/.ssh/config
...
00:03:42.545  debug3: sign_and_send_pubkey: RSA SHA256:XXX/pek
00:03:42.545  debug3: send packet: type 50
00:03:42.545  debug2: we sent a publickey packet, wait for reply
00:03:42.545  debug3: receive packet: type 52
00:03:42.545  debug1: Authentication succeeded (publickey).
00:03:42.545  Authenticated to 184.72.209.209 ([184.72.209.209]:22).
00:03:42.545  debug2: fd 4 setting O_NONBLOCK
00:03:42.545  debug3: fd 5 is O_NONBLOCK
00:03:42.545  debug2: fd 6 setting O_NONBLOCK
00:03:42.545  debug1: channel 0: new [client-session]
00:03:42.545  debug3: ssh_session2_open: channel_new: 0
00:03:42.545  debug2: channel 0: send open
00:03:42.545  debug3: send packet: type 90
00:03:42.545  debug1: Requesting no-more-sessions@openssh.com
00:03:42.545  debug3: send packet: type 80
00:03:42.545  debug1: Entering interactive session.
00:03:42.545  debug1: pledge: network
00:03:42.545  debug3: send packet: type 80
00:03:42.545  debug3: send packet: type 80
00:03:42.545  debug3: send packet: type 80
00:03:42.545  Timeout, server 184.72.209.209 not responding.
00:03:42.545  rsync: connection unexpectedly closed (0 bytes received so far) [sender]
00:03:42.545  rsync error: error in rsync protocol data stream (code 12) at io.c(226) [sender=3.1.1]
00:03:42.583  Build step 'Execute shell' marked build as failure

Based on above it sounds like SSH entered interactive session and has been successfully authenticated, but it got the timeout, which is weird, as it took only 62 seconds to show the error. Possibly increasing the timeout may help?

kenorb commented 7 years ago

Ok, solved the mystery.

The timeout in SSH config was too short, because it was controlled by ServerAliveInterval/ServerAliveInterval in the ~/.ssh/config file. So I had to increase it.

Why one AMI failed over another, because the failing AMI had already some /vagrant folder, so it took longer for rsync to calculate the differences.

Here is the new settings in ~/.ssh/config file on the client side:

Host *
  ServerAliveInterval 60
  ServerAliveCountMax 3

If the timeout is too short, it usually ends with Broken pipe.