test-kitchen / kitchen-ec2

A Test Kitchen Driver for Amazon EC2
Other
222 stars 203 forks source link

SCP intermittently flaky with Amazon Linux #288

Closed bsamek closed 6 years ago

bsamek commented 7 years ago

We're running kitchen-ec2 as part of our CI workflow. About 3% of the time on Amazon Linux there is an error transferring files after installing Chef. We run on a variety of ec2 variants, including several versions of Debian, RHEL, SUSE, and Ubuntu, and I've only seen this on Amazon.

One thing that stands out to me is how much more quickly Amazon Linux instances boot than other instance types. Is it possible there is something that isn't yet ready about the instance even though ec2-kitchen can SSH into it? Does installing Chef use a different code path (i.e., not Net::SCP) than the transferring files step? If so, is it possible to insert a wait somewhere to give the instance some more time to prep?

[2016/12/14 12:05:38.669]        Waiting for SSH service on ec2-54-198-26-226.compute-1.amazonaws.com:22, retrying in 3 seconds
 [2016/12/14 12:05:42.453]        [SSH] Established
 [2016/12/14 12:05:42.721]        Finished creating <service-amazon> (1m11.67s).
 [2016/12/14 12:05:42.721] -----> Converging <service-amazon>...
 [2016/12/14 12:05:42.722]        Preparing files for transfer
 [2016/12/14 12:05:42.722]        Preparing dna.json
 [2016/12/14 12:05:42.723]        Preparing current project directory as a cookbook
 [2016/12/14 12:05:42.725]        Removing non-cookbook files before transfer
 [2016/12/14 12:05:42.727]        Preparing solo.rb
 [2016/12/14 12:05:42.781] -----> Installing Chef Omnibus (12.6.0)
 [2016/12/14 12:05:42.792]        Downloading https://omnitruck.chef.io/install.sh to file /tmp/install.sh
 [2016/12/14 12:05:42.792]        Trying wget...
 [2016/12/14 12:05:42.936]        Download complete.
 [2016/12/14 12:05:42.989]        el 6 x86_64
 [2016/12/14 12:05:42.989]        Getting information for chef stable 12.6.0 for el...
 [2016/12/14 12:05:42.990]        downloading https://omnitruck.chef.io/stable/chef/metadata?v=12.6.0&p=el&pv=6&m=x86_64
 [2016/12/14 12:05:42.990]          to file /tmp/install.sh.2671/metadata.txt
 [2016/12/14 12:05:42.990]        trying wget...
 [2016/12/14 12:05:43.036]        sha1  c05e78b8c3dc193ffae9d1084367c5f4a2118204
 [2016/12/14 12:05:43.037]        sha256    152e9aeb3c654db19b2e72cc290ec6d92bf6b40e74807c3eb1a0c828f20c99b9
 [2016/12/14 12:05:43.037]        url   https://packages.chef.io/files/stable/chef/12.6.0/el/6/chef-12.6.0-1.el6.x86_64.rpm
 [2016/12/14 12:05:43.037]        version   12.6.0
 [2016/12/14 12:05:43.040]        downloaded metadata file looks valid...
 [2016/12/14 12:05:43.054]        downloading https://packages.chef.io/files/stable/chef/12.6.0/el/6/chef-12.6.0-1.el6.x86_64.rpm
 [2016/12/14 12:05:43.054]          to file /tmp/install.sh.2671/chef-12.6.0-1.el6.x86_64.rpm
 [2016/12/14 12:05:43.055]        trying wget...
 [2016/12/14 12:05:43.804]        Comparing checksum with sha256sum...
 [2016/12/14 12:05:43.982]        Installing chef 12.6.0
 [2016/12/14 12:05:43.982]        installing with rpm...
 [2016/12/14 12:05:44.009]        warning: /tmp/install.sh.2671/chef-12.6.0-1.el6.x86_64.rpm: Header V4 DSA/SHA1 Signature, key ID 83ef826a: NOKEY
 [2016/12/14 12:05:44.134]        Preparing...   
 [2016/12/14 12:05:44.149]        Updating / installing...
 [2016/12/14 12:05:45.535]           1:chef-12.6.0-1.el6             
 [2016/12/14 12:05:47.465]        Thank you for installing Chef!
 [2016/12/14 12:05:47.652]        Transferring files to <service-amazon>
 [2016/12/14 12:05:48.763] /var/lib/gems/2.3.0/gems/net-scp-1.2.1/lib/net/scp.rb:365:in `block (3 levels) in start_command': SCP did not finish successfully ():  (Net::SCP::Error)
 [2016/12/14 12:05:48.763]  from /var/lib/gems/2.3.0/gems/net-ssh-3.2.0/lib/net/ssh/connection/channel.rb:607:in `do_close'
 [2016/12/14 12:05:48.763]  from /var/lib/gems/2.3.0/gems/net-ssh-3.2.0/lib/net/ssh/connection/session.rb:498:in `block in force_channel_cleanup_on_close'
 [2016/12/14 12:05:48.763]  from /var/lib/gems/2.3.0/gems/net-ssh-3.2.0/lib/net/ssh/connection/session.rb:493:in `each'
 [2016/12/14 12:05:48.763]  from /var/lib/gems/2.3.0/gems/net-ssh-3.2.0/lib/net/ssh/connection/session.rb:493:in `force_channel_cleanup_on_close'
 [2016/12/14 12:05:48.763]  from /var/lib/gems/2.3.0/gems/net-ssh-3.2.0/lib/net/ssh/connection/session.rb:214:in `rescue in process'
 [2016/12/14 12:05:48.763]  from /var/lib/gems/2.3.0/gems/net-ssh-3.2.0/lib/net/ssh/connection/session.rb:206:in `process'
 [2016/12/14 12:05:48.763]  from /var/lib/gems/2.3.0/gems/net-ssh-3.2.0/lib/net/ssh/connection/session.rb:170:in `block in loop'
 [2016/12/14 12:05:48.763]  from /var/lib/gems/2.3.0/gems/net-ssh-3.2.0/lib/net/ssh/connection/session.rb:170:in `loop'
 [2016/12/14 12:05:48.764]  from /var/lib/gems/2.3.0/gems/net-ssh-3.2.0/lib/net/ssh/connection/session.rb:170:in `loop'
 [2016/12/14 12:05:48.764]  from /var/lib/gems/2.3.0/gems/net-ssh-3.2.0/lib/net/ssh/connection/session.rb:119:in `close'
 [2016/12/14 12:05:48.764]  from /var/lib/gems/2.3.0/gems/test-kitchen-1.14.2/lib/kitchen/transport/ssh.rb:115:in `close'
 [2016/12/14 12:05:48.764]  from /var/lib/gems/2.3.0/gems/test-kitchen-1.14.2/lib/kitchen/transport/ssh.rb:98:in `cleanup!'
 [2016/12/14 12:05:48.764]  from /var/lib/gems/2.3.0/gems/test-kitchen-1.14.2/lib/kitchen/instance.rb:285:in `cleanup!'
 [2016/12/14 12:05:48.764]  from /var/lib/gems/2.3.0/gems/test-kitchen-1.14.2/lib/kitchen/command.rb:205:in `run_action_in_thread'
 [2016/12/14 12:05:48.764]  from /var/lib/gems/2.3.0/gems/test-kitchen-1.14.2/lib/kitchen/command.rb:169:in `block (2 levels) in run_action'
ianmiell commented 7 years ago

I am seeing this also in an automated build.

adecarolis commented 6 years ago

I have the same issue, the majority of the times. @bsamek were you able to find a solution? Thanks

bsamek commented 6 years ago

@adecarolis - No, I just run it in a loop.

ricoli commented 6 years ago

I simply switched to using https://github.com/unibet/kitchen-transport-rsync as this issue was very annoying.

fletchowns commented 6 years ago

https://github.com/coderanger/kitchen-sync is a convenient workaround for Windows since rsync is kinda annoying to install

cheeseplus commented 6 years ago

We pulled in a fix to instance store backed instances which are most often Amazon linux which should resolve this partially. Going to close this one out for now but happy to re-open a fresh issue with current versions if we need to hunt this one down more.

bsamek commented 6 years ago

@cheeseplus What commit fixes this issue?

cheeseplus commented 6 years ago

There were improvements to instance store backed instances here https://github.com/test-kitchen/kitchen-ec2/commit/fb12ad38b2ba07d6b382de33cf4d91f3f9f0b94d but also a lots of other movement in other deps in the ecosystem.

If this is still reproducible with current versions I'm happy to keep this open but we would needed updated logs, versions, and ideally a repro case

bsamek commented 6 years ago

@cheeseplus I'm cool with closing this for now. Just wanted to see the code. Thanks again!

ricoli commented 6 years ago

I'm afraid I still experience this issue when using kitchen-ec2 v2.2.1 and net-scp 1.2.1 :( Anyone else?

bsamek commented 6 years ago

We have also observed similar behavior outside of test-kitchen. Occasionally our CI system can start a host in EC2 and run an SSH command. Then, some seconds later, when it attempts to SCP a small file onto the host, the SCP command returns host unreachable. I think the most probable explanation is that hosts aren't truly up yet, that there is some moment in the init process when the host is accessible over SSH, but then SSH restarts, for example.