matschaffer / knife-solo

DEPRECATED: Please consider using https://knife-zero.github.io/, ansible, or visit https://www.chef.io/ for other ideas
MIT License
786 stars 213 forks source link

fix: try to rescue Net::SSH::Disconnect #536

Open azrle opened 5 years ago

azrle commented 5 years ago

Hi @matschaffer,

I sometimes encounter Net::SSH::Disconnect when trying to run chef-solo on slow machines.

/opt/chef/embedded/lib/ruby/gems/2.3.0/gems/net-ssh-3.2.0/lib/net/ssh/transport/packet_stream.rb:89:in `next_packet': connection closed by remote host (Net::SSH::Disconnect)
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/net-ssh-3.2.0/lib/net/ssh/transport/session.rb:193:in `block in poll_message'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/net-ssh-3.2.0/lib/net/ssh/transport/session.rb:188:in `loop'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/net-ssh-3.2.0/lib/net/ssh/transport/session.rb:188:in `poll_message'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/net-ssh-3.2.0/lib/net/ssh/connection/session.rb:474:in `dispatch_incoming_packets'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/net-ssh-3.2.0/lib/net/ssh/connection/session.rb:225:in `preprocess'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/net-ssh-3.2.0/lib/net/ssh/connection/session.rb:206:in `process'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/net-ssh-3.2.0/lib/net/ssh/connection/session.rb:170:in `block in loop'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/net-ssh-3.2.0/lib/net/ssh/connection/session.rb:170:in `loop'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/net-ssh-3.2.0/lib/net/ssh/connection/session.rb:170:in `loop'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/net-ssh-3.2.0/lib/net/ssh/connection/channel.rb:269:in `wait'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/knife-solo-0.7.0.pre3/lib/knife-solo/ssh_connection.rb:81:in `run_command'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/knife-solo-0.7.0.pre3/lib/knife-solo/ssh_command.rb:300:in `run_command'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/knife-solo-0.7.0.pre3/lib/knife-solo/ssh_command.rb:277:in `stream_command'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/knife-solo-0.7.0.pre3/lib/chef/knife/solo_cook.rb:338:in `cook'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/knife-solo-0.7.0.pre3/lib/chef/knife/solo_cook.rb:108:in `block in run'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/knife-solo-0.7.0.pre3/lib/chef/knife/solo_cook.rb:233:in `time'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/knife-solo-0.7.0.pre3/lib/chef/knife/solo_cook.rb:88:in `run'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/chef-12.18.31/lib/chef/knife.rb:443:in `block in run_with_pretty_exceptions'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/chef-12.18.31/lib/chef/local_mode.rb:44:in `with_server_connectivity'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/chef-12.18.31/lib/chef/knife.rb:442:in `run_with_pretty_exceptions'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/chef-12.18.31/lib/chef/knife.rb:219:in `run'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/chef-12.18.31/lib/chef/application/knife.rb:156:in `run'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/chef-12.18.31/bin/knife:25:in `<top (required)>'
        from /opt/chef/embedded/bin/knife:22:in `load'
        from /opt/chef/embedded/bin/knife:22:in `<main>'

After some investigation, I found that the server indeed closed the connection since it has been idle for a while. The reason why the connection becomes idle is simply that the machine (chef-solo runs on) is slow and sometimes it takes time to do works. Hence, I try to set the ssh_keepalive and ssh_keepalive_interval to a reasonable value in order to send keepalive message when the connection is idle. However it does not work as expected.

As the investigation goes, I found that keepalive and keepalive_interval only work for the case that you are running a slow command on the remote or you do Net::SSH::Session::loop because net-ssh is based on event and so is the keepalive of net-ssh. I am not sure if there is any way to keep the connection alive and reuse the connection efficiently.

However, it may be good to rescue from Net::SSH::Disconnect and retry the command. The caveat is that a command might be executed on the remote more than once.

Not sure are these issues are related to this one. https://github.com/matschaffer/knife-solo/issues/272 https://github.com/matschaffer/knife-solo/issues/332

coveralls commented 5 years ago

Coverage Status

Coverage decreased (-0.2%) to 89.995% when pulling 9f1af2ae927da70cffb67875574cf4d11ecfb301 on azrle:fix/net_ssh_disconnect into ce8c2237af961394059351260471f6ef417967ba on matschaffer:master.

coveralls commented 5 years ago

Coverage Status

Coverage decreased (-0.2%) to 89.995% when pulling 9f1af2ae927da70cffb67875574cf4d11ecfb301 on azrle:fix/net_ssh_disconnect into ce8c2237af961394059351260471f6ef417967ba on matschaffer:master.

matschaffer commented 5 years ago

Have you checked if chef exits on disconnect? If not, seems like retry could be problematic

azrle commented 5 years ago

In my case, it does not exist. The run_command tried to run chef on a closed SSH session and generated Net::SSH::Disconnect. Hence, the chef did not even start.

Checking Chef version...
Resolving cookbook dependencies...
Uploading the kitchen...
Generating solo config...
ERROR: Net::SSH::Disconnect: connection closed by remote hostRunning Chef: sudo chef-solo -c /var/tmp/chef-solo_xxx/solo.rb -j /var/tmp/chef-solo_xxx/dna.jso
n -W

However, risk exists as Net::SSH::Disconnect may happen under other conditions.

matschaffer commented 5 years ago

That's surprising to see that part of the run take so long.

Anyway if it helps your situation I'd be fine with a configurable retry. But seems risky as a default.

azrle commented 5 years ago

It also surprises me too. I am going to conduct another investigation why it takes some much time. Maybe resolving dependencies part and rsync part.

Considering this error should not be a common case, and this fix is not a proper fix, I will try to add an option to support this workaround. Does it sound good to you?

matschaffer commented 5 years ago

Yeah I don't see an issue if it's configurable and off by default.

On Sun, Dec 23, 2018 at 23:40 Xuanzhong Wei notifications@github.com wrote:

It also surprises me too. I am going to conduct another investigation why it takes some much time. Maybe resolving dependencies part and rsync part.

Considering this error should not be a common case, and this fix is not a proper fix, I will try to add an option to support this workaround. Does it sound good to you?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/matschaffer/knife-solo/pull/536#issuecomment-449640634, or mute the thread https://github.com/notifications/unsubscribe-auth/AAACsjMd0VGgrzOs_9yVY0juTECYF07zks5u75XOgaJpZM4ZftGB .

--

-Mat

matschaffer.com