Reconnect on socket.error or EOF

kbirkeland commented 7 years ago

A long running connection to a remote device with exec-timeout configured will cause the socket to go EOF. If a subsequent write to that socket is attempted, paramiko returns socket.error. This PR will reopen a closed socket and re-send the command which was attempted.

coveralls commented 7 years ago

Coverage decreased (-0.1%) to 67.324% when pulling 736b28cbaf6fdb958fcbd03f98100b99d36c7541 on kbirkeland:reconnect into 094a60dd9b5f77a2c677fe1c86b61900ba786fca on napalm-automation:develop.

kbirkeland commented 7 years ago

I considered having a max retry counter, but for something like salt I do want it to try as long as the proxy minion is up. A back off timer could be used, but I don't think attempting to open a connection every time commands are attempted is wasteful enough to warrant one.

Any socket.error will raise a NetMikoTimeoutException in ConnectHandler() (which isn't very descriptive). That exception should be caught by call() in the napalm-salt proxy code and returned with the exception in the output. So, as long as open() doesn't eat the exceptions, I don't think a bug is introduced.

dbarrosop commented 7 years ago

Isn't this something that netmiko's keepalive should solve?

ktbyers commented 7 years ago

@dbarrosop Even with keepalive, you will get some disconnects from time-to-time.

I don't really think this fix should be in NAPALM. It is really Salt specific--shouldn't this be in Salt?

To me the logical behavior is to raise an exception if send_command fails. If an application wants to catch the exception and re-connect, they can.

What have we done in other drivers?

@mirceaulinic

Other side-effects of this that I don't like. You will get less intuitive errors (if it fails on the open() ) operation. You will potentially eat-up available SSH-VTY ports since there is no graceful disconnect and you are reopening a new SSH session.

mirceaulinic commented 7 years ago

Just a clarification:

It is really Salt specific

No, it's NAPALM specific (or, to be more precise, very specific to drivers relying on screen scraping). Salt doesn't know one damn thing about the connection with the network device. Salt works with data and processes, not SSH connections etc. That's why NAPALM exists in Salt; otherwise, it wouldn't be used. As an aside, if you open a Python console, open a connection with an IOS device using napalm-ios and leave it like that for a bit, it will die shortly. Would you consider this Python console specific, in that case?

What have we done in other drivers?

eos & nxos are HTTP based
xr: something similar - https://github.com/fooelisa/pyiosxr/blob/master/pyIOSXR/iosxr.py#L147 - didn't see any downside, but only an improvement in stability and usability. However, that case is more complicated, as the connection is considered unusable also when the XML agent is not responding, although the SSH is still alive. In that case, we need to force a disconnect before reconnecting
junos: the NETCONF connection doesn't die that easily

From what I understand, an IOS will kill the SSH session if there isn't any activity on the channel during the timeout value, despite the fact that the SSH connection itself is alive and usable.

While a solution like @kbirkeland's is not optimal, I don't think it would hurt either. I think it's acceptable, and if there are any complaints or bug reports in the future, we can easily remove it.

You will potentially eat-up available SSH-VTY ports since there is no graceful disconnect and you are reopening a new SSH session.

The connection is already closed at the moment (that's why we try to re-open it).

ktbyers commented 7 years ago

@mirceaulinic Let's see if we can work out a solution that meets what Salt requires and is still logical from a NAPALM perspective (and minimizes bad side effects).

The two potential bad side effects I see are:

Masking failures (which will make general napalm-ios troubleshooting harder).
Using up all of the router VTY ports which could lock people out of their router for 10+ minutes.

Outside of the Salt context...I would consider the logical NAPALM behavior to throw an exception. I don't think the logical behavior is to behind the scenes re-open another SSH connection.

From what I understand, an IOS will kill the SSH session if there isn't any activity
on the channel during the timeout value, despite the fact that the SSH connection
itself is alive and usable.

The keepalive we are setting separately should prevent IOS from purging the SSH session (as there will be activity on the channel). It should also reduce/minimize intermediate firewalls from purging the SSH session.

Once the keepalive is incorporated do we even have enough of an issue that we need this (i.e. is this such a rare event that this is unnecessary)? Can we just start with activating the keepalive and see if we need the change in this PR?

Either that or we can set some flag in optional_args to indicate that someone wants this behavior (i.e. something like auto_reconnect=True). By default napalm-ios would not do this, but you could overwrite the behavior.

mirceaulinic commented 7 years ago

Yes, I agree with your thoughts @ktbyers. Firstly let's see how it works with the keepalive we are introducing now. If this still doesn't help, we can continue investigating. Having an optional argument for activating the session re-establishment sounds good to me.

ktbyers commented 7 years ago

On hold waiting to see if needed (pending addition of keepalive).

kbirkeland commented 7 years ago

Paramiko's transport keepalive is implemented similar to openssh's ServerAliveInterval. In my testing, this doesn't stop exec-timeout from killing a session.

mirceaulinic commented 7 years ago

Alright, so keepalive doesn't help much in that case. @kbirkeland could you please explain your testing scenario? I would like to reproduce and try to understand what happens.

kbirkeland commented 7 years ago

Set exec-timeout to x seconds and set ServerAliveInterval to some value y <= x. Login and wait at least x seconds without user input. The connection will be closed by the remote host.

ktbyers commented 7 years ago

@kbirkeland Good point.

I verified this and it was as @kbirkeland described (i.e. the SSH connection would be closed by router when the keepalive was set at the Paramiko level).

Looking online, it looks like you could send an ASCII null character (0x00) down the channel periodically and this would keep the session alive.

https://www.ccierackrentals.com/ccie-lab-faq/prevent-session-timeout

I verified this worked:

>>> NULL = chr(0)
>>> for i in range(10):
...   print i
...   net_connect.write_channel(NULL)
...   time.sleep(10)
... 
0
1
2
3
4
5
6
7
8
9
>>> 
>>> 
>>> 
>>> net_connect.find_prompt()
u'pynet-rtr1#'

This is with a one-minute SSH disconnect time set on the router (i.e. vty configured for exec-timeout 1 0)

kbirkeland commented 7 years ago

Interesting that NUL works. I don't think thats necessarily a good choice for within napalm-ios or netmiko, because it would need some sort of polling to be checked at regular intervals. Also, considering how many different IOS versions there are, I could see sending NULs triggering some weird bug.

Looking back to napalm-automation/napalm-base#247 it looks like this shouldn't reconnect, but raise a new NAPALM exception instead. I can get to this after the weekend.

ktbyers commented 7 years ago

Yes, agreed, I don't think the NUL should be in Netmiko/NAPALM. This is more of a 'connection manager' function which I think should be outside these libraries.

mirceaulinic commented 7 years ago

Very interesting finding. I would have imagined that paramiko does that already. I would be tempted to ask the paramiko maintainers, I didn't see anything related to the past issues or pull requests.

Meanwhile, would you agree to have a helper method that does only the following:

def anti_idle(self):
    self.write_channel(chr(0))

I closed https://github.com/napalm-automation/napalm-base/pull/247 as we have agreed on the new exception.

ktbyers commented 7 years ago

@mirceaulinic Yes, I don't have an objection at all to having that in either Netmiko and/or NAPALM.

mirceaulinic commented 7 years ago

@ktbyers submitted https://github.com/napalm-automation/napalm-ios/pull/155 for the null byte task. @kbirkeland napalm-base 0.24.0 is relased - you can import now the new exception and raise it when appropriate.

ktbyers commented 7 years ago

@kbirkeland @mirceaulinic I assume this PR can be closed and will be replaced by the is_alive PR?

mirceaulinic commented 7 years ago

No, as discussed and agreed under https://github.com/napalm-automation/napalm-base/pull/247, then introduced in https://github.com/napalm-automation/napalm-base/pull/248, we will be raising ConnectionClosedException when the connection is closed. In other words, we are waiting for @kbirkeland to apply the following changes to this PR:

replace:

except (socket.error, EOFError):
    self.open()
    return self._send_command(command)

with:

except (socket.error, EOFError) as err:
    raise ConnectionClosedException(err)

coveralls commented 7 years ago

Coverage decreased (-0.07%) to 70.1% when pulling 83dffa439d71458bea06fbf9e87d3c826380f17c on kbirkeland:reconnect into 6c8504602de0c82bd0d496cd65034b15d272f322 on napalm-automation:develop.

napalm-automation / napalm-ios

Reconnect on socket.error or EOF #152