napalm-automation / napalm

Network Automation and Programmability Abstraction Layer with Multivendor support
Apache License 2.0

Salt Proxy minion drops the connection with XRv 9000 after 1 minute #579

Closed adampav closed 4 years ago

adampav commented 6 years ago

Hello, I noticed this problem while experimenting on an XRv 9000. After approximately one minute of normal operation, the proxy minion is no longer able to interact with the device. I tested the same setup on an IOS-XE device and did not see the problem. I attach some DEBUG log lines:

2017-12-02 13:28:44,563 [salt.utils.lazy ][DEBUG ][20954] LazyLoaded status.proxy_reconnect
2017-12-02 13:28:44,564 [netmiko ][DEBUG ][20954] Sending the NULL byte
2017-12-02 13:28:44,564 [netmiko ][DEBUG ][20954] write_channel:
2017-12-02 13:28:44,564 [/usr/lib/python2.7/dist-packages/salt/proxy/napalm.pyc ][DEBUG ][20954] Is xrv1 still alive? Yes.

As soon as the lines above appear, I am no longer able to interact with the device.

2017-12-02 13:28:53,163 [salt.minion ][INFO ][20954] Starting a new job with PID 20954
2017-12-02 13:28:53,364 [netmiko ][DEBUG ][20954] read_channel:
2017-12-02 13:28:53,365 [netmiko ][DEBUG ][20954] write_channel: <?xml version="1.0" encoding="UTF-8"?>show ver
2017-12-02 13:28:53,365 [netmiko ][DEBUG ][20954] read_channel:
2017-12-02 13:28:53,566 [netmiko ][DEBUG ][20954] read_channel:
2017-12-02 13:28:53,766 [netmiko ][DEBUG ][20954] read_channel:

Workaround, as suggested by @mirceaulinic: I have set the always_alive flag to false.

mirceaulinic commented 6 years ago

That's unfortunate. I've noticed the same some time ago, but I forgot to submit the report.

Apparently our is_alive method does more harm than good: besides checking the state of the SSH connection, we also send the NULL byte, which destroys the connection: https://github.com/ktbyers/netmiko/blob/master/netmiko/base_connection.py#L248 Even though this is already tracked under https://github.com/ktbyers/netmiko/issues/568, I believe we should remove this for IOS-XR, as it seems very sensitive to it (other NAPALM platforms don't seem to be affected, or at least not that I'm aware of). CC @ktbyers

@adampav You can indeed set always_alive: false for the NAPALM Proxy, or the global proxy_keep_alive: false option: https://docs.saltstack.com/en/develop/ref/configuration/proxy.html#std:conf_proxy-proxy_keep_alive. The difference is that always_alive: false instructs the proxy not to attempt to keep the session alive at all times, whereas proxy_keep_alive: false opens the connection, which then stays up until the network device drops it.

ktbyers commented 6 years ago

@mirceaulinic Can you expand on this? So the Null byte causes the IOS-XR to get in a messed-up state?

Is it because we are in an XML agent context?

Seems strange...but we can definitely do something different (or give an option to is_alive to actively test the connection versus just querying paramiko).

I thought this was what Secure CRT SSH session keepalive did (send a null byte). I will have to look into that again.
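The option suggested above for is_alive could look roughly like the sketch below. It is only an illustration under stated assumptions: the standalone is_alive function, its active parameter and the Fake* classes are hypothetical and not part of netmiko. The one thing assumed about the real library is that a connection exposes write_channel() and a paramiko transport with is_active().

```python
import socket


class FakeTransport:
    """Hypothetical stand-in for a paramiko Transport."""

    def __init__(self, active=True):
        self._active = active

    def is_active(self):
        return self._active


class FakeChannel:
    """Hypothetical stand-in for a paramiko Channel."""

    def __init__(self, transport):
        self.transport = transport


class FakeConnection:
    """Hypothetical stand-in for a netmiko connection object."""

    def __init__(self, active=True):
        self.remote_conn = FakeChannel(FakeTransport(active))
        self.written = []  # records everything written to the channel

    def write_channel(self, data):
        self.written.append(data)


def is_alive(conn, active=False):
    """Report whether the connection looks alive.

    active=False: only query the transport state and send nothing.
    active=True:  write a NULL byte first, then query the transport.
    """
    if conn.remote_conn is None:
        return False
    try:
        if active:
            conn.write_channel(chr(0))
        return conn.remote_conn.transport.is_active()
    except (socket.error, EOFError):
        return False
```

With active=False nothing is written to the channel, so a device in the XML agent context would never see the NULL byte. The trade-off is that a purely passive check may not notice a peer that has silently gone away.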

adampav commented 6 years ago

@mirceaulinic The proxy_keep_alive option seems better, since it allows for a quick flurry of operations. I can always increase the SSH timeouts on the XRv. Many thanks again.

mirceaulinic commented 6 years ago

Sorry for the late reply @ktbyers:

> @mirceaulinic Can you expand on this? So the Null byte causes the IOS-XR to get in a messed-up state?

Yes, but I still don't know why.

> Is it because we are in an XML agent context?

This is what I suspect.

I will need to investigate this closer to understand what's actually going on there and why. Thanks!

ktbyers commented 6 years ago

@mirceaulinic No worries...just let me know what you find.

noobcoderT commented 6 years ago

Hello, thanks @mirceaulinic and @ktbyers for providing these nice libs. I am now using Salt with NAPALM to manage a Cisco router running IOS, and I use telnet to connect to this device. I have the same problem. If I use the always_alive option and do nothing for 1 minute, I lose the connection. If I set it to false, contacting my device becomes very slow. So, what should I do? Could you please give me some advice?

mirceaulinic commented 6 years ago

Hi @ktbyers I had a closer look into this, and the NULL byte definitely breaks the connection in the XML context:

>>> i.open()
>>> w = i.get_arp_table()
>>> i.is_alive()
{u'is_alive': True}
>>> i.is_alive()
{u'is_alive': True}
>>> i.is_alive()
{u'is_alive': True}
>>> i.is_alive()
{u'is_alive': True}
>>> i.is_alive()
{u'is_alive': True}
>>> w = i.get_arp_table()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/state/home/mircea/venvs/cf-napalm/local/lib/python2.7/site-packages/napalm/iosxr/iosxr.py", line 1114, in get_arp_table
    result_tree = ETREE.fromstring(self.device.make_rpc_call(rpc_command))
  File "/state/home/mircea/venvs/cf-napalm/local/lib/python2.7/site-packages/pyIOSXR/iosxr.py", line 151, in make_rpc_call
    result = self._execute_rpc(rpc_command)
  File "/state/home/mircea/venvs/cf-napalm/local/lib/python2.7/site-packages/pyIOSXR/iosxr.py", line 365, in _execute_rpc
    response = self._send_command(xml_rpc_command, delay_factor=delay_factor)
  File "/state/home/mircea/venvs/cf-napalm/local/lib/python2.7/site-packages/pyIOSXR/iosxr.py", line 342, in _send_command
    if not self._timeout_exceeded(start=start):
  File "/state/home/mircea/venvs/cf-napalm/local/lib/python2.7/site-packages/pyIOSXR/iosxr.py", line 190, in _timeout_exceeded
    raise TimeoutError(msg, self)
pyIOSXR.exceptions.TimeoutError: Timeout exceeded!

So is_alive always returns True (it is able to send the NULL byte, and the underlying netmiko layer doesn't fail), although the connection is no longer usable.

Logs:

ss><Status>StatusResolutionRequest</Status><ClientID>0</ClientID><EntryState>0</EntryState><ResolutionRequestCount>1227636</ResolutionRequestCount></Entry></a
DEBUG:netmiko:read_channel:
DEBUG:netmiko:read_channel: rpEntry></ResolutionHistoryDynamic><ResolutionHistoryClient><arpEntry/></ResolutionHistoryClient></Node></NodeTable></ARP></Operational></Get><ResultSummary ErrorCount="0"/></Response>
XML>
DEBUG:netmiko:Sending the NULL byte
DEBUG:netmiko:write_channel:
DEBUG:netmiko:Sending the NULL byte
DEBUG:netmiko:write_channel:
DEBUG:netmiko:Sending the NULL byte
DEBUG:netmiko:write_channel:
DEBUG:netmiko:Sending the NULL byte
DEBUG:netmiko:write_channel:
DEBUG:netmiko:Sending the NULL byte
DEBUG:netmiko:write_channel:
DEBUG:netmiko:read_channel:
DEBUG:netmiko:write_channel: <?xml version="1.0" encoding="UTF-8"?><Request MajorVersion="1" MinorVersion="0"><Get><Operational><ARP></ARP></Operational></Get></Request>

DEBUG:netmiko:read_channel:
DEBUG:netmiko:read_channel:
DEBUG:netmiko:read_channel:
DEBUG:netmiko:read_channel:
DEBUG:netmiko:read_channel:
DEBUG:netmiko:read_channel:
DEBUG:netmiko:read_channel:
...
~~~ many other read_channel ~~~
...
DEBUG:netmiko:read_channel:
DEBUG:netmiko:write_channel:

DEBUG:netmiko:read_channel:
DEBUG:netmiko:read_channel:

I am not sure what we should send instead, or if there's anything we can send at all. What about '\n'?

mirceaulinic commented 6 years ago

> If I use the always_alive option and do nothing for 1 minute, I lose the connection. If I set it to false, contacting my device becomes very slow.

@noobcoderT If you turn off the always_alive option, Salt no longer attempts to keep the connection alive, so it starts a new SSH connection for each command you execute. The slowdown you see is therefore expected, as establishing a connection is fairly expensive.

noobcoderT commented 6 years ago

Thanks for your reply @mirceaulinic. I'm trying to use the salt.proxy.napalm module from the Salt Python API. I started a proxy daemon for a network device, and salt.proxy.napalm.alive(opts) returned True even before I called init(), while get_device() returned an empty dictionary. I have read the docs, but I don't know how to get the __proxy__ variable.

__proxy__['napalm.call']('cli',
                         **{
                            'commands': [
                                'show version',
                                'show chassis fan'
                            ]
                         })

Now I am confused about how to use this proxy when it is started as a daemon rather than from a Python script.

mirceaulinic commented 6 years ago

Hi @noobcoderT, you don't need to use the __proxy__ object. It is documented only for developers who may extend these capabilities, not for users.

What are you trying to do, more specifically? To invoke arbitrary NAPALM methods, you can use the napalm.call execution function: https://docs.saltstack.com/en/latest/ref/modules/all/salt.modules.napalm.html#salt.modules.napalm.call. In general, the public NAPALM methods are available in the existing execution modules, see https://docs.saltstack.com/en/develop/topics/network_automation/index.html#napalm, so you can execute, e.g., bgp.neighbors, bgp.config or net.arp and so on. Is this what you meant?

ktbyers commented 6 years ago

@mirceaulinic Responding to your comment, have you tried to use '\n' or does that break the XML Agent also?

noobcoderT commented 6 years ago

Hi @mirceaulinic, I want to use this napalm module in a Python script rather than from the Salt CLI. I want to use the Salt LocalClient to handle all proxy minions that are running as daemon services, but I don't know all their IDs. So I want to know if there is a way to get the proxy objects and then operate on them. Thanks!

mirceaulinic commented 6 years ago

@ktbyers

> Responding to your comment, have you tried to use '\n' or does that break the XML Agent also?

From what I noticed, it doesn't seem to break anything. It actually doesn't do anything either (i.e., it doesn't move to the next line or display the prompt again).

mirceaulinic commented 6 years ago

@noobcoderT:

> I want to use this napalm module in a Python script rather than from the Salt CLI. I want to use the Salt LocalClient to handle all proxy minions that are running as daemon services, but I don't know all their IDs.

If your Proxy processes are already running, it's pretty easy:

>>> import salt.client
>>> client = salt.client.get_local_client('/etc/salt/master')
>>> ret = client.cmd(tgt, fun, arg, timeout, tgt_type, ret, jid, kwarg, **kwargs)

The arguments you can send to the cmd function are documented at https://docs.saltstack.com/en/latest/ref/clients/#salt.client.LocalClient.cmd, e.g.,

>>> ret = client.cmd('device1', 'test.ping')
>>> ret
{'device1': True}
>>> ret = client.cmd('device* and G@os:junos and G@model:MX960', 'probes.results', tgt_type='compound')
>>> ret = client.cmd('juniper-routers', 'net.lldp', tgt_type='nodegroup')

But this requires your Proxy processes to be already started. You can equally write a Python script without pre-starting them, but that's slightly more complicated, as you'll basically need to do the Proxy startup yourself: a lighter version of this section https://github.com/saltstack/salt/blob/v2017.7.2/salt/minion.py#L3105-L3174, without starting Beacons or the Scheduler.
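For the case where the Proxy processes are already running and their IDs are not known in advance, a rough sketch could look like this. The helper names responding_minions and run_cli and the FakeClient stand-in are hypothetical. The sketch assumes that cmd returns a dictionary keyed by minion ID and that napalm.call forwards the commands keyword to the NAPALM cli method, as the __proxy__ snippet earlier in the thread suggests. Note that the '*' target matches every minion known to the master, not only proxy minions.

```python
def responding_minions(client, tgt="*", timeout=5):
    """Return the sorted IDs of the minions that answer test.ping."""
    ret = client.cmd(tgt, "test.ping", timeout=timeout)
    return sorted(minion_id for minion_id, ok in ret.items() if ok is True)


def run_cli(client, tgt, commands):
    """Run CLI commands on the target through the napalm.call function."""
    return client.cmd(tgt, "napalm.call", ["cli"], kwarg={"commands": commands})


class FakeClient:
    """Hypothetical stand-in so the sketch runs without a Salt master."""

    def __init__(self):
        self.calls = []  # records every cmd() invocation

    def cmd(self, tgt, fun, arg=(), timeout=None, kwarg=None):
        self.calls.append((tgt, fun, list(arg), kwarg))
        if fun == "test.ping":
            return {"device1": True, "device2": True}
        return {tgt: {}}


if __name__ == "__main__":
    # On a real master, replace FakeClient() with:
    #   import salt.client
    #   client = salt.client.get_local_client('/etc/salt/master')
    client = FakeClient()
    for minion_id in responding_minions(client):
        print(minion_id)
```

Both helpers take the client as an argument, so the same code can be exercised against the stand-in first and against a real LocalClient afterwards.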

noobcoderT commented 6 years ago

@mirceaulinic That's great, really big help. Now I know what I should do. Thanks a lot!

ktbyers commented 6 years ago

@mirceaulinic I wonder if it breaks if we are not in XML Agent context (especially the null-byte). I guess we could always enter/exit out of XML Agent if null-byte works in normal SSH session.

i.e. just do a little wrapper that could check for, enter, exit XML Agent.

It would make things slower though...
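The wrapper idea could be sketched as a context manager. Everything here is hypothetical: FakeXRDevice, its in_xml_agent flag and send() method are stand-ins, and the "exit" and "xml" commands used to leave and re-enter the XML agent are assumptions. Whether the NULL byte is actually harmless outside the XML agent is still the open question in this thread.

```python
from contextlib import contextmanager


class FakeXRDevice:
    """Hypothetical device handle that records what would be sent."""

    def __init__(self, in_xml_agent=True):
        self.in_xml_agent = in_xml_agent
        self.sent = []

    def send(self, data):
        self.sent.append(data)
        if data == "exit":   # assumed command to leave the XML agent
            self.in_xml_agent = False
        elif data == "xml":  # assumed command to enter the XML agent
            self.in_xml_agent = True


@contextmanager
def outside_xml_agent(device):
    """Temporarily leave the XML agent and restore it afterwards."""
    was_in_xml = device.in_xml_agent
    if was_in_xml:
        device.send("exit")
    try:
        yield device
    finally:
        if was_in_xml:
            device.send("xml")


def keepalive(device):
    """Send the NULL byte from the normal CLI context only."""
    with outside_xml_agent(device):
        device.send(chr(0))
```

Each keepalive costs two extra round trips when the device is in the XML agent, which is the slowdown mentioned above.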