napalm-automation / napalm-ios

Apache License 2.0
31 stars 40 forks source link

Multiple parallel requests not handled correctly #119

Closed MatthiasGabriel closed 7 years ago

MatthiasGabriel commented 7 years ago

Description of Issue/Question

It seems that napalm-ios doesn't handle multiple parallel requests correctly, while other drivers like junos, iosxr or eos are capable of doing this.

Did you follow the steps from https://github.com/napalm-automation/napalm#faq

Setup

/etc/salt/proxy

######         Thread settings        #####
###########################################
# Disable multiprocessing support, by default when a minion receives a
# publication a new process is spawned and the command is executed therein.
multiprocessing: False

napalm-ios version

(Paste verbatim output from pip freeze | grep napalm-ios between quotes below)

root@ubuntu:/home/test# pip freeze | grep napalm-io
napalm-ios==0.6.1
napalm-iosxr==0.4.5

IOS version

(Paste verbatim output from show version between quotes below)

csr1000v#show version
Cisco IOS XE Software, Version 03.15.00.S - Standard Support Release
Cisco IOS Software, CSR1000V Software (X86_64_LINUX_IOSD-UNIVERSALK9-M), Version 15.5(2)S, RELEASE SOFTWARE (fc3)
Technical Support: http://www.cisco.com/techsupport
Copyright (c) 1986-2015 by Cisco Systems, Inc.
Compiled Sun 22-Mar-15 01:36 by mcpre

ROM: IOS-XE ROMMON

Steps to Reproduce the Issue

When I execute multiple salt commands simultaneously they sometimes / most of the time disturb each other. It's possible that you can't reproduce this Issue with your first try. Also the given result is not always the same. Sometimes they just don't deliver the expected output, sometimes there is a traceback.

To reproduce just execute two tasks in parallel e.g.

 salt cisco.csr.1000v net.facts
 salt cisco.csr.1000v net.facts

Error Traceback

(Paste the complete traceback of the exception between quotes below) Sadly the "Error Traceback" is not always the same. I think it depends on the timing... I've collected some of the recent tracebacks and results I've got. If you need more just ask :)

root@ubuntu:/etc/salt# salt cisco.csr.1000v net.traceroute 192.168.16.200
cisco.csr.1000v:
    ----------
    comment:
    out:
        ----------
        success:
            ----------
    result:
        True
[root@ubuntu:/etc/salt# salt cisco.csr.1000v net.traceroute 192.168.16.200
cisco.csr.1000v:
    ----------
    comment:
        Cannot execute "traceroute" on 192.168.16.128 as sshuser. Reason: Search pattern never detected in send_command_expect: csr1000v\#!
    out:
        ----------
    result:
        False
    traceback:
        Traceback (most recent call last):
          File "/usr/lib/python2.7/dist-packages/salt/proxy/napalm.py", line 294, in call
            out = getattr(NETWORK_DEVICE.get('DRIVER'), method)(**params)  # calls the method with the specified parameters
          File "/usr/local/lib/python2.7/dist-packages/napalm_ios/ios.py", line 1768, in traceroute
            output = self.device.send_command(command, max_loops=max_loops)
          File "/usr/local/lib/python2.7/dist-packages/netmiko/base_connection.py", line 770, in send_command
            search_pattern))
        IOError: Search pattern never detected in send_command_expect: csr1000v\#](url)

root@ubuntu:/etc/salt# salt cisco.csr.1000v net.facts

cisco.csr.1000v:
    ----------
    comment:
        Cannot execute "get_facts" on 192.168.16.128 as sshuser. Reason: Search pattern never detected in send_command_expect: csr1000v\#!
    out:
        ----------
    result:
        False
    traceback:
        Traceback (most recent call last):
          File "/usr/lib/python2.7/dist-packages/salt/proxy/napalm.py", line 294, in call
            out = getattr(NETWORK_DEVICE.get('DRIVER'), method)(**params)  # calls the method with the specified parameters
          File "/usr/local/lib/python2.7/dist-packages/napalm_ios/ios.py", line 720, in get_facts
            show_hosts = self._send_command('show hosts')
          File "/usr/local/lib/python2.7/dist-packages/napalm_ios/ios.py", line 141, in _send_command
            output = self.device.send_command(command)
          File "/usr/local/lib/python2.7/dist-packages/netmiko/base_connection.py", line 770, in send_command
            search_pattern))
        IOError: Search pattern never detected in send_command_expect: csr1000v\#
root@ubuntu:/etc/salt# salt cisco.csr.1000v net.facts
cisco.csr.1000v:
    ----------
    comment:
        Cannot execute "get_facts" on 192.168.16.128 as sshuser. Reason: Search pattern never detected in send_command_expect: show\ version!
    out:
        ----------
    result:
        False
    traceback:
        Traceback (most recent call last):
          File "/usr/lib/python2.7/dist-packages/salt/proxy/napalm.py", line 294, in call
            out = getattr(NETWORK_DEVICE.get('DRIVER'), method)(**params)  # calls the method with the specified parameters
          File "/usr/local/lib/python2.7/dist-packages/napalm_ios/ios.py", line 719, in get_facts
            show_ver = self._send_command('show version')
          File "/usr/local/lib/python2.7/dist-packages/napalm_ios/ios.py", line 141, in _send_command
            output = self.device.send_command(command)
          File "/usr/local/lib/python2.7/dist-packages/netmiko/base_connection.py", line 770, in send_command
            search_pattern))
        IOError: Search pattern never detected in send_command_expect: show\ version
root@ubuntu:/etc/salt#` salt cisco.csr.1000v net.facts
cisco.csr.1000v:
    ----------
    comment:
        Cannot execute "get_facts" on 192.168.16.128 as sshuser. Reason: local variable 'domain_name' referenced before assignment!
    out:
        ----------
    result:
        False
    traceback:
        Traceback (most recent call last):
          File "/usr/lib/python2.7/dist-packages/salt/proxy/napalm.py", line 294, in call
            out = getattr(NETWORK_DEVICE.get('DRIVER'), method)(**params)  # calls the method with the specified parameters
          File "/usr/local/lib/python2.7/dist-packages/napalm_ios/ios.py", line 751, in get_facts
            if domain_name != 'Unknown' and hostname != 'Unknown':
        UnboundLocalError: local variable 'domain_name' referenced before assignment
mirceaulinic commented 7 years ago

Hi @MatthiasGabriel - thanks for reporting this!

I've had the same issues with IOS-XR in the past. As all requests are executed over the same session, the commands will overlap. All we can do with IOS/IOS-XE is screen scarping, so we don't have any mechanism similar to NETCONF able to assign an unique request ID and then identify.

The solution I found in XR is a simple and straightforward mutex lock-unlock mechanism. Doing so, each request is queued and served when the channel is "free" again. The whole code necessary is in https://github.com/fooelisa/pyiosxr/blob/master/pyIOSXR/iosxr.py#L189-L199. Ideally we should have had a way to uniquely identify each request, but the device unfortunately does not provide any feature for this. So, it won't be the most optimal solution, but I can say that it really did the job for XR, at least.

So I will ask @ktbyers what he thinks about that.

Cheers, Mircea

ktbyers commented 7 years ago

Yes, that should be fine.

I assume the user would just manually acquire the lock (i.e. the default would be no locking).

mirceaulinic commented 7 years ago

The configuration lock will not be touched in that case - it's just a naming coincidence. All it does is sort of "virtual" lock per command, so two or more requests over the same SSH session won't overlap. So it will preserve the current behaviour. I have an IOS device to test next week and I can give it a try and see.

dbarrosop commented 7 years ago

@mirceaulinic is that because salt is reusing the same ssh connection? Calling getters over different ssh connections should be fine.

mirceaulinic commented 7 years ago

Yes, that's correct @dbarrosop - they are using the same SSH connection.

dbarrosop commented 7 years ago

Disclaimer: Now I am talking out loud mostly out of curiosity and ignorance.

Could the minion just fork itself for each "set of queries"? On minion creation you just save the data necessary to create a new connection and everytime salt sends a query to the minion you just fork a process, create a new instance of the object, execute the query/queries and close it. This would allow you to run parallel queries to the same device (I don't know if minions have some limitations that wouldn't allow you to do this). This combined with "ssh control connections" should give some nice extra functionality without the extra penalty of opening a new connection.

mirceaulinic commented 7 years ago

Unfortunately no, thats not how proxy minions work today. They've been designed to keep the connection always alive, to be similar to the regular minions. However, the main difference is that regular minions don't use SSH as transport, so our case is slightly different. Anyway, going full parallel should not be a limitation IMHO and we should address that.

mirceaulinic commented 7 years ago

Hi @MatthiasGabriel - netmiko 1.3.0 has been released and includes the changes from https://github.com/ktbyers/netmiko/pull/407. Can you upgrade and confirm it works now? For the moment I am going to close this issue, but feel free to reopen if still not fixed.