ktbyers / netmiko

Multi-vendor library to simplify Paramiko SSH connections to network devices
MIT License
3.61k stars 1.31k forks source link

120 second teardown after command with excessively long output #2744

Closed zohassadar closed 2 years ago

zohassadar commented 2 years ago

The condition is a command that takes an excessively long time to complete while providing a continuous stream of data during execution. In this example, the command is ping 192.168.42.1 size 18024 repeat 2147483647. The read_timeout, adjusted to 2 seconds, works as expected. During the teardown the call to check_config_mode, which in turn calls read_channel_timing, introduces a 120 second delay. This is due to read_channel_timing's default read_timeout value of 120.0.

import os
import time

import netmiko

def send_command_with_timeout_context(command: str, read_timeout: int):
    pre_start = time.time()
    with netmiko.ConnectHandler(
        host="192.168.42.254",
        device_type="cisco_ios",
        username=os.getenv("USERNAME"),
        password=os.getenv("PASSWORD"),
    ) as session:
        try:
            error = None
            start = time.time()
            session.send_command(command, read_timeout=read_timeout)
        except netmiko.ReadTimeout:
            error = 'timeout'
        finally:
            pre_finish = time.time()
    finish = time.time()
    print(f"{command=} {read_timeout=} setup={start-pre_start:.02f} execution={pre_finish-start:.02f} teardown={finish-pre_finish:.02f} {error=}")

send_command_with_timeout_context("ping 192.168.42.1 size 18024 repeat 2147483647", 2)

The result:

command='ping 192.168.42.1 size 18024 repeat 2147483647' read_timeout=2 setup=0.78 execution=2.05 teardown=120.01 error='timeout'

A lower read_timeout value of 2.0 (chosen to match the default value of last_read) when called from check_config_mode would solve this issue, and in normal conditions the results from the carriage return sent shouldn't require more than 2 seconds. Attaching pull request. Willing to adjust if there's more that needs to be taken into consideration.

The result after adjusting the read_timeout:

command='ping 192.168.42.1 size 18024 repeat 2147483647' read_timeout=2 setup=0.73 execution=2.05 teardown=2.00 error='timeout'
ktbyers commented 2 years ago

@zohassadar Is there any broader purpose here besides testing? I.E. I can't really see any reason to do the above besides to create an artificial test case that gums up the CLI with a continuously running command and output (obviously you can unintentionally gum up the CLI by accidentally executing a command that you shouldn't).

So the issue that you are running into is that the CLI command never completes and consequently Netmiko can no longer gracefully exit from the CLI (or more exactly you are causing there to always be output on the screen so Netmiko's read_channel_timing() never thinks it is done reading until the absolute timer expires after 120 seconds).

I will tweak the timeout for that check_config_mode() down, but it will still probably need to be more in the 5 to 10 second range. 2 seconds is probably too aggressive though i.e. some devices will probably fail to properly exit config mode in that time.

zohassadar commented 2 years ago

The purpose is to better understand the behavior of the new read_timeout feature. I need to replace several instances of max_loops and delay_factor at some point, most of which were implemented when I didn't understand what was going on under the hood at all. I was following along with this blog post when I came across the issue of the delay in the context manager exit.

A better example would be a valid command that would complete in a reasonable amount of time if given the chance and leaving everything else default, similar to the show tech example in the blog. With my 3750 running ipbase, I'm not aware of any long running commands, but 5000 large pings could be the stand-in for a large 'show tech', as that takes about 80 seconds to complete on my network.

command='ping 192.168.42.1 size 18024 repeat 5000' read_timeout=20 setup=0.73 execution=20.06 teardown=60.30 error='timeout'

During the 80 seconds that it took to complete, 20 seconds were spent in send_command and the remaining 60 seconds were spent in read_channel_timing due to the teardown. From a beginner user perspective without any idea of what's happening, I would think that read_timeout was broken, when it's working exactly as it's supposed to.

As check_config_mode is called for reasons other than just gracefully shutting down the connection, I can see why 2 seconds wouldn't work for all cases. Another approach may be to distinguish calls to check_config_mode with a flag that indicates that it's being called as part of the cleanup process and not as a precursor to entering config mode. The flag can then be used to specify a much more aggressive timing strategy during cleanup.

zohassadar commented 2 years ago

https://github.com/zohassadar/netmiko/commit/d317f816a68f1f1dec6082bbc511d6e3fe1db29b

Here's an example that uses a flag set by disconnect that read_channel_timing uses to shorten the read_timeout to match the value of last_read.

command='ping 192.168.42.1 size 18024 repeat 5000' read_timeout=20 setup=0.72 execution=20.06 teardown=2.00 error='timeout'

The teardown now takes 2 seconds.

ktbyers commented 2 years ago

Okay, makes sense.

I still would want it to be 5 to 10 seconds in this particular failing case as well as in general.

In other words, you do really want to gracefully exit the device here (and the context is a fairly abnormal one i.e. the CLI is totally messed-up due to previously long running command). In other words, it is much closer to the 1% case than the other 99% of cases where you just check_config_mode() normally and then exit the SSH session.

Yeah, the teardown is an issue and I will tweak the check_config_mode() downward to probably 5 to 10 seconds.

You really shouldn't see the teardown time (in your measurement) unless you are doing your timing measurement wrong though. In other words, proper measurement of the time requires that you place the end time measurement as close as possible to when send_command() has failed which would be inside the finally clause like you did here:

        finally:
            pre_finish = time.time()

I am not sure there is much we can do about that i.e. if someone is measuring read_timeout and does the measurement wrong.

I am not particularly concerned about a 10 second disconnect time (in a rare and odd situation). In other words, reliability in the 99% case matters a lot more.

ktbyers commented 2 years ago

Fixed here:

https://github.com/ktbyers/netmiko/pull/2751