Paramiko exception when running backup job on multiple Cisco XR devices

jonathondamidco commented 1 month ago

Environment

Python version: 3.11.9
Nautobot version: 2.3.1
nautobot-golden-config version: 2.1.1

Expected Behavior

Backup job should back up the config for any number of Cisco IOS XR devices. If a single-device run on the backup job succeeds, it should succeed when run against a group of devices.

Observed Behavior

I am able to run the Golden Config backup job against single IOS XR devices without issues. Two devices also seem to work fine. When I run the backup job against a group of devices (either by specifying each of the devices in the Devices field or by using another filter like Role), a few of them succeed, but the majority fail with this error:

Failed with a timeout issue. `A paramiko SSHException occurred during connection creation: Error reading SSH protocol banner[Errno 104] Connection reset by peer`

Ran it again on only one of the devices that failed, and that job succeeded.

In an example where I ran this against a role with 18 devices, 7 succeeded and 11 failed with this error. At two devices, they both succeeded. At three, one failed and two succeeded. With four devices, two failed and two succeeded. At six devices, one failed and five succeeded. The number of devices that succeed and fail does not seem to be consistent other than one or two devices almost always succeeding.

I also have a number of Cisco IOS and IOS XE devices. I have not encountered this issue when running against any number of XE devices.

Steps to Reproduce

1. Set up the device platform:
   - Name: cisco_xr
   - Manufacturer: Cisco
   - Network driver: cisco_xr
   - NAPALM driver: iosxr
2. Set up at least 15 Cisco IOS XR devices in Nautobot and assign them to a role.
3. Run the Golden Config backup job on the role.
4. Observe the output.
5. Repeat the Golden Config backup job on a single device that failed during the first job run.

Exported log of 18 devices where a number of them failed: nautobot_joblogentry_data (5).csv

Exported log of a single device that previously failed, but now succeeds: nautobot_joblogentry_data (6).csv

jonathondamidco commented 1 month ago

In the above log exports, ASRC-02-SITE5 is the device I re-ran the config backup job on individually. That device, along with several others, failed during the group backup job, but it succeeded when the job was run against it alone. That should rule out a device-related issue, since Nautobot can pull the config backup when the job is run against an individual device.

itdependsnetworks commented 1 month ago

This is probably best for Slack; sign up here if you are not already there: http://slack.networktocode.com/.

I do understand your troubleshooting perspective: Nautobot against a group fails, Nautobot against a single device succeeds, so the only thing that has changed is in Nautobot. But there is simply more to it. Here are examples from real-life troubleshooting where that was proven not to be the case:

- Running against a group of devices causes a lot of TACACS traffic, and the devices are slow to respond, which causes issues.
- Running against a group of devices coincides with packet overload on some key network infrastructure, and packets drop, which causes the timeout.
- An undersized worker can't handle the resources required for a group run vs. a single job run.

My gut would say: increase the timeout for NAPALM and/or try Netmiko.
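A minimal sketch of where such settings might go, assuming the `nautobot_plugin_nornir` section of `PLUGINS_CONFIG` in `nautobot_config.py` is what feeds connection options and the Nornir runner to the backup job. The numbers are illustrative only, and whether a longer timeout helps when the device actively resets the connection is not established in this thread. Lowering `num_workers` is included as one way to test whether the failures depend on how many sessions are opened in parallel.

```python
# nautobot_config.py (fragment). Merge these keys into your existing
# PLUGINS_CONFIG rather than replacing it; the values are assumptions
# chosen for illustration, not recommendations.
PLUGINS_CONFIG = {
    "nautobot_plugin_nornir": {
        "nornir_settings": {
            "runner": {
                "plugin": "threaded",
                # Fewer parallel connections per job run.
                "options": {"num_workers": 5},
            },
        },
        "connection_options": {
            # Extras are handed to the NAPALM driver constructor.
            "napalm": {"extras": {"timeout": 120}},
            # Extras are handed to Netmiko's ConnectHandler.
            "netmiko": {"extras": {"conn_timeout": 30, "banner_timeout": 30}},
        },
    },
}
```

If a run with a small `num_workers` value succeeds on the full role, that would point toward concurrency rather than a per-device problem.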

We are here to help, but we are also trying to set expectations about what communication usually happens on GitHub vs. Slack, and to be clear about how certain it is that there is an actual bug vs. settings for your unique environment. I will close this out tomorrow unless it is shown more conclusively to be an issue with Nautobot Golden Config.

jonathondamidco commented 1 month ago

Here's the Slack thread where I have brought it up in the past: https://networktocode.slack.com/archives/C01NWPK6WHL/p1723652618173629

I had previously thought it had to do with XR version or SSH kex algorithms, but I don't think that's the case anymore. The host running Nautobot can SSH to the devices without issue, and single-device job runs succeed just fine.
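One way to check the parallel case outside of Nautobot is to read the SSH banner from several devices at the same time. The following is a rough sketch using only the Python standard library; the function names and the example hostnames are hypothetical, and it only tests the TCP connection and banner exchange, not authentication.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def read_ssh_banner(host, port=22, timeout=10.0):
    """Open a TCP connection and return the first data the server sends."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            data = sock.recv(256)
        return data.decode("ascii", errors="replace").strip() or "<empty banner>"
    except OSError as exc:
        # Covers timeouts, refused connections, and resets by the peer.
        return f"ERROR: {exc}"


def probe_all(targets, max_workers=None):
    """Probe every (host, port) pair concurrently; return (target, result) pairs."""
    targets = list(targets)
    if not targets:
        return []
    with ThreadPoolExecutor(max_workers=max_workers or len(targets)) as pool:
        results = list(pool.map(lambda t: read_ssh_banner(*t), targets))
    return list(zip(targets, results))


# Example usage (hostnames are placeholders):
# for target, result in probe_all([("asr-a.example.net", 22), ("asr-b.example.net", 22)]):
#     print(target, result)
```

If the banners come back cleanly for all devices in parallel from the Nautobot worker host, the resets are more likely tied to something later in the session setup.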

I will look into the TACACS idea. If this were an issue, I would expect it to also occur when I run the backup job against IOS XE devices.

I don't see timeout as one of the optional_args in the NAPALM docs: https://napalm.readthedocs.io/en/latest/support/index.html - can you let me know what I should set (and where) in order to test it out?

I'm also not certain how to switch it out for Netmiko if you can point me to any documentation on that.

This is something I've brought up to NTC, and I had a meeting with a solutions architect who advised me to open this Issue after I showed him the behavior. I'm still not completely ruling out my environment as the cause here, but I haven't been able to determine the root of the issue. All I really know is that it works for XE devices just fine, but not for XR devices.

itdependsnetworks commented 1 month ago

> I had previously thought it had to do with XR version or SSH kex algorithms, but I don't think that's the case anymore. The host running Nautobot can SSH to the devices without issue, and single-device job runs succeed just fine.

That would seem to be the case. I would start a new thread; I just read through the prior one, and it seems like we made good progress.

> I'm also not certain how to switch it out for Netmiko if you can point me to any documentation on that.

In that thread it was linked, and the thread shows that you tried {"cisco_xr": "netmiko"} at the time.
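For reference, a sketch of where a mapping like that could live, assuming the Golden Config 2.x framework settings in `PLUGINS_CONFIG`. The key names `default_framework` and `get_config_framework` are an assumption here and should be checked against the Golden Config documentation for the installed version.

```python
# nautobot_config.py (fragment). Key names are assumed from the Golden Config
# 2.x settings; verify them for your version before relying on this.
PLUGINS_CONFIG = {
    "nautobot_golden_config": {
        # Framework used when nothing more specific is configured.
        "default_framework": {"all": "napalm"},
        # Use Netmiko only for retrieving configs from cisco_xr devices.
        "get_config_framework": {"cisco_xr": "netmiko"},
    },
}
```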

> solutions architect who advised me to open this Issue after I showed him the behavior

That would generally be our default without knowing if it is a bug or not. This seems to not be a bug, but I reserve the right to be wrong :)
