thatmattlove / hyperglass

hyperglass is the network looking glass that tries to make the internet better.
https://hyperglass.dev
BSD 3-Clause Clear License
617 stars 93 forks source link

Junos traceroutes occasionally fail #179

Open ggidofalvy-tc opened 2 years ago

ggidofalvy-tc commented 2 years ago

Bug Description

Hey folks,

I've noticed this issue on our LG deployment: traceroutes fail quite often, with the undescriptive 'Something went wrong.' error message appearing, ca. 10-15 seconds after submitting the traceroute request. There's no really that much in the logs, even with debugging turned on, please see below.

I made sure that this is not a connectivity/permissions issue, I tried to connecting to the same router with the username + password directly copied out of the configuration file from the same host, as the hyperglass user; I was successfully able to run the command constructed by Hyperglass.

I've found that this behaviour occurs when trying to traceroute to destinations that might take a long time to run (read: lots of timeouts in the intermediate hops) -- I already tried increasing the request_timeout to 120, but it seems like the query fails much, much earlier than even the default 90s request timeout.

Expected behavior

The result of the traceroute is returned.

Steps to Reproduce

  1. Set up a Junos router
  2. Pick a traceroute destination that might take a long time to traceroute
  3. Submit the query

Local Configurations

I added request_timeout: 120 to hyperglass.yaml as an attempt to fix the problem. Additionally, in my devices.yaml, all Junos devices have netmiko set as their driver instead of scrapli, due to other issues impacting the scrapli driver.

In an attempt to track down this issue, debug: true has also been added to hyperglass.yaml

Logs

Dec 07 18:33:12 lg.xxx.yyy hyperglass[15315]: [DEBUG] 20211207 18:33:12 | hyperglass.models.api.validators:24 | _member_of → Checking membership of XX.XX.XX.XX/32 for 0.0.0.0/0
Dec 07 18:33:12 lg.xxx.yyy hyperglass[15315]: [DEBUG] 20211207 18:33:12 | hyperglass.models.api.validators:31 | _member_of → XX.XX.XX.XX/32 is a member of 0.0.0.0/0
Dec 07 18:33:12 lg.xxx.yyy hyperglass[15315]: [DEBUG] 20211207 18:33:12 | hyperglass.models.api.validators:109 | validate_ip → XX.XX.XX.XX/32 is allowed by access-list AccessList4(network=IPv4Network('0.0.0.0/0'), action='permit', ge=0, le=32)
Dec 07 18:33:12 lg.xxx.yyy hyperglass[15315]: [DEBUG] 20211207 18:33:12 | hyperglass.models.api.validators:131 | validate_ip → Converted 'XX.XX.XX.XX/32' to 'XX.XX.XX.XX' for 'traceroute' query
Dec 07 18:33:12 lg.xxx.yyy hyperglass[15315]: [DEBUG] 20211207 18:33:12 | hyperglass.models.api.validators:180 | validate_ip → Validation passed for XX.XX.XX.XX
Dec 07 18:33:12 lg.xxx.yyy hyperglass[15315]: [DEBUG] 20211207 18:33:12 | hyperglass.api.routes:80 | query → Initialized cache HyperglassCache(db=1, host=localhost, port=6379, password=None)
Dec 07 18:33:12 lg.xxx.yyy hyperglass[15315]: [DEBUG] 20211207 18:33:12 | hyperglass.api.routes:89 | query → Cache Timeout: 120
Dec 07 18:33:12 lg.xxx.yyy hyperglass[15315]: [INFO] 20211207 18:33:12 | hyperglass.api.routes:90 | query → Starting query execution for query Query(query_location=asdasdasdasd, query_type=traceroute, query_vrf=master, query_target=XX.XX.XX.XX)
Dec 07 18:33:12 lg.xxx.yyy hyperglass[15315]: [DEBUG] 20211207 18:33:12 | hyperglass.api.routes:116 | query → No existing cache entry for query b43a845d27d8e0555179232203b18216d695981aa4ecaebfc2742a434b697218
Dec 07 18:33:12 lg.xxx.yyy hyperglass[15315]: [DEBUG] 20211207 18:33:12 | hyperglass.api.routes:118 | query → Created new cache key b43a845d27d8e0555179232203b18216d695981aa4ecaebfc2742a434b697218 entry for query Query(query_location=asdasdasdasd, query_type=traceroute, query_vrf=master, query_target=XX.XX.XX.XX)
Dec 07 18:33:12 lg.xxx.yyy hyperglass[15315]: [DEBUG] 20211207 18:33:12 | hyperglass.execution.main:49 | execute → Received query for {"query_location": "asdasdasdasd", "query_type": "traceroute", "query_vrf": {"name": "master", "display_name": "Global", "info": {"bgp_aspath": {"enable": true, "file": null, "params": {"title": null}}, "bgp_community": {"enable": true, "file": null, "params": {"title": null}}, "bgp_route": {"enable": true, "file": null, "params": {"title": null}}, "ping": {"enable": true, "file": null, "params": {"title": null}}, "traceroute": {"enable": true, "file": null, "params": {"title": null}}}, "ipv4": {"source_address": "Y.Y.Y.Y", "access_list": [{"network": "0.0.0.0/0", "action": "permit", "ge": 0, "le": 32}], "force_cidr": false, "protocol": "ipv4_default", "version": 4}, "ipv6": {"source_address": "Z:Z:Z::Z", "access_list": [{"network": "::/0", "action": "permit", "ge": 0, "le": 128}], "force_cidr": false, "protocol": "ipv6_default", "version": 6}, "default": true}, "query_target": "XX.XX.XX.XX", "timestamp": "2021-12-07 18:33:12"}
Dec 07 18:33:12 lg.xxx.yyy hyperglass[15315]: [DEBUG] 20211207 18:33:12 | hyperglass.execution.main:50 | execute → Matched device config: name='router.xxx.yyy' address='router.xxx.yyy' network=Network(name='production', display_name='AS11111') credential=Credential(username='hyperglass', password=SecretStr('***'), key=None, _method='password') proxy=None display_name=None port=22 ssl=None nos='juniper' commands='juniper' vrfs=[Vrf(name='master', display_name='Global', info=Info(bgp_aspath=InfoConfig(enable=True, file=None, params=InfoConfigParams(title=None)), bgp_community=InfoConfig(enable=True, file=None, params=InfoConfigParams(title=None)), bgp_route=InfoConfig(enable=True, file=None, params=InfoConfigParams(title=None)), ping=InfoConfig(enable=True, file=None, params=InfoConfigParams(title=None)), traceroute=InfoConfig(enable=True, file=None, params=InfoConfigParams(title=None))), ipv4=DeviceVrf4(source_address=IPv4Address('Y.Y.Y.Y'), access_list=[AccessList4(network=IPv4Network('0.0.0.0/0'), action='permit', ge=0, le=32)], force_cidr=False, protocol='ipv4_default', version=4), ipv6=DeviceVrf6(source_address=IPv6Address('Z:Z:Z::Z'), access_list=[AccessList6(network=IPv6Network('::/0'), action='permit', ge=0, le=128)], force_cidr=False, protocol='ipv6_default', version=6), default=True)] display_vrfs=['Global'] vrf_names=['master'] structured_output=True driver='netmiko'
Dec 07 18:33:12 lg.xxx.yyy hyperglass[15315]: [DEBUG] 20211207 18:33:12 | hyperglass.execution.drivers._construct:27 | __init__ → Constructing traceroute query for 'XX.XX.XX.XX'
Dec 07 18:33:12 lg.xxx.yyy hyperglass[15315]: [DEBUG] 20211207 18:33:12 | hyperglass.execution.drivers._construct:111 | queries → Constructed query: ['traceroute inet XX.XX.XX.XX wait 1 source Y.Y.Y.Y']
Dec 07 18:33:12 lg.xxx.yyy hyperglass[15315]: [DEBUG] 20211207 18:33:12 | hyperglass.execution.drivers.ssh_netmiko:57 | collect → Connecting directly to router.xxx.yyy
########## At 18:33:20-30-ish, this is where I received the "Something went wrong." message on the frontend.
Dec 07 18:35:11 lg.xxx.yyy hyperglass[15315]: [CRITICAL] 20211207 18:35:11 | hyperglass.exceptions:40 | __init__ → [DANGER] Error connecting to router.xxx.yyy: Request timed out.
Dec 07 18:35:11 lg.xxx.yyy hyperglass[15315]: Exception in callback Loop._read_from_self
Dec 07 18:35:11 lg.xxx.yyy hyperglass[15315]: handle: <Handle Loop._read_from_self>
Dec 07 18:35:11 lg.xxx.yyy hyperglass[15315]: Traceback (most recent call last):
Dec 07 18:35:11 lg.xxx.yyy hyperglass[15315]:   File "uvloop/cbhandles.pyx", line 73, in uvloop.loop.Handle._run
Dec 07 18:35:11 lg.xxx.yyy hyperglass[15315]:   File "uvloop/loop.pyx", line 359, in uvloop.loop.Loop._read_from_self
Dec 07 18:35:11 lg.xxx.yyy hyperglass[15315]:   File "uvloop/loop.pyx", line 364, in uvloop.loop.Loop._invoke_signals
Dec 07 18:35:11 lg.xxx.yyy hyperglass[15315]:   File "uvloop/loop.pyx", line 339, in uvloop.loop.Loop._ceval_process_signals
Dec 07 18:35:11 lg.xxx.yyy hyperglass[15315]:   File "/opt/hyperglass/.local/lib/python3.7/site-packages/hyperglass/execution/main.py", line 39, in handler
Dec 07 18:35:11 lg.xxx.yyy hyperglass[15315]:     raise DeviceTimeout(**exc_args)
Dec 07 18:35:11 lg.xxx.yyy hyperglass[15315]: hyperglass.exceptions.DeviceTimeout: Error connecting to router.xxx.yyy: Request timed out.

Possible Solution

Environment

Server

Metric Value
hyperglass Version 1.0.4
hyperglass Path /opt/hyperglass/hyperglass
Python Version 3.7.3
Node Version 14.18.2
Platform Info Linux-5.6.0-0.bpo.2-cloud-amd64-x86_64-with-debian-10.11
CPU Info AMD EPYC Processor (with IBPB)
Logical Cores 2
Physical Cores 2
Processor Speed 1.999999GHz
Total Memory 4.13 GB
Memory Utilization 19.5%
Total Disk Space 32.22 GB
Disk Utilization 13.7%

Client

MorningLightMountain713 commented 2 years ago

Netmiko is a poor driver for long running junos commands. You can alter nemiko_delay_factor to make it work, but this is a global variable, and will slow down every other command. We run solely Juniper in our network, so I wrote a pyez driver that uses RPCs instead of scraping. It works will all commands (including traceroute). I host it privately, but if you're interested, can fork this repo and push it there.

ggidofalvy-tc commented 2 years ago

@DavidReubenWhite I'd really appreciate that!

MorningLightMountain713 commented 2 years ago

@ggidofalvy-tc Give this a try (add_pyez_driver branch)

Note - I only use vrf's in our network so haven't tested the commands that don't use vrfs. They should work though. Also we don't use the communities command, untested.

thatmattlove commented 2 years ago

For what it's worth, I've added the capability to add custom Netmiko arguments on a per-device basis in the v2.0.0 branch, via b49b6ea. I don't particularly want to add another driver to hyperglass, especially one that only covers one vendor. At least, not without adding extras to the hyperglass package.

In addition to this feature, I'm also happy to add any Juniper-specific Netmiko arguments as defaults (applied to every Juniper SSH session), if anyone has a magic recipe.

carlmontanari commented 2 years ago

Sorry to issue-jack -- Happened to stumble onto this:

due to other issues impacting the scrapli driver

And did a bit of poking around and saw a prior issue or two that I assume are related. I am 99% sure that current develop branch of scrapli solves the scrapli+junos issues that I saw. TL;DR the scrapli prompt finding was breaking w/ junos because xml sure can look a lot like an exec/normal prompt. Fixed here with a test so this doesn't regress 😁 . Feel free to ignore, just saw it and figured I'd comment!

Carl

thatmattlove commented 3 months ago

Can anyone verify if this is still an issue on v2 and provide details if so?

saint-kid commented 2 months ago

ping and traceroute does not work. with the following error and bgp route works

hyperglass-1 | 2024-06-15 13:32:36.037 | DEBUG | hyperglass.models.directive:membership:111 - Checking target membership hyperglass-1 | 2024-06-15 13:32:36.038 | DEBUG | hyperglass.models.directive:membership:116 - Target membership verified hyperglass-1 | 2024-06-15 13:32:36.038 | DEBUG | hyperglass.models.directive:in_range:123 - Target is in range hyperglass-1 | 2024-06-15 13:32:36.039 | DEBUG | hyperglass.models.api.query:validate_query_target:107 - Validation passed hyperglass-1 | 2024-06-15 13:32:36.039 | INFO | hyperglass.api.routes:query:78 - Starting query execution hyperglass-1 | 2024-06-15 13:32:36.040 | DEBUG | hyperglass.api.routes:query:96 - Cache miss hyperglass-1 | 2024-06-15 13:32:36.043 | DEBUG | hyperglass.execution.main:execute:51 - hyperglass-1 | 2024-06-15 13:32:36.046 | DEBUG | hyperglass.execution.drivers._construct:init:46 - Constructing query hyperglass-1 | 2024-06-15 13:32:36.048 | DEBUG | hyperglass.execution.drivers._construct:queries:129 - Constructed query hyperglass-1 | 2024-06-15 13:32:36.052 | DEBUG | hyperglass.execution.drivers.ssh_netmiko:collect:56 - Connecting to device hyperglass-1 | INFO - 2024-06-15 13:32:36,092 - paramiko.transport - transport - Connected (version 2.0, client OpenSSH_7.5) hyperglass-1 | INFO - 2024-06-15 13:32:36,459 - paramiko.transport - transport - Authentication (password) successful! hyperglass-1 | 2024-06-15 13:32:46.947 | CRITICAL | hyperglass.api.error_handlers:default_handler:48 - Error hyperglass-1 | 2024-06-15 13:32:46.949 | INFO | logging:callHandlers:1762 - 10.10.100.21:52448 - "POST /api/query HTTP/1.1" 500

mattpenn20 commented 2 months ago

@thatmattlove I can confirm that this seems to still be an issue as of v2.0.4

[20240701 16:16:46] Validation passed {'query': Query(query_location='DEVICE', query_target=['8.8.8.8'], query_type='juniper-traceroute-custom')} [20240701 16:16:46] Starting query execution {'query': Query(query_location='DEVICE', query_target=['8.8.8.8'], query_type='juniper-traceroute-custom')} [20240701 16:16:46] Cache miss {'query': Query(query_location='DEVICE', query_target=['8.8.8.8'], query_type='juniper-traceroute-custom'), 'cache_key': 'hyperglass.query.dd24d57262dbf8c8537bd2c6440a9b00dd04023b2107a63f17a0f0d0d2fec8e6'} [20240701 16:16:46] {'query': Query(query_location='DEVICE', query_target=['8.8.8.8'], query_type='juniper-traceroute-custom'), 'device': 'DEVICE'} [20240701 16:16:46] Constructing query {'type': 'juniper-traceroute-custom', 'target': ['8.8.8.8']} [20240701 16:16:46] Constructed query {'type': 'juniper-traceroute-custom', 'target': ['8.8.8.8'], 'constructed_query': ['traceroute inet 8.8.8.8 wait 2 source (SOURCE IP) no-resolve']} [20240701 16:16:46] Connecting to device {'device': 'DEVICE', 'address': 'None:None', 'proxy': None} [20240701 16:16:58] Error {'method': 'POST', 'path': '/api/query', 'detail': "\nPattern not detected: 'hyperglass@DEVICE>' in output.\n\nThings you might try to fix this:\n1. Explicitly set your pattern using the expect_string argument.\n2. Increase the read_timeout to a larger value.\n\nYou can also look at the Netmiko session_log or debug log for more information.\n\n"} [20240701 16:16:58] (DEVICE IP) - "POST /api/query HTTP/1.1" 500 {} [20240701 16:18:15] Request timed out. (Connection timed out) {}

I've created a custom directive to test wait times, and noticed that adding no-resolve helps more traceroutes complete, likely due to the decreased response time on the juniper.

I tested different wait times, but didn't really see any improvements from changing this.

Testing this takes quiet a bit of time due to hyperglass taking about ~2 minutes to startup on docker, so I created a simple python script with netmiko ssh to compare, and found that I see similar issues (i.e similar length traceroutes timing out) when I set the read_timeout to anything less than 5 - which is odd considering, that netmiko seems to have default timeout of 10.

I attempted to use driver_config in device yaml settings to add a read_timeout of 10 but got the error: Error {'method': 'POST', 'path': '/api/query', 'detail': "BaseConnection.init() got an unexpected keyword argument 'read_timeout'"}

juniper-traceroute-custom: name: Traceroute rules:

mattpenn20 commented 2 months ago

Another quick update on this, I tested setting request_timeout to 1200, and it still fails on the same traceroutes. I am currently testing on the longest traceroute I can find ~27 hops, though I have seen it fail on as few as 8 hops. It takes the exact same time to fail with a high request_timeout vs. the default 90, with the time between connection initiated and the netmiko error being pretty much exactly 12 seconds each time.

astlaurent commented 1 month ago

Another quick update on this, I tested setting request_timeout to 1200, and it still fails on the same traceroutes. I am currently testing on the longest traceroute I can find ~27 hops, though I have seen it fail on as few as 8 hops. It takes the exact same time to fail with a high request_timeout vs. the default 90, with the time between connection initiated and the netmiko error being pretty much exactly 12 seconds each time.

Hi I am seeing the same issue also on XR as well as Juniper. I attempted to adjust the read_timeout without much luck with this in the driver py file send_args['read_timeout'] = 120 You have any luck finding anything that works

astlaurent commented 1 month ago

FYI I figured out that the read_timeout does work fine but it has to be edited inside the docker. You can refer to my issue https://github.com/thatmattlove/hyperglass/issues/274