thatmattlove / hyperglass

hyperglass is the network looking glass that tries to make the internet better.
https://hyperglass.dev
BSD 3-Clause Clear License
645 stars 102 forks source link

BGP AS Path Timeouts - Juniper #156

Open pmawsonau opened 3 years ago

pmawsonau commented 3 years ago

When performing a BGP AS Path (regex) lookup that results in a large number of results, connections to a Juniper router times out waiting for response and no output is displayed.

[DEBUG] 20210807 17:56:08 | hyperglass.api.routes:80 | query ‚Üí Initialized cache HyperglassCache(db=1, host=localhost, port=6379, password=None) [DEBUG] 20210807 17:56:08 | hyperglass.api.routes:89 | query ‚Üí Cache Timeout: 120 [INFO] 20210807 17:56:08 | hyperglass.api.routes:90 | query ‚Üí Starting query execution for query Query(query_location=melbourne, query_type=bgp_aspath, query_vrf=global, query_target=1 1221) [DEBUG] 20210807 17:56:08 | hyperglass.api.routes:116 | query ‚Üí No existing cache entry for query ad3ad05add3c7d0e89dc8e264c7eae75fb6e443b69373c5f2f03fcaa35359797 [DEBUG] 20210807 17:56:08 | hyperglass.api.routes:118 | query ‚Üí Created new cache key ad3ad05add3c7d0e89dc8e264c7eae75fb6e443b69373c5f2f03fcaa35359797 entry for query Query(query_locati ion=melbourne, query_type=bgp_aspath, query_vrf=global, query_target=1221) [DEBUG] 20210807 17:56:08 | hyperglass.execution.main:49 | execute ‚Üí Received query for {"query_location": "melbourne", "query_type": "bgp_aspath", "query_vrf": {"name": "global", "disp play_name": "Global", "info": {"bgp_aspath": {"enable": true, "file": null, "params": {"title": null}}, "bgp_community": {"enable": true, "file": null, "params": {"title": null}}, "bgp_route": {"enable": true, "file": null, "params": {"title": null}}, "ping": {"enable": true, "file": null, "params": {"title": null}}, "traceroute": {"enable": true, "file": null, "params": {"title": null}}}, "ipv4": {"source_address": "203.134.x.x", "access_list": [{"network": "0.0.0.0/0", "action": "permit", "ge": 0, "le": 32}], "force_cidr": true, "protocol": "ipv4_default", "version": 4}, "ipv6": {"source_address": "2403:4800::xx", "access_list": [{"network": "::/0", "action": "permit", "ge": 0, "le": 128}], "force_cidr": true, "protocol": "ipv6_default", "version": 6}, "default": true}, "query_target": "1221", "timestamp": "2021-08-07 07:56:08"} [DEBUG] 20210807 17:56:08 | hyperglass.execution.main:50 | execute ‚Üí Matched device config: name='Melbourne' address=IPv4Address('203.134.x.x') network=Network(name='production', disp play_name='AS9443') credential=Credential(username='svcLookingGlass', password=SecretStr('**'), key=None, _method='password') proxy=None display_name=None port=22 ssl=None nos='juniper' commands='juniper' vrfs=[Vrf(name='global', display_name='Global', info=Info(bgp_aspath=InfoConfig(enable=True, file=None, params=InfoConfigParams(title=None)), bgp_community=InfoConfig(enable=True, file=None, params=InfoConfigParams(title=None)), bgp_route=InfoConfig(enable=True, file=None, params=InfoConfigParams(title=None)), ping=InfoConfig(enable=True, file=None, params=InfoConfigParams(title=None)), traceroute=InfoConfig(enable=True, file=None, params=InfoConfigParams(title=None))), ipv4=DeviceVrf4(source_address=IPv4Address('203.134.x.x'), access_list=[AccessList4(network=IPv4Network('0.0.0.0/0'), action='permit', ge=0, le=32)], force_cidr=True, protocol='ipv4_default', version=4), ipv6=DeviceVrf6(source_address=IPv6Address('2403:4800::xx'), access_list=[AccessList6(network=IPv6Network('::/0'), action='permit', ge=0, le=128)], force_cidr=True, protocol='ipv6_default', version=6), default=True)] display_vrfs=['Global'] vrf_names=['global'] structured_output=True driver='scrapli' [DEBUG] 20210807 17:56:08 | hyperglass.execution.drivers._construct:27 | init ‚Üí Constructing bgp_aspath query for '1221' [DEBUG] 20210807 17:56:08 | hyperglass.execution.drivers._construct:166 | _juniper_bgp_aspath ‚Üí Modified target '1221' to '. 1221 .' [DEBUG] 20210807 17:56:08 | hyperglass.execution.drivers._construct:111 | queries ‚Üí Constructed query: ['show route protocol bgp table inet.0 aspath-regex ". 1221 ." detail | display xml', 'show route protocol bgp table inet6.0 aspath-regex ". 1221 ." detail | display xml'] [DEBUG] 20210807 17:56:08 | hyperglass.execution.drivers.ssh_scrapli:83 | collect ‚Üí Connecting directly to Melbourne [INFO] 20210807 17:56:08 | scrapli.driver.base.base_driver:930 | _pre_open_closing_log ‚Üí opening connection to '203.134.x.x' on port '22' [INFO] 20210807 17:56:08 | scrapli.driver.network.base_driver:312 | _process_acquire_priv ‚Üí attempting to acquire 'exec' privilege level [DEBUG] 20210807 17:56:08 | scrapli.driver.network.base_driver:133 | _determine_current_priv ‚Üí determined current privilege level is one of '['exec']' [DEBUG] 20210807 17:56:08 | scrapli.driver.network.base_driver:322 | _process_acquire_priv ‚Üí determined current privilege level is target privilege level, no action needed [INFO] 20210807 17:56:08 | scrapli.driver.base.base_driver:949 | _post_open_closing_log ‚Üí connection to '203.134.x.x' on port '22' opened successfully [CRITICAL] 20210807 17:56:37 | hyperglass.exceptions:40 | init ‚Üí [DANGER] Error connecting to Melbourne: Request timed out.

If running a test on an AS with minimal routes, it works as expected

xml', 'show route protocol bgp table inet6.0 aspath-regex ". 38285 ." detail | display xml'] [DEBUG] 20210807 17:58:40 | hyperglass.execution.drivers.ssh_scrapli:83 | collect ‚Üí Connecting directly to Melbourne [INFO] 20210807 17:58:40 | scrapli.driver.base.base_driver:930 | _pre_open_closing_log ‚Üí opening connection to '114.31.206.79' on port '22' [INFO] 20210807 17:58:41 | scrapli.driver.network.base_driver:312 | _process_acquire_priv ‚Üí attempting to acquire 'exec' privilege level [DEBUG] 20210807 17:58:41 | scrapli.driver.network.base_driver:133 | _determine_current_priv ‚Üí determined current privilege level is one of '['exec']' [DEBUG] 20210807 17:58:41 | scrapli.driver.network.base_driver:322 | _process_acquire_priv ‚Üí determined current privilege level is target privilege level, no action needed [INFO] 20210807 17:58:41 | scrapli.driver.base.base_driver:949 | _post_open_closing_log ‚Üí connection to '203.134.x.x' on port '22' opened successfully .... [DEBUG] 20210807 17:58:53 | hyperglass.api.routes:134 | query ‚Üí Query 39ffea92ebbb75bb01e3a8900f1dc50576a9d3dcc67f5d3f94aa4a6adb55e1cd took 13.1272 seconds to run. [DEBUG] 20210807 17:58:53 | hyperglass.api.routes:148 | query ‚Üí Added cache entry for query: 39ffea92ebbb75bb01e3a8900f1dc50576a9d3dcc67f5d3f94aa4a6adb55e1cd [DEBUG] 20210807 17:58:53 | hyperglass.api.routes:159 | query ‚Üí Cache match for 39ffea92ebbb75bb01e3a8900f1dc50576a9d3dcc67f5d3f94aa4a6adb55e1cd:

jakeb91 commented 3 years ago

I'm going to +1 this issue, having similar issues with Arista EOS output. It seems that basically any query that takes a 'long' time to return any data via SSH will have this problem, EOS is particularly bad for this at the moment (community / as-path queries on them are just slow in general). I've noticed this also affects traceroutes that are slow to resolve / complete too.

It would be good if we had a way of adjusting this timeout manually as it doesn't seem to be affected by the query timeout option available in hyperglass.yaml . You can get around it by setting netmiko_delay_factor to an arbitrarily large value but it'd be good to only apply that delay factor to commands that need it (I'm also not sure if a similar option is exposed for scrapli).

Example below (have tested this command directly and can confirm it is working, the output just takes a long time to be returned). It's important to note that the actual timeout setting is being respected in the logs below, but the UI is throwing a 'Something went wrong' error about 10-15 seconds after attempting the query at 22:39:54.

Aug 8 22:39:54 app03 hyperglass[3703797]: [DEBUG] 20210808 22:39:54 | hyperglass.execution.drivers._construct:27 | init → Constructing bgp_community query for '64512:10503' Aug 8 22:39:54 app03 hyperglass[3703797]: [DEBUG] 20210808 22:39:54 | hyperglass.execution.drivers._construct:111 | queries → Constructed query: ['show ip bgp community 64512:10503 vrf public | json', 'show ipv6 bgp community 64512:10503 vrf public | json'] Aug 8 22:39:54 app03 hyperglass[3703797]: [DEBUG] 20210808 22:39:54 | hyperglass.execution.drivers.ssh_netmiko:57 | collect → Connecting directly to PER01 - Perth, WA Aug 8 22:41:53 app03 hyperglass[3703797]: Exception in callback Loop._read_from_self Aug 8 22:41:53 app03 hyperglass[3703797]: handle: Aug 8 22:41:53 app03 hyperglass[3703797]: Traceback (most recent call last): Aug 8 22:41:53 app03 hyperglass[3703797]: File "uvloop/cbhandles.pyx", line 73, in uvloop.loop.Handle._run Aug 8 22:41:53 app03 hyperglass[3703797]: File "uvloop/loop.pyx", line 359, in uvloop.loop.Loop._read_from_self Aug 8 22:41:53 app03 hyperglass[3703797]: File "uvloop/loop.pyx", line 364, in uvloop.loop.Loop._invoke_signals Aug 8 22:41:53 app03 hyperglass[3703797]: File "uvloop/loop.pyx", line 339, in uvloop.loop.Loop._ceval_process_signals Aug 8 22:41:53 app03 hyperglass[3703797]: File "/usr/local/lib/python3.6/site-packages/hyperglass/execution/main.py", line 39, in handler Aug 8 22:41:53 app03 hyperglass[3703797]: raise DeviceTimeout(**exc_args) Aug 8 22:41:53 app03 hyperglass[3703797]: hyperglass.exceptions.DeviceTimeout: Error connecting to PER01 - Perth, WA: Request timed out. Aug 8 22:41:53 app03 hyperglass[3703797]: [CRITICAL] 20210808 22:41:53 | hyperglass.exceptions:40 | init → [DANGER] Error connecting to PER01 - Perth, WA: Request timed out.

thatmattlove commented 3 years ago

Does adjusting request_timeout affect the outcome of this at all? I'm trying to figure out if you're saying the timeout occurs while getting the output from the router, or after. If before, we may just need to increase the default timeout. It'd be nice if there was a way to paginate the request, i.e., show route table inet.0 <dst> | limit 100 or something like that, but that doesn't seem to exist.

@jakeb91 - it's possible we can do customizable per-command timeouts as part of #110

jakeb91 commented 3 years ago

@thatmattlove In my case - the official word from Arista is that it's a limitation of the way their BGP daemon works (the whole BGP table in the queried VRF needs to be enumerated to determine which neighbor each route belongs to when doing any kind of community or as-path based filtering in a query - this cannot be worked around / bypassed). I think the per-command timeout will be a handy feature but, in this case these queries can take 10+ minutes to resolve and I can't imagine a user waiting that long anyway - I may just disable them until the issue is fixed in EOS.

I can confirm in my case that request_timeout doesn't affect the nature of the problem - only the netmiko_delay_factor does.