netenglabs / suzieq

Using network observability to operate and design healthier networks
https://www.stardustsystems.net/
Apache License 2.0

make it easier to know what service crashed in sqpoller #177

Open jopietsch opened 4 years ago

jopietsch commented 4 years ago

Is your feature request related to a problem? Please describe.

If a service fails, it's hard to know which service crashed.

This is what we see now; the traceback doesn't say which service failed:

root@c8d842401bf3:/suzieq# sq-poller -D /suzieq/inventory 
Traceback (most recent call last):
  File "/root/.local/lib/python3.7/site-packages/suzieq/poller/sq-poller", line 191, in <module>
    asyncio.run(start_poller(userargs, cfg))
  File "/usr/local/lib/python3.7/asyncio/runners.py", line 43, in run
    return loop.run_until_complete(main)
  File "uvloop/loop.pyx", line 1456, in uvloop.loop.Loop.run_until_complete
  File "/root/.local/lib/python3.7/site-packages/suzieq/poller/sq-poller", line 131, in start_poller
    await asyncio.gather(*tasks)
  File "/root/.local/lib/python3.7/site-packages/suzieq/poller/services/service.py", line 578, in run
    result = self.process_data(output)
  File "/root/.local/lib/python3.7/site-packages/suzieq/poller/services/service.py", line 350, in process_data
    tmpres = self._process_each_output(i, item)
  File "/root/.local/lib/python3.7/site-packages/suzieq/poller/services/service.py", line 311, in _process_each_output
    norm_str, in_info)
  File "/root/.local/lib/python3.7/site-packages/suzieq/poller/services/svcparser.py", line 386, in cons_recs_from_json_template
    subele = subele.get(subfld)
AttributeError: 'str' object has no attribute 'get'
root@c8d842401bf3:/suzieq# 

Finding the failing service from this output requires going into the debugger, which is too much to ask of users. We need to come up with better error reporting. In this case it was the bgp service that was crashing.
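One possible direction, as a rough sketch only: wrap each service coroutine so that any exception is logged and re-raised with the service name attached before it reaches `asyncio.gather`. This is not the actual poller code. `ServiceError`, `run_named`, `start_poller_sketch`, and the two stand-in services are hypothetical names made up for illustration; the bgp stand-in just reproduces the same `AttributeError` type seen in the traceback above.

```python
import asyncio
import logging

logger = logging.getLogger("poller-sketch")


class ServiceError(Exception):
    """Hypothetical wrapper that records which service raised the error."""

    def __init__(self, service_name, original):
        super().__init__(f"service '{service_name}' failed: {original!r}")
        self.service_name = service_name
        self.original = original


async def run_named(service_name, coro):
    """Await a service coroutine and tag any exception with its name."""
    try:
        return await coro
    except Exception as exc:
        # Log the full traceback together with the service name.
        logger.error("service %s crashed", service_name, exc_info=True)
        raise ServiceError(service_name, exc) from exc


async def bgp_service():
    # Stand-in for a parser bug: calling .get() on a str, as in the traceback.
    subele = "not-a-dict"
    return subele.get("field")


async def arpnd_service():
    # Stand-in for a service that completes normally.
    await asyncio.sleep(0)
    return "ok"


async def start_poller_sketch():
    services = {"arpnd": arpnd_service(), "bgp": bgp_service()}
    tasks = [run_named(name, coro) for name, coro in services.items()]
    # return_exceptions=True keeps one failure from hiding the other results.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return [r.service_name for r in results if isinstance(r, ServiceError)]


if __name__ == "__main__":
    print("failed services:", asyncio.run(start_poller_sketch()))
```

With something like this, the log line and the exception message both carry the service name, so a user would not need the debugger to see that bgp was the one failing. Whether the poller should keep running the other services or exit after logging is a separate design choice.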

jopietsch commented 4 years ago

root> sqpoller show
    namespace             hostname     service  status gatherTime totalTime svcQsize wrQsize nodeQsize  pollExcdPeriodCount               timestamp
2   NX-OS_DC1  leaf101-N93180YC-EX       arpnd       0         []        []       []      []        []                    0 2020-06-19 10:52:41.794
4   NX-OS_DC1  leaf101-N93180YC-EX          fs     404         []        []       []      []        []                    0 2020-06-19 10:51:23.316
7   NX-OS_DC1  leaf102-N93180YC-EX       arpnd       0         []        []       []      []        []                    0 2020-06-19 10:51:33.048
10  NX-OS_DC1  leaf102-N93180YC-EX          fs     404         []        []       []      []        []                    0 2020-06-19 10:53:39.974
16  NX-OS_DC1  leaf106-N9348GC-FXP       arpnd       0         []        []       []      []        []                    0 2020-06-19 10:53:39.907
23  NX-OS_DC1  leaf106-N9348GC-FXP          fs     404         []        []       []      []        []                    0 2020-06-19 10:51:52.697
27  NX-OS_DC1  leaf107-N9348GC-FXP       arpnd       0         []        []       []      []        []                    0 2020-06-19 10:52:41.895
33  NX-OS_DC1  leaf107-N9348GC-FXP          fs     404         []        []       []      []        []                    0 2020-06-19 10:51:52.785
37  NX-OS_DC2  leaf103-N93108TC-EX       arpnd       0         []        []       []      []        []                    0 2020-06-19 10:51:33.048
38  NX-OS_DC2  leaf103-N93108TC-EX         bgp       0         []        []       []      []        []                    0 2020-06-19 10:51:32.989
39  NX-OS_DC2  leaf103-N93108TC-EX          fs     404         []        []       []      []        []                    0 2020-06-19 10:51:33.045
46  NX-OS_DC2  leaf104-N93108TC-EX       arpnd       0         []        []       []      []        []                    0 2020-06-19 10:52:02.887
50  NX-OS_DC2  leaf104-N93108TC-EX          fs     404         []        []       []      []        []                    0 2020-06-19 10:52:26.287
52  NX-OS_DC2  leaf104-N93108TC-EX  ifCounters     404         []        []       []      []        []                    0 2020-06-19 10:53:05.113
54  NX-OS_DC2  leaf105-N93108TC-EX       arpnd       0         []        []       []      []        []                    0 2020-06-19 10:51:33.195
64  NX-OS_DC2  leaf105-N93108TC-EX         bgp      16         []        []       []      []        []                    0 2020-06-19 10:51:52.252
74  NX-OS_DC2  leaf105-N93108TC-EX     evpnVni      16         []        []       []      []        []                    0 2020-06-19 10:52:13.775
84  NX-OS_DC2  leaf105-N93108TC-EX          fs     404         []        []       []      []        []                    0 2020-06-19 10:52:41.420
root> sqpoller show status=fail                                                                                                                                                                              
    namespace             hostname     service  status gatherTime totalTime svcQsize wrQsize nodeQsize  pollExcdPeriodCount               timestamp
4   NX-OS_DC1  leaf101-N93180YC-EX          fs     404         []        []       []      []        []                    0 2020-06-19 10:51:23.316
10  NX-OS_DC1  leaf102-N93180YC-EX          fs     404         []        []       []      []        []                    0 2020-06-19 10:53:39.974
23  NX-OS_DC1  leaf106-N9348GC-FXP          fs     404         []        []       []      []        []                    0 2020-06-19 10:51:52.697
33  NX-OS_DC1  leaf107-N9348GC-FXP          fs     404         []        []       []      []        []                    0 2020-06-19 10:51:52.785
39  NX-OS_DC2  leaf103-N93108TC-EX          fs     404         []        []       []      []        []                    0 2020-06-19 10:51:33.045
50  NX-OS_DC2  leaf104-N93108TC-EX          fs     404         []        []       []      []        []                    0 2020-06-19 10:52:26.287
52  NX-OS_DC2  leaf104-N93108TC-EX  ifCounters     404         []        []       []      []        []                    0 2020-06-19 10:53:05.113
64  NX-OS_DC2  leaf105-N93108TC-EX         bgp      16         []        []       []      []        []                    0 2020-06-19 10:51:52.252
74  NX-OS_DC2  leaf105-N93108TC-EX     evpnVni      16         []        []       []      []        []                    0 2020-06-19 10:52:13.775
84  NX-OS_DC2  leaf105-N93108TC-EX          fs     404         []        []       []      []        []                    0 2020-06-19 10:52:41.420

What does status code 16 mean for bgp and evpnVni? Would that have been the right indicator that bgp was failing?