sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC

[chassis] route_check fails on LC due to timeout on frr routes #18773

Open anamehra opened 4 months ago

anamehra commented 4 months ago

Description

On chassis, after the introduction of the FRR route check in route_check.py (https://github.com/sonic-net/sonic-utilities/pull/2762), route_check.py may take more than 2 minutes to finish. The current timeout is 2 minutes, so the route check is aborted and reported as failed, which affects monit output. This in turn breaks the sonic-mgmt pretest check, and other test cases that rely on monit output may also be affected.

root@sfd-t2-lc0:/home/cisco# time route_check.py
Aborting routeCheck.py upon timeout signal after 120 seconds
[<FrameSummary file /usr/local/bin/route_check.py, line 810 in <module>>, <FrameSummary file /usr/local/bin/route_check.py, line 797 in main>, <FrameSummary file /usr/local/bin/route_check.py, line 745 in check_routes>, <FrameSummary file /usr/local/bin/route_check.py, line 537 in check_frr_pending_routes>, <FrameSummary file /usr/local/bin/route_check.py, line 345 in get_frr_routes>, <FrameSummary file /usr/lib/python3.9/subprocess.py, line 424 in check_output>, <FrameSummary file /usr/lib/python3.9/subprocess.py, line 507 in run>, <FrameSummary file /usr/lib/python3.9/subprocess.py, line 1121 in communicate>, <FrameSummary file /usr/local/bin/route_check.py, line 95 in handler>]
Traceback (most recent call last):                                                                                                                                                                                                                            
  File "/usr/local/bin/route_check.py", line 810, in <module>                                                                                                                                                                                                 
    sys.exit(main()[0])                                                                                                                                                                                                                                       
  File "/usr/local/bin/route_check.py", line 797, in main                                                                                                                                                                                                     
    ret, res= check_routes()                                                                                                                                                                                                                                  
  File "/usr/local/bin/route_check.py", line 745, in check_routes                                                                                                                                                                                             
    rt_frr_miss = check_frr_pending_routes()                                                                                                                                                                                                                  
  File "/usr/local/bin/route_check.py", line 537, in check_frr_pending_routes                                                                                                                                                                                 
    frr_routes = get_frr_routes()                                                                                                                                                                                                                             
  File "/usr/local/bin/route_check.py", line 345, in get_frr_routes                                                                                                                                                                                           
    output = subprocess.check_output('show ipv6 route json', shell=True)                                                                                                                                                                                      
  File "/usr/lib/python3.9/subprocess.py", line 424, in check_output                                                                                                                                                                                          
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,                                                                                                                                                                                          
  File "/usr/lib/python3.9/subprocess.py", line 507, in run                                                                                                                                                                                                   
    stdout, stderr = process.communicate(input, timeout=timeout)                                                                                                                                                                                              
  File "/usr/lib/python3.9/subprocess.py", line 1121, in communicate                                                                                                                                                                                          
    stdout = self.stdout.read()                                                                                                                                                                                                                               
  File "/usr/local/bin/route_check.py", line 96, in handler                                                                                                                                                                                                   
    raise Exception("timeout occurred")                                                                                                                                                                                                                       
Exception: timeout occurred                                                                                                                                                                                                                                   

real    2m0.714s                                                                                                                                                                                                                                              
user    0m57.700s                                                                                                                                                                                                                                             
sys     0m2.939s 

This issue was opened earlier for 202305, where it was fixed by reverting the FRR route check feature: https://github.com/sonic-net/sonic-buildimage/issues/17403

This needs to be fixed for master.
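
For readers unfamiliar with the failure mode in the traceback above, here is a minimal sketch of a SIGALRM-based deadline wrapped around a blocking `subprocess.check_output` call. It is not the route_check.py source: the names `RouteCheckTimeout`, `run_with_deadline`, and `TIMEOUT_SECONDS` are hypothetical, and only the 120-second value and the "timeout occurred" message come from the log. It assumes a POSIX system and that the call is made from the main thread.

```python
import signal
import subprocess
import sys

TIMEOUT_SECONDS = 120  # value seen in the log above; hypothetical constant name


class RouteCheckTimeout(Exception):
    """Raised from the SIGALRM handler when the deadline expires."""


def _on_alarm(signum, frame):
    # Raising here interrupts whatever the main thread is blocked on,
    # for example the pipe read inside subprocess.communicate().
    raise RouteCheckTimeout("timeout occurred")


def run_with_deadline(cmd, seconds):
    """Run cmd (an argv list) and return its stdout as text.

    Raises RouteCheckTimeout if the command has not finished after
    `seconds` (an integer number of seconds).
    """
    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)
    try:
        return subprocess.check_output(cmd, text=True)
    finally:
        signal.alarm(0)                          # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the old handler


if __name__ == "__main__":
    # A child that sleeps longer than the deadline stands in for a slow
    # 'show ipv6 route json' on a system with a large route table.
    slow_child = [sys.executable, "-c", "import time; time.sleep(5)"]
    try:
        run_with_deadline(slow_child, 1)
    except RouteCheckTimeout as exc:
        print("aborted:", exc)
```

With a single fixed deadline like this, the check is reported as failed whenever route collection is merely slow, which matches the behaviour described in this issue.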

bingwang-ms commented 4 months ago

The issue will be triaged further in the chassis meeting

abdosi commented 4 months ago

@stephenxs @stepanblyschak @liat-grozovik : can you please help with this.

abdosi commented 4 months ago

@judyjoseph @arlakshm @mlok-nokia @ysmanman for visibility. This will apply to the master image as well.

arlakshm commented 4 months ago

Feature 'Install before advt.' might be disabled for 202405.

stepanblyschak commented 4 months ago

@anamehra Could you please share a tech support when the issue occurs? What is the route scale on the system? If you have an opportunity to play with the system, could you please increase the timeout to 1h and check whether route_check.py eventually finishes or is stuck without progress?

anamehra commented 3 months ago

The route_check eventually finished; it took a couple more minutes. We have 50K routes. I will check on the show tech.

rlhui commented 3 months ago

This is still an issue with 202405.

mannytaheri commented 1 month ago

@deepak-singhal0408 - I have attached logs for routeCheck issue. routeCheck_logs.txt

deepak-singhal0408 commented 4 days ago

This feature has been re-enabled in master: https://github.com/sonic-net/sonic-buildimage/pull/19836

deepak-singhal0408 commented 4 days ago

Tried 2 iterations on a device with 32k v4 + 32k v6 routes.

Neighbhor    V    AS     MsgRcvd  MsgSent  TblVer  InQ  OutQ  Up/Down   State/PfxRcd  NeighborName
10.0.0.1     4    65200  61249    14398    0       0    0     01:02:57  1             ARISTA01T3
10.0.0.5     4    65200  0        0        0       0    0     never     Active        ARISTA03T3
10.0.0.7     4    65200  6059     5857     0       0    0     4d00h24m  1             ARISTA04T3
10.0.0.11    4    65200  6056     5856     0       0    0     4d00h24m  33793         ARISTA06T3

Iteration 1:
Checking routes for namespaces: ['asic0', 'asic1']

real 3m16.387s
user 1m26.084s
sys 0m7.275s

Iteration 2:
Checking routes for namespaces: ['asic0', 'asic1']

real 3m18.249s
user 1m26.760s
sys 0m7.926s

deepak-singhal0408 commented 4 days ago

python -m cProfile -s time route_check.py
122726378 function calls (82385912 primitive calls) in 216.529 seconds

Ordered by: internal time

ncalls             tottime  percall  cumtime  percall  filename:lineno(function)
6                  90.089   15.015   90.089   15.015   {built-in method time.sleep}
14                 82.537   5.896    82.653   5.904    {method 'read' of '_io.TextIOWrapper' objects}
51279296/15794766  10.061   0.000    15.341   0.000    encoder.py:333(_iterencode_dict)
2                  6.252    3.126    6.252    3.126    {built-in method swsscommon._swsscommon.new_SubscriberStateTable}
12                 4.621    0.385    4.621    0.385    decoder.py:343(raw_decode)
20647482/15794694  3.588    0.000    10.100   0.000    encoder.py:277(_iterencode_list)
106                2.978    0.028    2.978    0.028    {method 'format' of 'str' objects}
15794766           2.714    0.000    18.055   0.000    encoder.py:413(_iterencode)
9                  1.360    0.151    19.613   2.179    encoder.py:182(encode)
12982522           1.148    0.000    1.148    0.000    {built-in method builtins.isinstance}
205278             0.854    0.000    1.632    0.000    ipaddress.py:1603(_ip_int_from_string)
4736970            0.720    0.000    0.720    0.000    {built-in method _json.encode_basestring_ascii}
821056             0.655    0.000    0.891    0.000    ipaddress.py:1201(_parse_octet)
410527             0.453    0.000    2.253    0.000    ipaddress.py:1269(__init__)
410514             0.446    0.000    1.700    0.000    ipaddress.py:1175(_ip_int_from_string)
615687             0.381    0.000    0.666    0.000    ipaddress.py:1707(_parse_hextet)
2                  0.374    0.187    180.955  90.478   route_check.py:520(check_frr_pending_routes) <<<<<<<<<<<<<<<<<
205295             0.316    0.000    2.137    0.000    ipaddress.py:1875(__init__)
139211             0.288    0.000    0.289    0.000    {method 'join' of 'str' objects}
273646             0.288    0.000    3.931    0.000    route_check.py:165(is_local)
1231834            0.285    0.000    0.285    0.000    {method 'split' of 'str' objects}
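
To make the profile easier to read, the short calculation below turns a few of the figures above into shares of the 216.529-second run. The numbers are copied from the cProfile output; the variable names are illustrative only, and reading the `read` entries as time spent waiting on command output is an interpretation based on the traceback in the description, not something the profile states.

```python
# Figures copied from the cProfile output above (seconds).
TOTAL_SECONDS = 216.529
TOTTIME = {
    "time.sleep": 90.089,
    "TextIOWrapper.read": 82.537,
}
# Cumulative times for two callers of interest.
CUMTIME = {
    "check_frr_pending_routes": 180.955,
    "json encoder.encode": 19.613,
}


def share_of_total(seconds, total=TOTAL_SECONDS):
    """Return seconds as a percentage of the whole profiled run."""
    return 100.0 * seconds / total


# Time spent sleeping or blocked on reads, i.e. not doing CPU work in Python.
waiting_seconds = sum(TOTTIME.values())
waiting_share = share_of_total(waiting_seconds)
frr_share = share_of_total(CUMTIME["check_frr_pending_routes"])

if __name__ == "__main__":
    print(f"sleep + read: {waiting_seconds:.3f}s ({waiting_share:.1f}% of run)")
    print(f"check_frr_pending_routes (cumulative): {frr_share:.1f}% of run)")
```

Roughly four fifths of the wall-clock time is spent sleeping or blocked on reads rather than computing, which suggests that overlapping those waits would shorten the run.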

deepak-singhal0408 commented 4 days ago

With the following optimizations, route_check time is reduced to about 1m30s:

  1. Parallel execution for each asic namespace
  2. Parallel fetching of routes for v4 and v6

time route_check.py
real 1m30.675s
user 1m33.777s
sys 0m8.209s
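
The two optimizations can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual change: the comment does not say how the parallelism was implemented, so the thread pool, the `fetch_all` helper, and the `fake_fetch` stand-in are all hypothetical. A real fetcher would run the relevant `show ... route json` command for one namespace and address family and parse its output.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_all(namespaces, families, fetch):
    """Call fetch(namespace, family) for every combination concurrently.

    Returns a dict keyed by (namespace, family). Any exception raised by
    a fetch call is re-raised here when its result is collected.
    """
    jobs = [(ns, fam) for ns in namespaces for fam in families]
    # One worker per job so every fetch can block at the same time.
    with ThreadPoolExecutor(max_workers=max(1, len(jobs))) as pool:
        futures = {job: pool.submit(fetch, *job) for job in jobs}
        return {job: future.result() for job, future in futures.items()}


def fake_fetch(namespace, family):
    """Stand-in for a slow route dump (hypothetical; no CLI is invoked)."""
    time.sleep(0.1)  # simulates waiting on the command's output
    return {"namespace": namespace, "family": family, "routes": []}


if __name__ == "__main__":
    results = fetch_all(["asic0", "asic1"], ["ipv4", "ipv6"], fake_fetch)
    for (namespace, family) in sorted(results):
        print(namespace, family, len(results[(namespace, family)]["routes"]))
```

Threads are a reasonable fit while the cost is dominated by waiting on subprocess output. The JSON decoding and route comparison still run under the interpreter lock, so the CPU-bound part would not speed up the same way, which is consistent with the user time staying close to the real time in the measurement above.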