sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC

[xcvrd][sfputil] sfputil response is extremely slow when used on a xcvr of type CMIS #12133

Open vivekrnv opened 2 years ago

vivekrnv commented 2 years ago

Description

The inefficiency is in SfpBase, xcvr_mem_maps, etc., and it also affects xcvrd, since both xcvrd and sfputil use the same SfpBase APIs, such as get_transceiver_info, get_transceiver_bulk_status, and get_transceiver_threshold_info. On a device with 30 front-panel ports and 30 QSFP-DD xcvrs, I've seen pmon CPU usage reach up to 35% with a period of 10-20 sec. pmon usage can get progressively worse as the number of front-panel ports increases.

Steps to reproduce the issue:

  1. Plug in a cable of type CMIS, e.g. QSFP-DD
  2. Run sfputil

Describe the results you received:

root@r-leopard-58:/home/admin# time sfputil show eeprom -p Ethernet0
Cannot get Module EEPROM data: Invalid argument
Ethernet0: SFP EEPROM detected
        Active Firmware Version: 0.0
        CMIS Revision: 4.0
        Identifier: QSFP-DD Double Density 8X Pluggable Transceiver
                Specification compliance: passive_copper_media_interface
        Vendor Date Code(YYYY-MM-DD Lot): 2020-12-19
        Vendor Name: Mellanox
        Vendor OUI: 00-02-c9
        Vendor PN: MCP1660-W00AE30
        Vendor Rev: A3
        Vendor SN: MT2051VS03513

real    0m4.875s
user    0m1.179s
sys     0m0.562s

In comparison:

QSFP28
root@r-leopard-58:/home/admin# time sfputil show eeprom -p Ethernet248
Ethernet248: SFP EEPROM detected
        Application Advertisement: N/A
        Connector: No separable connector
        Encoding: 64B/66B
        Extended Identifier: Power Class 1 Module (1.5W max.), No CLEI code present in Page 02h, No CDR in TX, No CDR in RX
        Extended RateSelect Compliance: Unknown
        Identifier: QSFP28 or later
        Length Cable Assembly(m): 2.0
        Nominal Bit Rate(100Mbs): 255
        Specification compliance:
                10/40G Ethernet Compliance Code: Unknown
                Extended Specification Compliance: 100GBASE-CR4, 25GBASE-CR CA-25G-L or 50GBASE-CR2 with RS
                Fibre Channel Link Length: Unknown
                Fibre Channel Speed: Unknown
                Fibre Channel Transmission Media: Unknown
                Fibre Channel Transmitter Technology: Unknown
                Gigabit Ethernet Compliant Codes: 1000BASE-CX
                SAS/SATA Compliance Codes: Unknown
                SONET Compliance Codes: Unknown
        Vendor Date Code(YYYY-MM-DD Lot): 2016-12-31
        Vendor Name: Mellanox
        Vendor OUI: 00-02-c9
        Vendor PN: MCP7H00-G01AR
        Vendor Rev: A1
        Vendor SN: MT1710VS04177

real    0m0.691s
user    0m0.275s
sys     0m0.110s

Triage

A single get_transceiver_info() call results in 31 calls to read_eeprom, and on a lot of platforms read_eeprom uses either a subprocess call or file open/read operations, which makes it extremely slow. Calling get_transceiver_domI() can add another 40+ calls to read_eeprom. Note: These stats were taken on the MSN4700 platform.

root@r-leopard-01:/home/admin# python3 -m cProfile -s tottime /usr/local/bin/sfputil show eeprom -p Ethernet0  | grep eeprom
root@r-leopard-01:/home/admin# cat pre_opt.txt
       31    0.002    0.000    4.387    0.142 sfp.py:350(_read_eeprom_specific_bytes)
       29    0.001    0.000    4.059    0.140 xcvr_eeprom.py:15(read)
       31    0.000    0.000    4.387    0.142 sfp.py:374(read_eeprom)
        1    0.000    0.000    4.411    4.411 main.py:611(eeprom)
       29    0.000    0.000    0.000    0.000 xcvr_eeprom.py:29(<dictcomp>)
        1    0.000    0.000    0.000    0.000 eeprom_dts.py:3(<module>)
        1    0.000    0.000    0.000    0.000 xcvr_eeprom.py:10(__init__)
        1    0.000    0.000    0.000    0.000 xcvr_eeprom.py:1(<module>)
        1    0.000    0.000    0.000    0.000 xcvr_eeprom.py:9(XcvrEeprom)
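
The call counts above came from running cProfile on the command line. For anyone who wants to check the count in-process, here is a minimal sketch using the standard cProfile and pstats modules. The `read_eeprom` and `get_info` functions below are hypothetical stand-ins, not the real platform or SfpBase code; they only mimic the "one read per field" pattern.

```python
import cProfile
import pstats


def read_eeprom(offset, num_bytes):
    """Hypothetical stand-in for a platform read_eeprom(); returns dummy bytes."""
    return bytearray(num_bytes)


def get_info():
    """Hypothetical caller that issues one read per field, 31 fields in total."""
    return [read_eeprom(offset, 1) for offset in range(31)]


def count_calls(func, name):
    """Profile func() and return how many times a function called `name` ran."""
    profiler = cProfile.Profile()
    profiler.runcall(func)
    stats = pstats.Stats(profiler)
    # stats.stats maps (file, line, function name) to
    # (primitive calls, total calls, tottime, cumtime, callers).
    return sum(entry[1] for key, entry in stats.stats.items() if key[2] == name)


print(count_calls(get_info, "read_eeprom"))
```

Running the same helper before and after an optimization gives a quick regression check on the number of EEPROM reads.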

SfpBase, Xcvr_Api, MemMap and the associated classes must be optimized. The ideal optimization target is a drastic reduction in calls to read_eeprom.
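
One possible direction, shown only as a rough sketch: fetch the EEPROM in larger chunks and serve the individual field reads from memory. Everything here is an assumption for illustration: `CachedEeprom`, `fake_raw_read`, and the 128-byte `PAGE_SIZE` are hypothetical names and values, not the existing SfpBase or XcvrEeprom API. Dynamic data such as DOM readings would also need the cache to be invalidated before each refresh.

```python
PAGE_SIZE = 128  # assumption: cache granularity, chosen to match a 128-byte page


class CachedEeprom:
    """Hypothetical wrapper that serves small field reads from cached pages."""

    def __init__(self, raw_read):
        # raw_read(offset, num_bytes) is the slow platform read; it is assumed
        # to return a bytearray, or None on failure.
        self._raw_read = raw_read
        self._pages = {}

    def read(self, offset, num_bytes):
        result = bytearray()
        end = offset + num_bytes
        while offset < end:
            page = offset // PAGE_SIZE
            if page not in self._pages:
                data = self._raw_read(page * PAGE_SIZE, PAGE_SIZE)
                if data is None:
                    return None
                self._pages[page] = bytes(data)
            start = offset % PAGE_SIZE
            take = min(end - offset, PAGE_SIZE - start)
            result += self._pages[page][start:start + take]
            offset += take
        return result

    def invalidate(self):
        """Drop cached pages, e.g. before re-reading dynamic (DOM) fields."""
        self._pages.clear()


# Usage with a fake device that records every slow read.
calls = []


def fake_raw_read(offset, num_bytes):
    calls.append((offset, num_bytes))
    return bytearray((offset + i) % 256 for i in range(num_bytes))


eeprom = CachedEeprom(fake_raw_read)
fields = [eeprom.read(off, 4) for off in range(0, 124, 4)]  # 31 field reads
print(len(calls))  # number of slow reads actually issued
```

In this sketch the 31 field reads all fall inside one page, so only one slow read reaches the device. The trade-off is that each miss transfers a full page even when a single byte is needed.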

dgsudharsan commented 2 years ago

@prgeor Can you please provide an ETA for the fix?

prgeor commented 2 years ago

@dgsudharsan there is an inherent issue where the mlnx platform makes several ethtool command calls via process calls, which makes sfputil much slower on the mlnx platform. Do you still see the issue after this fix?

vivekrnv commented 2 years ago

@dgsudharsan there is an inherent issue where the mlnx platform makes several ethtool command calls via process calls, which makes sfputil much slower on the mlnx platform. Do you still see the issue after this fix?

That fix significantly reduces the response time but the current approach still involves making multiple file open and read calls. I think SfpBase and the others can be optimized to reduce read_eeprom calls.

prgeor commented 1 year ago

@andywongarista let's discuss the fix for this issue introduced by the SFP refactor