sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC

[SFLOW] [error log] ERR sflow#port_index_mapper: <built-in function Select_select> returned a result with an error set #11711

Open dellwuchuan opened 2 years ago

dellwuchuan commented 2 years ago

Description

When sflow is enabled as background configuration alongside other configuration such as VLANs and ports, running sudo config reload -y can trigger the following sflow error log:

ERR sflow#port_index_mapper: <built-in function Select_select> returned a result with an error set

One of my colleagues investigated this issue, and I hope the findings below are useful. The log was seen during config reload, when the port_index_mapper process received an interrupt.

Aug 10 05:08:47.850139 r-ocelot-07 NOTICE sflow#port_index_mapper: got signal 15
Aug 10 05:08:47.851442 r-ocelot-07 ERR sflow#port_index_mapper: <built-in function Select_select> returned a result with an error set
Aug 10 05:08:47.851766 r-ocelot-07 INFO sflow#/supervisord: port_index_mapper File "/usr/bin/port_index_mapper.py", line 116, in <module>
Aug 10 05:08:47.851766 r-ocelot-07 INFO sflow#/supervisord: port_index_mapper main()
Aug 10 05:08:47.851802 r-ocelot-07 INFO sflow#/supervisord: port_index_mapper File "/usr/bin/port_index_mapper.py", line 108, in main
Aug 10 05:08:47.851802 r-ocelot-07 INFO sflow#/supervisord: port_index_mapper port_mapper.listen()
Aug 10 05:08:47.851802 r-ocelot-07 INFO sflow#/supervisord: port_index_mapper File "/usr/bin/port_index_mapper.py", line 71, in listen
Aug 10 05:08:47.851822 r-ocelot-07 INFO sflow#/supervisord: port_index_mapper (state, c) = self.sel.select(SELECT_TIMEOUT_MS)
Aug 10 05:08:47.851822 r-ocelot-07 INFO sflow#/supervisord: port_index_mapper File "/usr/lib/python3/dist-packages/swsscommon/swsscommon.py", line 1879, in select
Aug 10 05:08:47.851852 r-ocelot-07 INFO sflow#/supervisord: port_index_mapper return _swsscommon.Select_select(self, timeout)

This log indicates that something is wrong with SWIG or the swsscommon lib in terms of exception handling.

The error appears to be harmless, but it is worth documenting for the community.

Ref regarding the error log seen: https://stackoverflow.com/questions/53796264/systemerror-class-int-returned-a-result-with-an-error-set-in-python

In general:

"[R]eturned a result with an error set" is something that can only be done at the C level. i.e. the C function sets an exception, but then return some value other than NULL.

Steps to reproduce the issue:

  1. Configure the sflow function:
     sudo config feature state sflow enabled
     sudo config sflow enable
     sudo config sflow collector add collector0 50.0.0.2 --port 6343 --vrf default
     sudo config sflow interface disable all
     sudo config sflow interface enable Ethernet128
     sudo config sflow interface enable Ethernet248
  2. Wait 1 minute for the sflow docker to start up.
  3. Configure basic port/VLAN/portchannel configuration.

Describe the results you received:

In the system log of the SONiC switch, one sflow error log is caught during config reload -y: ERR sflow#port_index_mapper: <built-in function Select_select> returned a result with an error set

Describe the results you expected:

There should be no sflow error log

Output of show version:

(paste your output here)

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

zhangyanzhao commented 2 years ago

@venkatmahalingam can you please help to find someone in Dell to take a look? Thanks.

vivekrnv commented 2 years ago

When the port_index_mapper receives a signal 15 during config-reload, it immediately exits with a zero exit code https://github.com/sonic-net/sonic-buildimage/blob/master/dockers/docker-sflow/port_index_mapper.py#L102. At this point the select loop is still running. I'm not sure how this signal is propagated to the SWIG lib/C++ sources and how it is handled there.
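The Python side of this behaviour can be sketched with the standard library alone. When a SIGTERM handler calls sys.exit(0), the resulting SystemExit is raised inside whatever the main thread is executing at that moment, which here is the wait loop. The `blocking_wait` helper is a hypothetical stand-in for `self.sel.select(SELECT_TIMEOUT_MS)`, and the process signals itself to simulate the interrupt. This does not reproduce the SWIG/swsscommon error; it only shows where the exception surfaces in pure Python.

```python
import os
import signal
import sys
import time


def handler(signum, frame):
    # Mirrors the pattern described above: exit immediately with code 0.
    sys.exit(0)


def blocking_wait(timeout_s):
    # Hypothetical stand-in for sel.select(); it only sleeps.
    time.sleep(timeout_s)


def run_loop(max_iterations=50):
    """Return a tuple describing the SystemExit, or None if the loop finished."""
    signal.signal(signal.SIGTERM, handler)
    try:
        for i in range(max_iterations):
            if i == 2:
                # Simulate SIGTERM arriving while the loop is still running.
                os.kill(os.getpid(), signal.SIGTERM)
            blocking_wait(0.05)
    except SystemExit as exc:
        # The exception surfaces in the middle of the loop, not after it.
        return ("SystemExit raised inside loop", exc.code)
    return None


if __name__ == "__main__":
    print(run_loop())
```

In a C extension the equivalent situation is more delicate: if the handler's exception becomes pending while the wrapped function still returns a value, the interpreter may report "returned a result with an error set". Whether that is exactly what happens inside swsscommon is not confirmed in this thread.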

Other Python daemons which use swsscommon.select usually either exit with a non-zero exit code when they receive an interrupt (e.g. https://github.com/sonic-net/sonic-host-services/blob/master/scripts/hostcfgd#L78) or break the select loop using a global flag (https://github.com/sonic-net/sonic-buildimage/blob/master/src/sonic-bgpcfgd/bgpcfgd/runner.py#L53). I'm not sure which is the right way to handle an interrupt. Is there a preferred way to do so?
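For comparison, a minimal sketch of the flag-based approach, again using only the standard library. The handler records the signal and clears a flag, and the loop exits on its own after the current wait returns, so no exception is raised from inside the handler. The `Daemon` class, `wait_for_event` (a stand-in for the select call), the short timeout, and the `128 + signum` exit status (a common shell convention) are all assumptions for illustration, not what hostcfgd or bgpcfgd actually do.

```python
import os
import signal
import sys
import time

SELECT_TIMEOUT_S = 0.05  # assumed short timeout so the flag is re-checked often


class Daemon:
    def __init__(self):
        self.running = True
        self.received_signal = None

    def on_signal(self, signum, frame):
        # Only record the request; do not raise or exit from the handler.
        self.received_signal = signum
        self.running = False

    def wait_for_event(self):
        # Hypothetical stand-in for sel.select(SELECT_TIMEOUT_MS).
        time.sleep(SELECT_TIMEOUT_S)

    def listen(self, max_iterations=100):
        iterations = 0
        while self.running and iterations < max_iterations:
            self.wait_for_event()
            iterations += 1
        return iterations


def main():
    daemon = Daemon()
    signal.signal(signal.SIGTERM, daemon.on_signal)
    os.kill(os.getpid(), signal.SIGTERM)  # simulate the interrupt
    daemon.listen()
    if daemon.received_signal is None:
        return 0
    # Assumed convention: report termination by signal as 128 + signal number.
    return 128 + daemon.received_signal


if __name__ == "__main__":
    sys.exit(main())
```

With this shape the wait call always returns normally before the process exits, at the cost of a shutdown delay of up to one timeout interval.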

liat-grozovik commented 1 year ago

@venkatmahalingam any update on the ETA for when such a fix can be available?

venkatmahalingam commented 1 year ago

@padmanarayana Please comment on this issue.

venkatmahalingam commented 1 year ago

@jeff-yin FYI.

venkatmahalingam commented 1 year ago

@Gokulnath-Raja Please update the latest status on this bug.