Closed vnam1 closed 6 years ago
Caused outage in our local network and had to reboot the box to recover.
Hi @vnam1 can you provide details about the opx-show-version
and opx-show-system-status
output, and if any packages are modified, the output of opx-show-packages
as well.
Also will need information about any applications you have installed that are performing the CPS transactions shown in the syslog.
admin@r101-l1:~$ sudo opx-show-system-status s sudo: unable to resolve host r101-l1 System State: degraded Failed Services opx-nas.service No Modified Package
--
root@r101-l1:/var/log/aos# opx-show-version OS_NAME="OPX" OS_VERSION="2.3.0" PLATFORM="S3048-ON" ARCHITECTURE="x86_64" INTERNAL_BUILD_ID="OpenSwitch blueprint for Dell 1.0.0" BUILD_VERSION="2.3.0.0" BUILD_DATE="2018-03-27T20:07:19-0700" INSTALL_DATE="2018-06-15T02:27:13+0000" SYSTEM_UPTIME= 1 hour, 7 minutes SYSTEM_STATE= running
--
libopx-nas-l3-1 3.2.0+opx2 3.2.0+opx2 No This package contains base layer 3 functionality for the Openswitch software.
libopx-nas-l2-1 1.17.0+opx1 1.17.0+opx1 No This package contains L2 Network abstraction features
opx-platform-config-dell-s3048 1.4.2 1.4.2 No Platform-specific files and packages for Dell Networking S3048-ON
opx-kernel-dell-3.16.0-4-amd64 3.16.43-2+deb8u2 3.16.43-2+deb8u2 No Linux 3.16 for 64-bit PCs
libopx-logging1 2.1.1 2.1.1 No Openswitch logging - shared library
opx-tools 1.0.1 1.0.1 No Openswitch tools which will be used in OPX Package Management
libopx-nas-common1 6.1.0+opx3 6.1.0+opx3 No This package contains common utilities for the network abstraction component.
libopx-common1 1.4.1 1.4.1 No This package contains general utilities for the Openswitch project.
opx-tmpctl 2.1.1 2.1.1 No Files common to all platforms
libopx-nas-qos1 2.2.0+opx2 2.2.0+opx2 No This package contains base QoS functionality for the Openswitch software.
opx-nas-interface 5.10.1+opx11 5.10.1+opx11 No This package contains nas-interface for the Openswitch software.
python-opx-logging 2.1.1 2.1.1 No Open Switch logging - python binding
libopx-cps1 3.6.2+opx4 3.6.2+opx4 No This package contains the Openswitch object library. The object library is provides APIs to set/get/delete/create and execute RPC calls using data models provided by components.
libopx-nas-ndi1 3.26.0+opx2 3.26.0+opx2 No This package contains nas-ndi code
libopx-nas-linux1 5.11.0+opx4 5.11.0+opx4 No This package contains the Linux integration portion of the Network abstractions service.
opx-dell-modules 1.2.0 1.2.0 No Kernel modules for Dell S6000 platform.
opx-nas-l3 3.2.0+opx2 3.2.0+opx2 No This package contains base layer 3 functionality for the Openswitch software.
opx-nas-l2 1.17.0+opx1 1.17.0+opx1 No This package contains L2 Network abstraction features
opx-nas-common 6.1.0+opx3 6.1.0+opx3 No This package contains common utilities for the network abstraction component.
libopx-nas-interface1 5.10.1+opx11 5.10.1+opx11 No This package contains nas-interface for the Openswitch software.
opx-nas-daemon 2.1.0+opx3 2.1.0+opx3 No This package contains the Network Abstraction service.
opx-bcm-lkm 2.1.2 2.1.2 No OPX kernel modules
libopx-sdi-device-drivers1 4.18.0 4.18.0 No This package contains the device driver plugin to be used with opx-sdi-sys package.
opx-core 2.3.0 2.3.0 No Core (platform-independent) OPX implementation
python-opx-snmp 1.0.3 1.0.3 No SNMP agent and handlers
libopx-sai-bcm-esw1 21.1.0+opx4 21.1.0+opx4 No This package implements the NPU plugin APIs to interface Broadcom ESW with the Dell SAI implementation.
libopx-nas-acl1 2.4.0+opx3 2.4.0+opx3 No This package contains base ACL functionality for the Openswitch software.
libopx-sdi-sys1 4.18.0 4.18.0 No This package contains implementation of System Device Interface(SDI)
opx-dell-s3048 1.4.2 1.4.2 No All packages for Dell Networking S3048-ON
opx-nas-acl 2.4.0+opx3 2.4.0+opx3 No This package contains base ACL scripts for the Openswitch software.
python-opx-cps 3.6.2+opx4 3.6.2+opx4 No This package contains the Openswitch object library. The object library is provides APIs to set/get/delete/create and execute RPC calls using data models provided by components.
libopx-base-model1 3.112.0 3.112.0 No Openswitch base model - object models for all services
opx-udrvr-ethernet 1.0.2 1.0.2 No DPDK driver for management interface (adding support for BCM54626 PHY)
opx-alarm 1.0.2 1.0.2 No Files common to all platforms opx-logging 2.1.1 2.1.1 No Openswitch logging - logging utilies
libopx-sdi-framework1 4.18.0 4.18.0 No This package contains common framework for System Device Interface(SDI)
opx-pas 3.0.1+opx3 3.0.1+opx3 No This package contains the Platform Adaptation Service.
opx-nas-linux 5.11.0+opx4 5.11.0+opx4 No This package contains the Linux integration portion of the Network abstractions service.
opx-nas-qos 2.2.0+opx2 2.2.0+opx2 No This package contains base QoS functionality for the Openswitch software.
libopx-pas1 3.0.1+opx3 3.0.1+opx3 No This package contains the Platform Adaptation Service.
opx-cps 3.6.2+opx4 3.6.2+opx4 No This package contains the Openswitch object library. The object library is provides APIs to set/get/delete/create and execute RPC calls using data models provided by components.
System was up for an hour and shows a failure of opx-nas.service? What does systemctl status opx-nas.service
yield?
opx-show-version is taken after reboot
actual uptime was 7days
Understood, but even after reboot it looks like there is a service failure. systemctl status -l opx-nas.service
output might help explain why.
Oh, wait. It looks like opx-show-system-status
is showing system state as "degraded" but opx-show-version
is showing system state as "running."
admin@r101-l1:~$ sudo opx-show-system-status s
sudo: unable to resolve host r101-l1
System State: degraded
Failed Services
opx-nas.service
No Modified Package
--
root@r101-l1:/var/log/aos# opx-show-version
OS_NAME="OPX"
OS_VERSION="2.3.0"
PLATFORM="S3048-ON"
ARCHITECTURE="x86_64"
INTERNAL_BUILD_ID="OpenSwitch blueprint for Dell 1.0.0"
BUILD_VERSION="2.3.0.0"
BUILD_DATE="2018-03-27T20:07:19-0700"
INSTALL_DATE="2018-06-15T02:27:13+0000"
SYSTEM_UPTIME= 1 hour, 7 minutes
SYSTEM_STATE= running
Were these two commands executed at different times? The system states should not be mismatched AFAIK...
Never mind, I got it. Seems like they were indeed executed at different times, with opx-show-version
taken after reboot. If you run into the failure again, the systemctl status -l opx-nas.service
might help. Will continue to triage from here...
FYI: we also run the following commands every couple of minutes for monitoring in steady state.
opx-show-stats if_stat
lldpctl -f json
brctl show
brctl showmacs
We have another box as well where high mem usage is seen: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 1170 root 20 0 4268044 2.161g 64828 S 4.6 27.7 2815:42 opx_nas_daemon
root@leaf-1:~# uptime 23:41:18 up 19 days, 4:59, 5 users, load average: 0.18, 0.11, 0.09
Would it be possible to turn off opx-show-stats monitoring for a while? Since that is handled by NAS, if the memory usage does not increase further, then we know that that command is the culprit.
When I stop our process that does 'opx-show-stats if_stat' periodically, memory usage stays the same. When I restart it starts growing again. Growth seems to be around 1 - 2k every second. @nchervela would you be able to try this out to confirm the behavior I see?
@vnam1 I'll try it out today
@nchervela did you find anything on this? This is blocking us since some of the switches that we use (s3048) have low memory (2G) and the memory leak is causing the network outage.
Thanks
@vnam1 I started the script Friday, but had some issues with the system this morning. My OPX switch rebooted without any obvious reason. So, am restarting the test again.
@nchervela , Is it possible to run through valgrind or something to see the leak? This seems to leak with every call to opx-show-stats. This has become a complete showstopper for us having to monitor and restart opx_nas on all switches periodically.
@vnam1, @nchervela will probably send out a more detailed update tomorrow, but wanted to give you an early heads up that she was able to reproduce the issue on OPX 2.3.0, but not on OPX 2.3.1.
Thanks @jeff-yin @nchervela for quickly finding the issue. Really appreciate it. Question: is this issue platform specific?
@vnam1, As @jeff-yin mentioned, the issue was reproduced on OPX 2.3.0. Also verified that the issue no longer exists in 2.3.1. So upgrading to the next release should take care of this issue for you.
This is not platform specific. I reproduced on S6000, and you saw it on S3048 platform.
is this sufficient information to close this ticket?
@nchervela , let me upgrade to 2.3.1 and verify before closing. Just wanted to make sure none our use cases is triggering the leak.
@vnam1 sure
@nchervela Is there a bug-id that fixed the issue in 2.3.1? Just curious on the root cause.
@nchervela Also, is restarting opx-nas network impacting? Wanted to know if there is a way to restart if we see this bug.
@vnam1 we spoke to one of the NAS developers and he said that would not be possible. NAS does not support hitless restart at this time.
@vnam1 @nchervela will get back to you on the fix for 2.3.1. The issue wasn't explicitly reported against 2.3.0 nor was it reported internally; it was most likely fixed collaterally.
@jeff-yin Thanks for the info.
root cause would pretty much have to be memory being allocated and subsequently not freed. this issue could likely have been fixed by https://github.com/open-switch/opx-nas-interface/commit/2412a22162cca8e42e312b2180d2556f29c9add6. we are confident that the issue is fixed in 2.3.1
@vnam1 I'm going to close this issue for now if it's alright with you.
Verified. Can be closed.
Observed this on a switch (Dell s3048)
opx-nas daemon consuming too much memory and causing OOM. Service did not recover after it died.
memory usage seen on another similar system after 7 days. High usage can be seen here
top - 11:19:48 up 7 days, 19:11, 4 users, load average: 7.48, 7.07, 7.03 Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie %Cpu(s): 4.1 us, 40.5 sy, 0.8 ni, 16.3 id, 36.8 wa, 0.0 hi, 1.5 si, 0.0 st KiB Mem: 2039604 total, 1969608 used, 69996 free, 3360 buffers KiB Swap: 0 total, 0 used, 0 free. 58264 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 577 root 20 0 2878796 1.271g 0 S 5.3 65.3 1083:27 opx_nas_daemon
system uptime 10:49:55 up 7 days, 18:41, 4 users, load average: 7.30, 7.48, 7.05
syslog:
Jul 5 17:01:54 r101-l1 kernel: [643704.356321] Out of memory: Kill process 1395 (opx_nas_daemon) score 622 or sacrifice child Jul 5 17:01:54 r101-l1 kernel: [643704.365699] Killed process 1395 (opx_nas_daemon) total-vm:2813264kB, anon-rss:1304868kB, file-rss:0kB Jul 5 17:01:54 r101-l1 base_nbr_mgr_svc[1305]: [DSAPI:CPS IPC], Failed to read the receive header Jul 5 17:01:54 r101-l1 base_nbr_mgr_svc[1305]: [DSAPI:COMMIT], Failed to commit request at 0 out of 1 Jul 5 17:01:54 r101-l1 opx_cps_service[1300]: [DSAPI:NS], Added registration removed 1.4.1. Jul 5 17:01:55 r101-l1 base_nbr_mgr_svc[1305]: [NBR_MGR:NAS-NBR-UPD], CPS API Commit failed 1 Jul 5 17:01:55 r101-l1 opx_cps_service[1300]: [DSAPI:NS], Added registration removed 1.3. Jul 5 17:01:55 r101-l1 opx_cps_service[1300]: [DSAPI:NS], Added registration removed 1.4.2. Jul 5 17:01:55 r101-l1 opx_cps_service[1300]: [DSAPI:NS], Added registration removed 1.4.3.
opx-nas does not restart/recover after going down