open-switch / opx-nas-daemon

https://openswitch.net
1 stars 9 forks source link

opx-nas daemon crash because of OOM #28

Closed vnam1 closed 6 years ago

vnam1 commented 6 years ago

Observed this on a switch (Dell s3048)

opx-nas daemon consuming too much memory and causing OOM. Service did not recover after it died.

memory usage seen on another similar system after 7 days. High usage can be seen here

top - 11:19:48 up 7 days, 19:11, 4 users, load average: 7.48, 7.07, 7.03 Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie %Cpu(s): 4.1 us, 40.5 sy, 0.8 ni, 16.3 id, 36.8 wa, 0.0 hi, 1.5 si, 0.0 st KiB Mem: 2039604 total, 1969608 used, 69996 free, 3360 buffers KiB Swap: 0 total, 0 used, 0 free. 58264 cached Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 577 root 20 0 2878796 1.271g 0 S 5.3 65.3 1083:27 opx_nas_daemon

system uptime 10:49:55 up 7 days, 18:41, 4 users, load average: 7.30, 7.48, 7.05

syslog:

Jul 5 17:01:54 r101-l1 kernel: [643704.356321] Out of memory: Kill process 1395 (opx_nas_daemon) score 622 or sacrifice child Jul 5 17:01:54 r101-l1 kernel: [643704.365699] Killed process 1395 (opx_nas_daemon) total-vm:2813264kB, anon-rss:1304868kB, file-rss:0kB Jul 5 17:01:54 r101-l1 base_nbr_mgr_svc[1305]: [DSAPI:CPS IPC], Failed to read the receive header Jul 5 17:01:54 r101-l1 base_nbr_mgr_svc[1305]: [DSAPI:COMMIT], Failed to commit request at 0 out of 1 Jul 5 17:01:54 r101-l1 opx_cps_service[1300]: [DSAPI:NS], Added registration removed 1.4.1. Jul 5 17:01:55 r101-l1 base_nbr_mgr_svc[1305]: [NBR_MGR:NAS-NBR-UPD], CPS API Commit failed 1 Jul 5 17:01:55 r101-l1 opx_cps_service[1300]: [DSAPI:NS], Added registration removed 1.3. Jul 5 17:01:55 r101-l1 opx_cps_service[1300]: [DSAPI:NS], Added registration removed 1.4.2. Jul 5 17:01:55 r101-l1 opx_cps_service[1300]: [DSAPI:NS], Added registration removed 1.4.3.

opx-nas does not restart/recover after going down

vnam1 commented 6 years ago

Caused outage in our local network and had to reboot the box to recover.

jeff-yin commented 6 years ago

Hi @vnam1 can you provide details about the opx-show-version and opx-show-system-status output, and if any packages are modified, the output of opx-show-packages as well. Also will need information about any applications you have installed that are performing the CPS transactions shown in the syslog.

vnam1 commented 6 years ago

admin@r101-l1:~$ sudo opx-show-system-status s sudo: unable to resolve host r101-l1 System State: degraded Failed Services opx-nas.service No Modified Package

--

root@r101-l1:/var/log/aos# opx-show-version OS_NAME="OPX" OS_VERSION="2.3.0" PLATFORM="S3048-ON" ARCHITECTURE="x86_64" INTERNAL_BUILD_ID="OpenSwitch blueprint for Dell 1.0.0" BUILD_VERSION="2.3.0.0" BUILD_DATE="2018-03-27T20:07:19-0700" INSTALL_DATE="2018-06-15T02:27:13+0000" SYSTEM_UPTIME= 1 hour, 7 minutes SYSTEM_STATE= running

--

root@r101-l1:/var/log/aos# opx-show-packages Name Original Version Current Version Altered Description

libopx-nas-l3-1 3.2.0+opx2 3.2.0+opx2 No This package contains base layer 3 functionality for the Openswitch software.

libopx-nas-l2-1 1.17.0+opx1 1.17.0+opx1 No This package contains L2 Network abstraction features

opx-platform-config-dell-s3048 1.4.2 1.4.2 No Platform-specific files and packages for Dell Networking S3048-ON

opx-kernel-dell-3.16.0-4-amd64 3.16.43-2+deb8u2 3.16.43-2+deb8u2 No Linux 3.16 for 64-bit PCs

libopx-logging1 2.1.1 2.1.1 No Openswitch logging - shared library

opx-tools 1.0.1 1.0.1 No Openswitch tools which will be used in OPX Package Management

libopx-nas-common1 6.1.0+opx3 6.1.0+opx3 No This package contains common utilities for the network abstraction component.

libopx-common1 1.4.1 1.4.1 No This package contains general utilities for the Openswitch project.

opx-tmpctl 2.1.1 2.1.1 No Files common to all platforms

libopx-nas-qos1 2.2.0+opx2 2.2.0+opx2 No This package contains base QoS functionality for the Openswitch software.

opx-nas-interface 5.10.1+opx11 5.10.1+opx11 No This package contains nas-interface for the Openswitch software.

python-opx-logging 2.1.1 2.1.1 No Open Switch logging - python binding

libopx-cps1 3.6.2+opx4 3.6.2+opx4 No This package contains the Openswitch object library. The object library is provides APIs to set/get/delete/create and execute RPC calls using data models provided by components.

libopx-nas-ndi1 3.26.0+opx2 3.26.0+opx2 No This package contains nas-ndi code

libopx-nas-linux1 5.11.0+opx4 5.11.0+opx4 No This package contains the Linux integration portion of the Network abstractions service.

opx-dell-modules 1.2.0 1.2.0 No Kernel modules for Dell S6000 platform.

opx-nas-l3 3.2.0+opx2 3.2.0+opx2 No This package contains base layer 3 functionality for the Openswitch software.

opx-nas-l2 1.17.0+opx1 1.17.0+opx1 No This package contains L2 Network abstraction features

opx-nas-common 6.1.0+opx3 6.1.0+opx3 No This package contains common utilities for the network abstraction component.

libopx-nas-interface1 5.10.1+opx11 5.10.1+opx11 No This package contains nas-interface for the Openswitch software.

opx-nas-daemon 2.1.0+opx3 2.1.0+opx3 No This package contains the Network Abstraction service.

opx-bcm-lkm 2.1.2 2.1.2 No OPX kernel modules

libopx-sdi-device-drivers1 4.18.0 4.18.0 No This package contains the device driver plugin to be used with opx-sdi-sys package.

opx-core 2.3.0 2.3.0 No Core (platform-independent) OPX implementation

python-opx-snmp 1.0.3 1.0.3 No SNMP agent and handlers

libopx-sai-bcm-esw1 21.1.0+opx4 21.1.0+opx4 No This package implements the NPU plugin APIs to interface Broadcom ESW with the Dell SAI implementation.

libopx-nas-acl1 2.4.0+opx3 2.4.0+opx3 No This package contains base ACL functionality for the Openswitch software.

libopx-sdi-sys1 4.18.0 4.18.0 No This package contains implementation of System Device Interface(SDI)

opx-dell-s3048 1.4.2 1.4.2 No All packages for Dell Networking S3048-ON

opx-nas-acl 2.4.0+opx3 2.4.0+opx3 No This package contains base ACL scripts for the Openswitch software.

python-opx-cps 3.6.2+opx4 3.6.2+opx4 No This package contains the Openswitch object library. The object library is provides APIs to set/get/delete/create and execute RPC calls using data models provided by components.

libopx-base-model1 3.112.0 3.112.0 No Openswitch base model - object models for all services

opx-udrvr-ethernet 1.0.2 1.0.2 No DPDK driver for management interface (adding support for BCM54626 PHY)

opx-alarm 1.0.2 1.0.2 No Files common to all platforms opx-logging 2.1.1 2.1.1 No Openswitch logging - logging utilies

libopx-sdi-framework1 4.18.0 4.18.0 No This package contains common framework for System Device Interface(SDI)

opx-pas 3.0.1+opx3 3.0.1+opx3 No This package contains the Platform Adaptation Service.

opx-nas-linux 5.11.0+opx4 5.11.0+opx4 No This package contains the Linux integration portion of the Network abstractions service.

opx-nas-qos 2.2.0+opx2 2.2.0+opx2 No This package contains base QoS functionality for the Openswitch software.

libopx-pas1 3.0.1+opx3 3.0.1+opx3 No This package contains the Platform Adaptation Service.

opx-cps 3.6.2+opx4 3.6.2+opx4 No This package contains the Openswitch object library. The object library is provides APIs to set/get/delete/create and execute RPC calls using data models provided by components.

jeff-yin commented 6 years ago

System was up for an hour and shows a failure of opx-nas.service? What does systemctl status opx-nas.service yield?

vnam1 commented 6 years ago

opx-show-version is taken after reboot

vnam1 commented 6 years ago

actual uptime was 7days

jeff-yin commented 6 years ago

Understood, but even after reboot it looks like there is a service failure. systemctl status -l opx-nas.service output might help explain why.

jeff-yin commented 6 years ago

Oh, wait. It looks like opx-show-system-status is showing system state as "degraded" but opx-show-version is showing system state as "running."

admin@r101-l1:~$ sudo opx-show-system-status s
sudo: unable to resolve host r101-l1
System State: degraded
Failed Services
opx-nas.service
No Modified Package

--

root@r101-l1:/var/log/aos# opx-show-version
OS_NAME="OPX"
OS_VERSION="2.3.0"
PLATFORM="S3048-ON"
ARCHITECTURE="x86_64"
INTERNAL_BUILD_ID="OpenSwitch blueprint for Dell 1.0.0"
BUILD_VERSION="2.3.0.0"
BUILD_DATE="2018-03-27T20:07:19-0700"
INSTALL_DATE="2018-06-15T02:27:13+0000"
SYSTEM_UPTIME= 1 hour, 7 minutes
SYSTEM_STATE= running

Were these two commands executed at different times? The system states should not be mismatched AFAIK...

jeff-yin commented 6 years ago

Never mind, I got it. Seems like they were indeed executed at different times, with opx-show-version taken after reboot. If you run into the failure again, the systemctl status -l opx-nas.service might help. Will continue to triage from here...

vnam1 commented 6 years ago

FYI: we also run the following commands every couple of minutes for monitoring in steady state.

opx-show-stats if_stat lldpctl -f json brctl show brctl showmacs ip -4 neigh

vnam1 commented 6 years ago

We have another box as well where high mem usage is seen: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 1170 root 20 0 4268044 2.161g 64828 S 4.6 27.7 2815:42 opx_nas_daemon

root@leaf-1:~# uptime 23:41:18 up 19 days, 4:59, 5 users, load average: 0.18, 0.11, 0.09

nchervela commented 6 years ago

Would it be possible to turn off opx-show-stats monitoring for a while? Since that is handled by NAS, if the memory usage does not increase further, then we know that that command is the culprit.

vnam1 commented 6 years ago

When I stop our process that does 'opx-show-stats if_stat' periodically, memory usage stays the same. When I restart it starts growing again. Growth seems to be around 1 - 2k every second. @nchervela would you be able to try this out to confirm the behavior I see?

nchervela commented 6 years ago

@vnam1 I'll try it out today

vnam1 commented 6 years ago

@nchervela did you find anything on this? This is blocking us since some of the switches that we use (s3048) have low memory (2G) and the memory leak is causing the network outage.

Thanks

nchervela commented 6 years ago

@vnam1 I started the script Friday, but had some issues with the system this morning. My OPX switch rebooted without any obvious reason. So, am restarting the test again.

vnam1 commented 6 years ago

@nchervela , Is it possible to run through valgrind or something to see the leak? This seems to leak with every call to opx-show-stats. This has become a complete showstopper for us having to monitor and restart opx_nas on all switches periodically.

jeff-yin commented 6 years ago

@vnam1, @nchervela will probably send out a more detailed update tomorrow, but wanted to give you an early heads up that she was able to reproduce the issue on OPX 2.3.0, but not on OPX 2.3.1.

vnam1 commented 6 years ago

Thanks @jeff-yin @nchervela for quickly finding the issue. Really appreciate it. Question: is this issue platform specific?

nchervela commented 6 years ago

@vnam1, As @jeff-yin mentioned, the issue was reproduced on OPX 2.3.0. Also verified that the issue no longer exists in 2.3.1. So upgrading to the next release should take care of this issue for you.

This is not platform specific. I reproduced on S6000, and you saw it on S3048 platform.

is this sufficient information to close this ticket?

vnam1 commented 6 years ago

@nchervela , let me upgrade to 2.3.1 and verify before closing. Just wanted to make sure none our use cases is triggering the leak.

nchervela commented 6 years ago

@vnam1 sure

vnam1 commented 6 years ago

@nchervela Is there a bug-id that fixed the issue in 2.3.1? Just curious on the root cause.

vnam1 commented 6 years ago

@nchervela Also, is restarting opx-nas network impacting? Wanted to know if there is a way to restart if we see this bug.

jeff-yin commented 6 years ago

@vnam1 we spoke to one of the NAS developers and he said that would not be possible. NAS does not support hitless restart at this time.

jeff-yin commented 6 years ago

@vnam1 @nchervela will get back to you on the fix for 2.3.1. The issue wasn't explicitly reported against 2.3.0 nor was it reported internally; it was most likely fixed collaterally.

vnam1 commented 6 years ago

@jeff-yin Thanks for the info.

nchervela commented 6 years ago

root cause would pretty much have to be memory being allocated and subsequently not freed. this issue could likely have been fixed by https://github.com/open-switch/opx-nas-interface/commit/2412a22162cca8e42e312b2180d2556f29c9add6. we are confident that the issue is fixed in 2.3.1

jeff-yin commented 6 years ago

@vnam1 I'm going to close this issue for now if it's alright with you.

vnam1 commented 6 years ago

Verified. Can be closed.