Snimpy Memory Leak - Githubissues

hanynowsky commented 9 years ago

Using Snimpy in some Celery Tasks, the server RAM consumption grows non-stop to the point kernel starts killing Celery workers process (oom). After running Valgrind against a simple Snimpy Script, Memory leak is detected:

valgrind --leak-check=yes --track-origins=yes python sdsl.py

==10762== LEAK SUMMARY:
==10762==    definitely lost: 609 bytes in 45 blocks
==10762==    indirectly lost: 0 bytes in 0 blocks
==10762==      possibly lost: 3,456 bytes in 6 blocks
==10762==    still reachable: 1,192,945 bytes in 5,951 blocks
==10762==         suppressed: 0 bytes in 0 blocks
==10762== Reachable blocks (those to which a pointer was found) are not shown.
==10762== To see them, rerun with: --leak-check=full --show-reachable=yes
==10762== 
==10762== For counts of detected and suppressed errors, rerun with: -v
==10762== ERROR SUMMARY: 4499 errors from 134 contexts (suppressed: 103 from 9)


#!/usr/bin/python
#import traceback
import os
import sys
from snimpy.manager import Manager as M
from snimpy.manager import load

def sdsl_test(hostname, community):
        mib_path = os.path.join('/home/masac/masac', 'scripts', 'mib')
        load(os.path.join(mib_path, 'SNMPv2-SMI.txt'))
        load(os.path.join(mib_path, 'SNMPv2-TC.txt'))
        load(os.path.join(mib_path, 'SNMPv2-CONF.txt'))
        load(os.path.join(mib_path, 'SNMPv2-MIB.txt'))
        load(os.path.join(mib_path, 'IANAifType-MIB.txt'))
        load(os.path.join(mib_path, 'IF-MIB.txt'))
        """ If you put cache to a value other than 0, you get RuntimeError relative to Dictionary size change"""
        m = M(host=hostname, retries=0, community=community, version=2, timeout=1, cache=0)
        uptime = 0
        try:
                with m:
                        uptime = m.sysUpTime
                        try:
                                for ix in m.ifDescr:
                                        desc = m.ifDescr[ix]
                                        metric = m.ifHCInOctets[ix]
                                        print(desc + " " + str(metric))
                        except Exception as e:
                                print(e)
        except Exception as e:
                print(e)
        finally:
                if uptime:
                        print(uptime)
sdsl_test('Router.IP','my-community')
sys.exit()

If needed, I can provide code of the Celery Task using Snimpy. basically the task loop over a list of network routers, fetches metrics and send them to a server.

vincentbernat commented 9 years ago

Memory leaks are expected as I did use Snimpy only in one-time scripts. I will look at that and fix them. Thanks for reporting!

I see that you noticed another bug, don't hesitate to open an issue for it.

vincentbernat commented 9 years ago

Just before sys.exit(), could you add:

import snimpy.mib
snimpy.mib._smi.smiExit()

This should remove any potential leaks from libsmi (but won't solve your problem).

What are your Celery tasks lifetime? Is it like enclosing your code in a for loop? Or is it doing something like reloading modules?

hanynowsky commented 9 years ago

Thank you fore your reactiveness. Well, as you said, the _smi.smiExit() does not help. The nature of Python garbage collection makes this leak happen I guess And celery being an asynchronous task processor, does not help neither :) The celery Task is crontabbed each minute or each five minutes. Here is the source:


from __future__ import absolute_import
from celery import shared_task
from celery.task import task
from django.core.management.base import BaseCommand, CommandError
from celery.decorators import periodic_task
from celery.schedules import crontab
import platform
import socket
import time
import sys 
import os
import requests
import json
from django.conf import settings
from snimpy.manager import Manager as M
from snimpy.manager import load
from snimpy.snmp import SNMPNoSuchInstance
from snimpy.snmp import SNMPException
from snimpy.mib import SMIException
from socket import gaierror
from pyasn1.type.error import ValueConstraintError
import os
import traceback
from masac.feed.models import CiscoEquipment
from masac.feed.models import Routers
from masac.feed.models import Switches
import pickle
from copy import deepcopy
from copy import copy
import sqlite3 as lite
import inspect
from os.path import expanduser

LOG_DEBUG = settings.LOG_DEBUG
CARBON_SERVER = settings.CARBON_SERVER # Server where Cyanite is running
CARBON_PORT = settings.CARBON_PORT
SDSL_TIMEOUT = settings.SDSL_TIMEOUT
SWITCH_TIMEOUT = settings.SWITCH_TIMEOUT
ROUTER_TIMEOUT = settings.ROUTER_TIMEOUT
@periodic_task(run_every=(crontab(hour="*", minute="*/3", day_of_week="*")), queue='masac.switch', options={'queue': 'masac.switch', 'routing_key' : 'masac.switch'})
def send_switch_metrics():
    try:
        node = platform.node().replace('.', '-')
        timestamp = int(time.time())
        switch_list =  Switches.objects.filter(marque='switch', banned=0)
        DELAY = 1
        #sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock = socket.socket()
        sock.connect((CARBON_SERVER, CARBON_PORT))
        message = ''
        mib_path = os.path.join(settings.BASE_DIR, 'scripts', 'mib')
        load(os.path.join(mib_path, 'SNMPv2-SMI.txt'))
        load(os.path.join(mib_path, 'SNMPv2-TC.txt'))
        load(os.path.join(mib_path, 'SNMPv2-CONF.txt'))
        load(os.path.join(mib_path, 'SNMPv2-MIB.txt'))
        load(os.path.join(mib_path, 'IANAifType-MIB.txt'))
        load(os.path.join(mib_path, 'IF-MIB.txt'))

        list_count = len(switch_list) 
        if list_count < 1:
            print("Warning: List of switches is empty. Re-feed....")
            # get_switches()
        for device in switch_list:
            hostname = str(device.login)
            ip = str(device.ip)
            comm = str(device.snmp_community)
            if LOG_DEBUG == True:
                print("Testing Switch: " + hostname + " " + ip + " " + comm)
            try:
                m = M(host=hostname, retries=0, community=comm, version=2, timeout=SWITCH_TIMEOUT, cache=False) 
            except gaierror as ge:
                # print(traceback.format_exc())
                if LOG_DEBUG == True:
                    print("Snimpy Manager Exception: gaierror")

            metrics = {}
            try:
                for index in m.ifDescr:  
                    metrics[m.ifDescr[index]] = {}
                    if "thernet" in str(m.ifDescr[index]) and "unrouted" not in str(m.ifDescr[index]) and "Vlan" not in str(m.ifDescr[index]) and "Null" not in str(m.ifDescr[index]):
                        metrics[m.ifDescr[index]]['ifHCOutOctets'] = m.ifHCOutOctets[index]
                        metrics[m.ifDescr[index]]['ifHCInOctets'] = m.ifHCInOctets[index]
                                # print(index, repr(m.ifDescr[index]), repr(m.ifHCOutOctets[index]), repr(m.ifHCInOctets[index]))
            except SNMPNoSuchInstance as nie:
                # print(traceback.format_exc())
                print("SNMPNoSuchInstance for " + hostname + " " + ip + " " + comm)
            except SNMPException as snmpe:
                Switches.objects.filter(login=hostname, marque='switch').update(banned=1)
                print("SNMPException for " + hostname + " " + ip + " " + comm + " | List Count: " + str(list_count))
                # print(traceback.format_exc())
                pass
            except ValueConstraintError as e:
                print("ValueConstraintError for " +  hostname + " " + ip + " " + comm)
                # print(traceback.format_exc())
            lines= []
            for if_descr in list(metrics.keys()):
                for key in list(metrics[if_descr].keys()):
                    line = 'masac.switch.%s.%s.%s %d %d' % (hostname.replace(".","_"), if_descr.replace("/","_").replace(" ","_"), key, metrics[if_descr][key], timestamp)
                    lines.append(line)                  
            message = '\n'.join(lines) + '\n'
            if LOG_DEBUG == True:
                print("MESSAGE for " + hostname + " : " + message)
            sock.sendall(message.encode('UTF-8'))
        sock.close()
        time.sleep(DELAY)
    except Exception as e:
        print(traceback.format_exc())

Django DEBUG is set to False

hanynowsky commented 9 years ago

Hello @vincentbernat Any news about this issue?

vincentbernat commented 9 years ago

I think the valgrind output is misleading and the memory leak doesn't happen in the CFFI module. I have fixed a small memory leak in the CFFI module but I think this is not related to what you are saying. I still need to find some time to check the issue. I suppose that the issue could be tested by running the first snippet you provided into a loop and use some tool to find the memory leak.

vincentbernat commented 9 years ago

I have pushed a new commit that should fix the problem. For the record, here is the code I have used to "trace" memory leaks.

#!/usr/bin/python3
from snimpy.manager import Manager as M
from snimpy.manager import load
import tracemalloc
import gc

def sdsl_test(hostname, community):
    load('IF-MIB')
    load('SNMPv2-MIB')
    m = M(host=hostname, retries=0, community=community, version=2,
          timeout=1, cache=0)
    uptime = m.sysUpTime
    print(uptime)
    for ix in m.ifDescr:
        desc = m.ifDescr[ix]
        metric = m.ifHCInOctets[ix]
        print(desc + " " + str(metric))

tracemalloc.start()
sdsl_test('127.0.0.1', 'public')
gc.collect()
snapshot1 = tracemalloc.take_snapshot()
for i in range(1000):
    sdsl_test('127.0.0.1', 'public')
gc.collect()
snapshot2 = tracemalloc.take_snapshot()

top_stats = snapshot2.compare_to(snapshot1, 'lineno')
print("[ Top 10 differences ]")
for stat in top_stats[:10]:
    print(stat)

hanynowsky commented 9 years ago

Ok great. I am going to test again. I guess it's 0.8.6 now? Did it land under pip ?

vincentbernat commented 9 years ago

I didn't do a release, in case there is something else to fix. You can also just apply the patch manually. It is pretty simple: https://github.com/vincentbernat/snimpy/commit/cce05015ac2511b14592a276639445230f0c91ae

hanynowsky commented 9 years ago

Sounds like the magic patch reduced the memory leak from 100% to 1%. That's great. Thanks for your responsiveness and great work. masac

vincentbernat commented 9 years ago

Due to #35, I have reverted the fix. I have asked for help in the PySNMP mailing list and hopefully will come with a proper fix. Otherwise, I'll try to hack around.

vincentbernat commented 8 years ago

I forgot about this bug. The current situation with 0.8.8 is that only one command generator per thread will be used for SNMPv2. This is not the case with SNMPv3 as the command generator cannot be reused between sessions in this case. So, you shouldn't get a memory leak with 0.8.8 either in your case.

hanynowsky commented 8 years ago

That is excellent news. I am gonna put that to test in my next iteration. Thanks for the great job! Snimpy by the way is still in production in our company and it's rock solid!

vincentbernat / snimpy

Snimpy Memory Leak #33