saltstack / salt


Memory leak in salt master worker threads #32860

Closed ywahl closed 8 years ago

ywahl commented 8 years ago

Description of Issue/Question

We use an external pillar module written following https://docs.saltstack.com/en/latest/topics/development/external_pillars.html

We observe a memory leak that slowly consumes all the memory of the server running the master.

The code of the external pillar module is:

from __future__ import absolute_import
import os
import sys
import logging

log = logging.getLogger(__name__)

pil = {'root_private_pillar' : {'vms':{'master1' : 1}}}

def ext_pillar(minion_id, pillar, command):
    my_pillar = pil
    '''
    minion_id is hostname-xxxxxxxx random string of 33 characters
    '''
    m = minion_id[:len(minion_id) - 33]
    if m in my_pillar['root_private_pillar']['vms']:
        my_pillar['root_private_pillar']['my_self'] = my_pillar['root_private_pillar']['vms'][m]
    else:
        log.critical('%s not present in pillar' % m)
    return my_pillar
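
For illustration (not part of the original report), calling the module directly with a minion ID whose hostname matches an entry in pil shows the value it returns:

# Hypothetical direct call, assuming a minion ID of the form
# 'master1' plus a 33-character random suffix as described above:
minion_id = 'master1' + '-' + 'x' * 32
print(ext_pillar(minion_id, {}, 'getPillar.xml'))
# -> {'root_private_pillar': {'vms': {'master1': 1}, 'my_self': 1}}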

Setup

master config:

file_recv: True
interface: 0.0.0.0
interface: '::'
ipv6: True
worker_threads: 10
extension_modules: /srv/private_mods/
auto_accept: True
log_level: info
log_level_logfile: info
master_id: master1
ext_pillar:
  - private_pillar: getPillar.xml
rest_cherrypy:
  port: 8000
  host: '::'
  ssl_crt: /etc/pki/tls/certs/cert.pem
  ssl_key: /etc/pki/tls/certs/key.pem
external_auth:
  pam:
    saltuser:
      - .*
      - '@runner'
file_roots:
   base:
     - /srv/salt/

Steps to Reproduce Issue

To accelerate the memory consumption caused by the leak, run:

while true; do salt '*' saltutil.refresh_pillar; done
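
For reference, a roughly equivalent loop using Salt's Python client API (a sketch, assuming it is run on the master by a user that can read the master config and keys):

import salt.client

client = salt.client.LocalClient()
while True:
    # Ask every minion to refresh its pillar, same as the shell loop above.
    client.cmd('*', 'saltutil.refresh_pillar')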

Versions Report

salt --versions-report
           Salt: 2015.5.8
         Python: 2.6.6 (r266:84292, May 22 2015, 08:34:51)
         Jinja2: 2.2.1
       M2Crypto: 0.20.2
 msgpack-python: 0.4.6
   msgpack-pure: Not Installed
       pycrypto: 2.0.1
        libnacl: Not Installed
         PyYAML: 3.10
          ioflo: Not Installed
          PyZMQ: 14.5.0
           RAET: Not Installed
            ZMQ: 4.0.5
           Mako: Not Installed
        Tornado: Not Installed
        timelib: Not Installed
       dateutil: 1.4.1
jfindlay commented 8 years ago

@ywahl, thanks for reporting. I can confirm this with 2015.8.8.

configs

# cat /etc/salt/master
log_fmt_console: '%(colorlevel)s %(colormsg)s'
extension_modules: /srv/ext
ext_pillar:
  - private_pillar: getPillar.xml
# cat /etc/salt/minion
log_fmt_console: '%(colorlevel)s %(colormsg)s'
master: localhost
root@jmoney-debian-8:~# tree /srv
/srv
├── ext
│   └── pillar
│       ├── leak.py
│       └── leak.pyc
└── salt

3 directories, 2 files
# cat /srv/ext/pillar/leak.py
from __future__ import absolute_import
import os
import sys
import logging

log = logging.getLogger(__name__)

pil = {'root_private_pillar' : {'vms':{'master1' : 1}}}

def ext_pillar(minion_id, pillar, command):
    my_pillar = pil
    '''
    minion_id is hostname-xxxxxxxx random string of 33 characters
    '''
    m = minion_id[:len(minion_id) - 33]
    if m in my_pillar['root_private_pillar']['vms']:
        my_pillar['root_private_pillar']['my_self'] = my_pillar['root_private_pillar']['vms'][m]
    else:
        log.critical('%s not present in pillar' % m)
    return my_pillar

commands

# salt jmoney-debian-8 test.ping
jmoney-debian-8:
    True
# free -m
             total       used       free     shared    buffers     cached
Mem:          2010        830       1179          5         36        453
-/+ buffers/cache:        340       1670
Swap:            0          0          0
# for i in {1..101} ; do salt jmoney-debian-8 saltutil.refresh_pillar &> /dev/null ; done
# free -m
             total       used       free     shared    buffers     cached
Mem:          2010       1436        574          5         38        456
-/+ buffers/cache:        941       1069
Swap:            0          0          0
# pkill -9 salt-master
# free -m
             total       used       free     shared    buffers     cached
Mem:          2010        655       1354          5         38        456
-/+ buffers/cache:        160       1849
Swap:            0          0          0

versions

# salt --versions
Salt Version:
           Salt: 2015.8.8

Dependency Versions:
         Jinja2: 2.7.3
       M2Crypto: Not Installed
           Mako: Not Installed
         PyYAML: 3.11
          PyZMQ: 14.4.0
         Python: 2.7.9 (default, Mar  1 2015, 12:57:24)
           RAET: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.5
           cffi: 0.8.6
       cherrypy: 3.5.0
       dateutil: 2.2
          gitdb: 0.5.4
      gitpython: 0.3.2 RC1
          ioflo: Not Installed
        libgit2: Not Installed
        libnacl: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.2
   mysql-python: 1.2.3
      pycparser: 2.10
       pycrypto: 2.6.1
         pygit2: Not Installed
   python-gnupg: Not Installed
          smmap: 0.8.2
        timelib: Not Installed

System Versions:
           dist: debian 8.3
        machine: x86_64
        release: 3.16.0-4-amd64
         system: debian 8.3
ywahl commented 8 years ago

Thanks, let me know if I can be of any help.

Yaron

cachedout commented 8 years ago

I'm confused. The test case from @jfindlay is written incorrectly, since that configuration will never load the external pillar. Given the directory structure posted, the configuration should be:

ext_pillar:
  - leak: getPillar.xml
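
As a rough sketch of why the key matters (this only approximates the behaviour and is not Salt's actual loader code): the key under ext_pillar has to match the module filename under <extension_modules>/pillar/, because Salt resolves the key to that module's ext_pillar() function, roughly like this:

import importlib.util
import os

def resolve_ext_pillar(extension_modules, key):
    # Approximation: the configured key maps to <extension_modules>/pillar/<key>.py,
    # and the callable that gets invoked is that module's ext_pillar() function.
    path = os.path.join(extension_modules, 'pillar', key + '.py')
    spec = importlib.util.spec_from_file_location(key, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.ext_pillar

# With extension_modules: /srv/ext, the key 'leak' resolves to
# /srv/ext/pillar/leak.py; the key 'private_pillar' matches no module at all.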

@jfindlay Can you please re-do your test case with a valid configuration?

Now, there still may be a leak here with pillar refreshes. I'm going to try to reproduce this without an external pillar at all and see what I can get...

cachedout commented 8 years ago

I've spent the entire day trying to reproduce this and I haven't been able to find anything conclusive.

While it's true that the first several pillar requests to an MWorker do increase the amount of memory that's allocated, over time (i.e., after the first hundred requests or so) this seems to level out and does not continue to rise. I attribute this to the Python garbage collector needing time to hit its collection thresholds.

Forcing garbage collection after the pillar is generated on the master does force those unreachable objects to be collected immediately on each pillar generation, and it does prevent the initial memory increase. However, since I'm not yet convinced that this is a problem that continues to grow over time, that feels like too aggressive an approach.
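
For context, the collector's thresholds and per-generation counters can be inspected in any CPython interpreter (a general sketch, not Salt-specific):

import gc

print(gc.get_threshold())  # e.g. (700, 10, 10): allocations before gen 0/1/2 collections trigger
print(gc.get_count())      # objects currently tracked in each generation
print(gc.collect())        # force a full collection; returns the number of unreachable objects found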

So, what I'd like to do here is a couple of things.

1) Below is a patch to force collection on each pillar generation. @ywahl, I would like you to put this on your master and see if the problem continues for you.

diff --git a/salt/master.py b/salt/master.py
index 17e183e..769833c 100644
--- a/salt/master.py
+++ b/salt/master.py
@@ -1485,6 +1485,11 @@ class AESFuncs(object):
         if func == '_return':
             return ret
         if func == '_pillar' and 'id' in load:
+            import gc
+            log.info('Garbage collected {0} objects after pillar generation'.format(gc.collect()))
             if load.get('ver') != '2' and self.opts['pillar_version'] == 1:
                 # Authorized to return old pillar proto
                 return self.crypticle.dumps(ret)

2) I would also like QA to continue testing this issue on a long-running master, but in a way that allows us to actually see what's going on. By this I mean we need to limit the number of master workers to 1, issue a pillar refresh command periodically (say once every ten seconds), and graph the memory usage of that MWorker proc for a long while -- say a day or two (a rough monitoring sketch follows below). That should give us a much better idea of whether or not this is really leaking during pillar generation.
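
A rough sketch of such a monitoring loop, assuming psutil is installed and that the MWorker processes can be identified by name or command line (process titles depend on setproctitle being available):

import csv
import time

import psutil

INTERVAL = 10  # seconds between samples

with open('mworker_rss.csv', 'a') as fh:
    writer = csv.writer(fh)
    while True:
        now = int(time.time())
        for proc in psutil.process_iter():
            try:
                name = proc.name()
                cmdline = ' '.join(proc.cmdline())
                if 'MWorker' in name or 'MWorker' in cmdline:
                    # Record the resident set size so it can be graphed later.
                    writer.writerow([now, proc.pid, proc.memory_info().rss])
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                continue
        fh.flush()
        time.sleep(INTERVAL)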

For the time being, I am setting this issue to Cannot Reproduce until we can get a much better test case that we can actually engineer against.

Thanks for taking the time here to send this in, @ywahl . Much appreciated.

ywahl commented 8 years ago

I will try that! Thanks

cachedout commented 8 years ago

@ywahl Were you ever able to test this?

ywahl commented 8 years ago

Hi, in our full software we didn't notice any substantial difference in memory consumption. We found that a big part of the memory lost when generating pillars was due to an issue in the Cassandra driver for Python, which doesn't release memory when a connection is closed.

I haven't had the time to recreate a minimal environment without access to the DB and reproduce the issue with and without your fix.

I hope to be able to test that out soon and will let you know.

cachedout commented 8 years ago

Ahh. That would make a lot of sense. Yes, if you can pin this down to a problem with the cassandra returner then we can definitely isolate and fix this. Thanks!

meggiebot commented 8 years ago

@ywahl have you had a chance to test this?

ywahl commented 8 years ago

We finalized our research. The problem lies not with the returner (we don't use it), but with the Python Cassandra driver. See https://datastax-oss.atlassian.net/browse/PYTHON-482?jql=text%20~%20%22leak%22

After upgrading the Python Cassandra driver to 3.6.0, the problem seems to be solved.
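
For anyone hitting the same issue, a quick way to check the installed driver version (assuming the DataStax cassandra-driver package, which provides the cassandra module):

import cassandra

print(cassandra.__version__)  # the leak went away for us after upgrading to 3.6.0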

Thanks for all the assistance.

Should I close the bug?

cachedout commented 8 years ago

@ywahl Yes, please close the bug, and thank you very much for your help.