Closed by ywahl 8 years ago
@ywahl, thanks for reporting. I can confirm this with 2015.8.8.
# cat /etc/salt/master
log_fmt_console: '%(colorlevel)s %(colormsg)s'
extension_modules: /srv/ext
ext_pillar:
- private_pillar: getPillar.xml
# cat /etc/salt/minion
log_fmt_console: '%(colorlevel)s %(colormsg)s'
master: localhost
root@jmoney-debian-8:~# tree /srv
/srv
├── ext
│ └── pillar
│ ├── leak.py
│ └── leak.pyc
└── salt
3 directories, 2 files
# cat /srv/ext/pillar/leak.py
from __future__ import absolute_import
import os
import sys
import logging

log = logging.getLogger(__name__)

pil = {'root_private_pillar': {'vms': {'master1': 1}}}


def ext_pillar(minion_id, pillar, command):
    '''
    minion_id is hostname-xxxxxxxx, a random string of 33 characters
    '''
    my_pillar = pil
    m = minion_id[:len(minion_id) - 33]
    if m in my_pillar['root_private_pillar']['vms']:
        my_pillar['root_private_pillar']['my_self'] = my_pillar['root_private_pillar']['vms'][m]
    else:
        log.critical('%s not present in pillar' % m)
    return my_pillar
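Incidentally, this module returns the module-level `pil` dict itself, so the `my_self` key written for one minion persists into every later call on the same worker. If per-call isolation is wanted, returning a deep copy avoids mutating the shared template. A minimal sketch of that variant (my rewrite for illustration, not the reporter's code):

```python
import copy
import logging

log = logging.getLogger(__name__)

# Shared template; never mutated directly by ext_pillar below.
PIL = {'root_private_pillar': {'vms': {'master1': 1}}}


def ext_pillar(minion_id, pillar, command):
    # Deep-copy the template so per-minion keys such as 'my_self'
    # never accumulate on the module-level dict between calls.
    my_pillar = copy.deepcopy(PIL)
    m = minion_id[:len(minion_id) - 33]
    if m in my_pillar['root_private_pillar']['vms']:
        my_pillar['root_private_pillar']['my_self'] = \
            my_pillar['root_private_pillar']['vms'][m]
    else:
        log.critical('%s not present in pillar', m)
    return my_pillar
```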
# salt jmoney-debian-8 test.ping
jmoney-debian-8:
True
# free -m
total used free shared buffers cached
Mem: 2010 830 1179 5 36 453
-/+ buffers/cache: 340 1670
Swap: 0 0 0
# for i in {1..101} ; do salt jmoney-debian-8 saltutil.refresh_pillar &> /dev/null ; done
# free -m
total used free shared buffers cached
Mem: 2010 1436 574 5 38 456
-/+ buffers/cache: 941 1069
Swap: 0 0 0
# pkill -9 salt-master
# free -m
total used free shared buffers cached
Mem: 2010 655 1354 5 38 456
-/+ buffers/cache: 160 1849
Swap: 0 0 0
# salt --versions
Salt Version:
Salt: 2015.8.8
Dependency Versions:
Jinja2: 2.7.3
M2Crypto: Not Installed
Mako: Not Installed
PyYAML: 3.11
PyZMQ: 14.4.0
Python: 2.7.9 (default, Mar 1 2015, 12:57:24)
RAET: Not Installed
Tornado: 4.2.1
ZMQ: 4.0.5
cffi: 0.8.6
cherrypy: 3.5.0
dateutil: 2.2
gitdb: 0.5.4
gitpython: 0.3.2 RC1
ioflo: Not Installed
libgit2: Not Installed
libnacl: Not Installed
msgpack-pure: Not Installed
msgpack-python: 0.4.2
mysql-python: 1.2.3
pycparser: 2.10
pycrypto: 2.6.1
pygit2: Not Installed
python-gnupg: Not Installed
smmap: 0.8.2
timelib: Not Installed
System Versions:
dist: debian 8.3
machine: x86_64
release: 3.16.0-4-amd64
system: debian 8.3
Thanks, let me know if I can be of any help. Yaron
I'm confused. The test case that @jfindlay wrote is incorrect, since that configuration will never load an external pillar. Given the directory structure posted, the configuration should be:
ext_pillar:
- leak: getPillar.xml
@jfindlay Can you please re-do your test case with a valid configuration?
Now, there still may be a leak here with pillar refreshes. I'm going to try to reproduce this without an external pillar at all and see what I can get...
I've spent the entire day trying to reproduce this and I haven't been able to find anything conclusive.
While it's true that the first several pillar requests to an MWorker increase the amount of memory allocated, over time (i.e., after the first hundred requests or so) this seems to level out and does not continue to rise. I attribute this to the Python garbage collector needing time to hit its collection thresholds.
Forcing garbage collection after the pillar is generated on the master does force those unreachable objects to be collected immediately on each pillar generation, and does prevent the initial memory increase. However, since I'm not yet convinced this is a problem that continues over time, that feels like an approach that's too aggressive.
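For reference, the collector behavior described above can be observed directly. This sketch (plain CPython, nothing Salt-specific) creates reference cycles that refcounting alone cannot free and shows `gc.collect()` reporting them:

```python
import gc

# The per-generation thresholds that govern when collection runs
# (commonly (700, 10, 10) in CPython).
print(gc.get_threshold())


class Node(object):
    def __init__(self):
        # A self-reference forms a cycle, so the refcount
        # never drops to zero on its own.
        self.ref = self


cycles = [Node() for _ in range(1000)]
del cycles

# gc.collect() returns the number of unreachable objects it found;
# here that includes the 1000 Node instances plus their __dict__s.
print(gc.collect())
```

Until the thresholds are hit, those unreachable objects sit in memory, which is consistent with the early rise that later levels out.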
So, what I'd like to do here is a couple of things.
1) Below is a patch to force collection on each pillar generation. @ywahl I would like to see if you would put this on your master and see if the problem continues for you.
diff --git a/salt/master.py b/salt/master.py
index 17e183e..769833c 100644
--- a/salt/master.py
+++ b/salt/master.py
@@ -1485,6 +1485,11 @@ class AESFuncs(object):
         if func == '_return':
             return ret
         if func == '_pillar' and 'id' in load:
+            import gc
+            log.info('Garbage collected {0} objects after pillar generation'.format(gc.collect()))
             if load.get('ver') != '2' and self.opts['pillar_version'] == 1:
                 # Authorized to return old pillar proto
                 return self.crypticle.dumps(ret)
2) I would also like QA to continue testing this issue on a long-running master but in a way that allows us to actually see what's going on. By this, I mean we need to limit the number of master workers to 1 and we need to issue a pillar refresh command periodically (say once every ten seconds) and graph the memory usage of that mworker proc for a long while -- say a day or two. That should give us a much better idea of whether or not this is really leaking during pillar generation.
For the time being, I am setting this issue to Cannot Reproduce until we can get a much better test case that we can actually engineer against.
Thanks for taking the time here to send this in, @ywahl . Much appreciated.
I will try that! Thanks
@ywahl Were you ever able to test this?
Hi, in our full software we didn't notice any substantial difference in memory consumption. We found that a big part of the memory lost when generating pillars was due to issues in the Cassandra driver for Python, which doesn't release memory when a connection is closed.
I haven't had the time to recreate a minimal environment without access to the DB and reproduce the issue with and without your fix.
I hope to be able to test that out soon and let you know.
Ahh. That would make a lot of sense. Yes, if you can pin this down to a problem with the cassandra returner then we can definitely isolate and fix this. Thanks!
@ywahl have you had a chance to test this?
We finalized our research. The problem lies not with the returner (we don't use it), but with the Python Cassandra driver. See https://datastax-oss.atlassian.net/browse/PYTHON-482?jql=text%20~%20%22leak%22
After upgrading the python cassandra driver to 3.6.0 the problem seems to be solved.
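For anyone hitting the same thing, a quick way to decide whether an installed driver predates the 3.6.0 fix is a plain numeric version comparison (hypothetical helper, not part of Salt or the driver; call it with e.g. `cassandra.__version__`; it only handles plain dotted numeric releases, not pre-release suffixes):

```python
def version_tuple(version):
    """Parse a dotted release string such as '3.6.0' into a comparable tuple."""
    return tuple(int(part) for part in version.split('.'))


def driver_has_leak_fix(installed_version, fixed_in='3.6.0'):
    # Pad the shorter tuple with zeros so '3.6' compares equal to '3.6.0'.
    a, b = version_tuple(installed_version), version_tuple(fixed_in)
    width = max(len(a), len(b))
    pad = lambda t: t + (0,) * (width - len(t))
    return pad(a) >= pad(b)
```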
Thanks for all the assistance.
Should I close the bug?
@yashlyft Yes, please close the bug and thank you very much for your help.
Description of Issue/Question
Using an external pillar module following https://docs.saltstack.com/en/latest/topics/development/external_pillars.html,
we observe a memory leak that slowly consumes all the memory of the server running the master.
The code of the external pillar module is shown above.
Setup
master config:
Steps to Reproduce Issue
To accelerate the memory leak, run the saltutil.refresh_pillar loop shown above.
Versions Report