After running FLoM with literally more than 500,000 locks on the same and only resource, memory usage is 10 GB. I suspect there is a memory leak somewhere.
Difficult to say without more information. Please post the exact sequence of commands and/or any script you are using: I will try to reproduce it under the supervision of valgrind. You can also do your own analysis: sample commands are available in the script valgrind.cmd. Note: the environment variable G_SLICE is essential.
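Something along these lines (just a sketch, not necessarily the exact content of valgrind.cmd):
# GLib's slice allocator pools memory and hides allocations from valgrind;
# G_SLICE=always-malloc forces plain malloc so leaks are reported correctly.
export G_SLICE=always-malloc
valgrind --leak-check=full --trace-children=yes --show-leak-kinds=all flom -d -1 -- true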
FLoM runs as a daemon; it is the daemon that consumes memory.
I ran this command to start the daemon. The first report, from the parent process, shows no leaks (not attached):
export G_SLICE=always-malloc
valgrind --leak-check=full --trace-children=yes --show-leak-kinds=all flom -a ${myIPAddress} -p ${myPort} -d -1 -- true
After the daemon is stopped using flom -a ${myIPAddress} -p ${myPort} -x, an error report is produced. While the daemon was running, 750 processes (from different computers) were each hitting the daemon once a minute for 10 minutes.
Is this useful information for tracing the cause?
Just to add: each of the 750 tasks was executing this about every minute:
flom -a ${myIPAddress} -p ${myPort} -d 0 -o 500000 -l EX -r "alwaysTheSameString" -- command
The command took about 100 ms to complete.
Yes, that is useful information. This does not look good:
==184340== 1,080,138 bytes in 124,631 blocks are indirectly lost in loss record 138 of 139
==184340==    at 0x4C2928A: malloc (vg_replace_malloc.c:299)
==184340==    by 0x507FD44: g_malloc (in /lib64/libglib-2.0.so.0.2800.8)
==184340==    by 0x50992D8: g_strndup (in /lib64/libglib-2.0.so.0.2800.8)
==184340==    by 0x50993CA: g_strsplit (in /lib64/libglib-2.0.so.0.2800.8)
==184340==    by 0x5B74696: flom_resource_hier_waitings (flom_resource_hier.c:890)
==184340==    by 0x5B74B5B: flom_resource_hier_clean (flom_resource_hier.c:651)
==184340==    by 0x5B74F13: flom_resource_hier_inmsg (flom_resource_hier.c:461)
==184340==    by 0x5B6D776: flom_locker_loop_pollin (flom_locker.c:488)
==184340==    by 0x5B6DBC9: flom_locker_loop (flom_locker.c:276)
==184340==    by 0x50A03E3: ??? (in /lib64/libglib-2.0.so.0.2800.8)
==184340==    by 0x5354AA0: start_thread (in /lib64/libpthread-2.12.so)
==184340==    by 0x6327BCC: clone (in /lib64/libc-2.12.so)
It will require some time to produce a patch. I need one more piece of information: do you build from the GitHub source code or from the SourceForge tarball? It would be faster to produce an intermediate git commit than a new tarball, especially because I'm not sure I will catch the root cause on the first shot.
Our system administrator builds and installs FLoM. I see this version:
flom --version
FLoM: Free Lock Manager
Copyright (c) 2013-2018, Christian Ferrari; all rights reserved.
License: GPL (GNU Public License) version 2
Package name: flom; package version: 1.4.0; release date: 2018-05-10
TLS supported protocol(s): TLSv1
Access http://sourceforge.net/projects/flom/ for project community activities
It seems that if lock requests arrive sequentially, memory usage does not grow. If so, the cause must be in the queuing of requests.
Some of the information does not match: the leak stack trace goes through flom_resource_hier_waitings, which belongs to hierarchical resources, while the resource name you posted looks like a simple one.
With the following commands you can inspect what type of resource is really managed by the flom daemon:
export FLOM_TRACE_MASK=0x2400
flom -t /tmp/mytrace.txt -a 192.168.123.53 -p 2345 -d -1 -- true
The file /tmp/mytrace.txt will contain the trace produced by the flom daemon. Please check whether the function "flom_resource_simple_waitings" or the function "flom_resource_hier_waitings" is called.
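A quick way to check, assuming the trace was written to /tmp/mytrace.txt as above:
# Count how many times each "waitings" function shows up in the daemon trace.
grep -c 'flom_resource_simple_waitings' /tmp/mytrace.txt
grep -c 'flom_resource_hier_waitings' /tmp/mytrace.txt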
I see both. Example:
2019-11-08 14:04:13.608035 [280442/0x11dd0c0] flom_resource_simple_waitings
2019-11-08 14:04:13.608056 [280442/0x11dd0c0] flom_resource_simple_waitings/excp=5/ret_cod=0/errno=0
2019-11-08 14:04:15.507667 [280442/0x11e2b20] flom_resource_hier_waitings
2019-11-08 14:04:15.507682 [280442/0x11e2b20] flom_resource_hier_waitings: NULL cl for i=0
2019-11-08 14:04:15.507696 [280442/0x11e2b20] flom_resource_hier_waitings/excp=6/ret_cod=0/errno=0
My resource is a file, including its full path. I see these:
2019-11-08 14:04:15.354509 [280442/0x11c7860] flom_resource_hier_init: level 0 is 'gs'
2019-11-08 14:04:15.354529 [280442/0x11c7860] flom_resource_hier_init: level 1 is 'gsfs0'
2019-11-08 14:04:15.354545 [280442/0x11c7860] flom_resource_hier_init: level 2 is 'users'
Does the presence of "/" cause issues? What does it mean?
I see there is special treatment of "/" (and other characters):
https://sourceforge.net/p/flom/wiki/Resource/
I removed these special characters and am testing memory as before, using valgrind.
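The removal was roughly like this (a sketch, not the literal command I ran; ${myLockFile} stands for the full path of the resource):
# Strip the characters FLoM treats specially ("/", "." and "_") so that the
# resulting name is classified as a simple resource rather than a hierarchical one.
RES=$(echo "${myLockFile}" | tr -d '/._')
flom -a ${myIPAddress} -p ${myPort} -d 0 -o 500000 -l EX -r "${RES}" -- command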
The trace file (for the inadvertently hierarchical resource) has this message:
2019-11-08 14:04:15.508175 [280442/0x11e2b20] flom_resource_hier_clean: unable to locate node for resource level 'gsfs0', it was probably already cleaned-up
Maybe this helps with debugging.
No leaks with a simple resource.
Hierarchical resources make sense if you leverage the feature; otherwise they just add management complexity. Anyway, please post the full and exact name of the hierarchical resource you used that generated the leaks: I will perform some more investigation.
The full hierarchical resource was
/gs/gsfs0/users/xxxxxxxx/PIPELINE_SCRATCH/QUEMANAGER/agentDatabase.mysqlite
Here I replaced 8 sensitive alphabetic characters with xxxxxxxx. The name contains two FLoM special symbols, "_" and ".", in addition to the delimiter "/".
I understand it's sensitive. If any of the masked characters were "_" or ".", please put those back and keep the "x" only for the letters. Under the covers there are some complex regular expressions: I must be sure to reproduce the exact situation.
The 8 x's were 8 lowercase letters of the Latin alphabet. The rest is exactly as it was. I do not think this replacement changes the regular expression matching. Or could it?
Got it. I asked to be sure there were no special characters inside the "x"s.
No special symbols behind the x's. I have now read carefully how the resource string is interpreted and I understand the role of the special symbols.
The good point is that the resource is classified as hierarchical and the regular expression works as expected:
2019-11-08 22:44:18.107533 [1674/0x1952930] flom_resource_hier_init: initializing resource ('/gs/gsfs0/users/xxxxxxxx/PIPELINE_SCRATCH/QUEMANAGER/agentDatabase.mysqlite')
2019-11-08 22:44:18.107547 [1674/0x1952930] flom_resource_hier_init: level 0 is 'gs'
2019-11-08 22:44:18.107552 [1674/0x1952930] flom_resource_hier_init: level 1 is 'gsfs0'
2019-11-08 22:44:18.107556 [1674/0x1952930] flom_resource_hier_init: level 2 is 'users'
2019-11-08 22:44:18.107560 [1674/0x1952930] flom_resource_hier_init: level 3 is 'xxxxxxxx'
2019-11-08 22:44:18.107565 [1674/0x1952930] flom_resource_hier_init: level 4 is 'PIPELINE_SCRATCH'
2019-11-08 22:44:18.107571 [1674/0x1952930] flom_resource_hier_init: level 5 is 'QUEMANAGER'
2019-11-08 22:44:18.107576 [1674/0x1952930] flom_resource_hier_init: level 6 is 'agentDatabase.mysqlite'
The bad point is that the leak does not happen with serial usage, as you documented above. The very bad point is that valgrind detected a leak on a temporary object that should be destroyed by the calling function 100% of the time.
It will take some time to set up a proper test bed, and even more to catch the faulty statement.
The issue can be reproduced quite easily. In one terminal, start a daemon using the parameters you reported:
valgrind --leak-check=full --trace-children=yes /usr/local/bin/flom -d -1 -- true
In another terminal, execute a shell script like the following one:
COUNTER=0
while [ $COUNTER -lt 10 ]; do
    echo The counter is $COUNTER
    let COUNTER=COUNTER+1
    nohup /usr/local/bin/flom -d 0 -o 500000 -l EX -r '/gs/gsfs0/users/xxxxxxxx/PIPELINE_SCRATCH/QUEMANAGER/agentDatabase.mysqlite' -- sleep 1 &
done
Wait for all the clients to terminate, then stop the daemon with
flom -x
It seems that 10 parallel clients are enough to trigger the memory leak you reported:
==1739== 5,004 (2,304 direct, 2,700 indirect) bytes in 36 blocks are definitely lost in loss record 158 of 165
==1739==    at 0x4C274A8: malloc (vg_replace_malloc.c:236)
==1739==    by 0x5079864: g_malloc (in /lib/libglib-2.0.so.0.2400.1)
==1739==    by 0x5092DA4: g_strsplit (in /lib/libglib-2.0.so.0.2400.1)
==1739==    by 0x5AE4DC6: flom_resource_hier_waitings (flom_resource_hier.c:890)
==1739==    by 0x5AE52A0: flom_resource_hier_clean (flom_resource_hier.c:651)
==1739==    by 0x5AE565B: flom_resource_hier_inmsg (flom_resource_hier.c:461)
==1739==    by 0x5ADDC43: flom_locker_loop_pollin (flom_locker.c:488)
==1739==    by 0x5ADE211: flom_locker_loop (flom_locker.c:276)
==1739==    by 0x5099C93: ??? (in /lib/libglib-2.0.so.0.2400.1)
==1739==    by 0x53169C9: start_thread (pthread_create.c:300)
==1739==    by 0x87A46FF: ???
Now I have to investigate why it happens.
Thank you for looking into this.
Commit d17147c fixes the bug. It will take some days to complete the certification process before releasing 1.4.1. If you have a chance to build from source, you can give it a try.
Unfortunately, some issues emerged while running the non-regression tests. They are probably not connected with the fix, but I have to understand why before releasing an official version. Stay tuned.
Release 1.4.1 is available: please test it to confirm it solves the memory leak you detected.
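For example, re-running the reproduction above against 1.4.1 should no longer show loss records pointing at flom_resource_hier_waitings; a sketch (the installation path and log file names are only illustrative):
# Confirm the installed version is 1.4.1 first.
/usr/local/bin/flom --version
# Start the daemon under valgrind, one log file per process because of --trace-children.
export G_SLICE=always-malloc
valgrind --leak-check=full --trace-children=yes --log-file=vg-flom.%p.log /usr/local/bin/flom -d -1 -- true
# ...run the 10 parallel clients from the script above, then stop the daemon with "flom -x"...
# No log file should mention the hierarchical-resource code path any more.
grep -l 'flom_resource_hier_waitings' vg-flom.*.log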
No feedback, closing.