tiian / flom

FLoM: Easy to Use Distributed Lock Manager; a tool to synchronize the execution of commands, shell scripts and custom programs.
http://www.tiian.org/flom/
GNU General Public License v2.0
27 stars 6 forks source link

memory leak? #10

Closed moskvich412 closed 4 years ago

moskvich412 commented 5 years ago

After running FLoM with literally more than 500,000 locks on the same and only resource, memory usage is 10G. I suspect there is a memory leak somewhere.

tiian commented 5 years ago

Difficult to say without proper information. Please post the exact sequence of commands you use and/or any script you are using: I will try to reproduce it under the supervision of valgrind. You can do even your own analysis: sample commands are available in script valgrind.cmd Note: environment variable G_SLICE is essential.

moskvich412 commented 4 years ago

FLoM runs as daemon, It is daemon that consumes memory.

I ran this command to start daemon. At first report from the parent shows no leaks (not attached).

export G_SLICE=always-malloc valgrind --leak-check=full --trace-children=yes --show-leak-kinds=all flom -a ${myIPAddress} -p ${myPort} -d -1 -- true

After daemon is stopped using flom -a ${myIPAddress} -p ${myPort} -x

an error report is produced. While the daemon was running, 750 processes (from different computers) were inundating the daemon once a minute for 10 minutes.

memoryErrorReport.txt

Is this a useful information to trace the cause?

moskvich412 commented 4 years ago

Just to add, the 750 tasks were executing this about every minute:

flom -a ${myIPAddress} -p ${myPort} -d 0 -o 500000 -l EX -r "alwaysTheSameString" -- command

The command took about 100msec to complete.

tiian commented 4 years ago

Yes, there are useful information. This does not sound good:

==184340== 1,080,138 bytes in 124,631 blocks are indirectly lost in loss record 138 of 139 ==184340== at 0x4C2928A: malloc (vg_replace_malloc.c:299) ==184340== by 0x507FD44: g_malloc (in /lib64/libglib-2.0.so.0.2800.8) ==184340== by 0x50992D8: g_strndup (in /lib64/libglib-2.0.so.0.2800.8) ==184340== by 0x50993CA: g_strsplit (in /lib64/libglib-2.0.so.0.2800.8) ==184340== by 0x5B74696: flom_resource_hier_waitings (flom_resource_hier.c:890) ==184340== by 0x5B74B5B: flom_resource_hier_clean (flom_resource_hier.c:651) ==184340== by 0x5B74F13: flom_resource_hier_inmsg (flom_resource_hier.c:461) ==184340== by 0x5B6D776: flom_locker_loop_pollin (flom_locker.c:488) ==184340== by 0x5B6DBC9: flom_locker_loop (flom_locker.c:276) ==184340== by 0x50A03E3: ??? (in /lib64/libglib-2.0.so.0.2800.8) ==184340== by 0x5354AA0: start_thread (in /lib64/libpthread-2.12.so) ==184340== by 0x6327BCC: clone (in /lib64/libc-2.12.so)

It will require some time to produce a patch. I need a further information: do you build from github source code or from sourceforge tarball? It will be faster to produce an intermediate git commit than producing a new tarball, even because I'm not sure to catch the root cause at the first shot.

moskvich412 commented 4 years ago

Our system administrator builds and installs. I see this flom version:

flom --version FLoM: Free Lock Manager Copyright (c) 2013-2018, Christian Ferrari; all rights reserved. License: GPL (GNU Public License) version 2 Package name: flom; package version: 1.4.0; release date: 2018-05-10 TLS supported protocol(s): TLSv1 Access http://sourceforge.net/projects/flom/ for project community activities

moskvich412 commented 4 years ago

it seems that if lock requests come sequentially, then memory usage does not grow. If so, cause must be in the queueing of requests.

tiian commented 4 years ago

Some information do not match:

With the following commands you can inspect what type of resource is really managed by the flom daemon:

export FLOM_TRACE_MASK=0x2400 flom -t /tmp/mytrace.txt -a 192.168.123.53 -p 2345 -d -1 -- true

file /tmp/mytrace.txt will contain the trace produced by flom daemon. Please check if function "flom_resource_simple_waitings" is called or function "flom_resource_hier_waitings" is called.

moskvich412 commented 4 years ago

I see both. Example:

2019-11-08 14:04:13.608035 [280442/0x11dd0c0] flom_resource_simple_waitings 2019-11-08 14:04:13.608056 [280442/0x11dd0c0] flom_resource_simple_waitings/excp=5/ret_cod=0/errno=0

2019-11-08 14:04:15.507667 [280442/0x11e2b20] flom_resource_hier_waitings 2019-11-08 14:04:15.507682 [280442/0x11e2b20] flom_resource_hier_waitings: NULL cl for i=0 2019-11-08 14:04:15.507696 [280442/0x11e2b20] flom_resource_hier_waitings/excp=6/ret_cod=0/errno=0

moskvich412 commented 4 years ago

My resources is a file, including full path. I see these:

2019-11-08 14:04:15.354509 [280442/0x11c7860] flom_resource_hier_init: level 0 is 'gs' 2019-11-08 14:04:15.354529 [280442/0x11c7860] flom_resource_hier_init: level 1 is 'gsfs0' 2019-11-08 14:04:15.354545 [280442/0x11c7860] flom_resource_hier_init: level 2 is 'users'

Is the presence of "/" causes issues? What does it mean?

moskvich412 commented 4 years ago

I see about special treatment of "/" (and others):

https://sourceforge.net/p/flom/wiki/Resource/

I removed these special chars and testing memory as before, using valgrind

moskvich412 commented 4 years ago

The trace file (for inadvertently hierarchical resource) has this message:

2019-11-08 14:04:15.508175 [280442/0x11e2b20] flom_resource_hier_clean: unable to locate node for resource level 'gsfs0', it was probably already cleaned-up

maybe this help to debug

moskvich412 commented 4 years ago

No leaks with simple resource.

tiian commented 4 years ago

Hierarchical resources make sense if you leverage the feature, else they just add complexity to manage. Anyway, please post the full and exact name of the hierarchical resource you used and that generated leaks: I will perform some more investigation.

moskvich412 commented 4 years ago

The full hierarchical resource was

/gs/gsfs0/users/xxxxxxxx/PIPELINE_SCRATCH/QUEMANAGER/agentDatabase.mysqlite

Here I replaced sensitive 8 alpha characters with xxxxxxxx. It has two FLoM special symbols "_" and "." in addition to the delimiter "/".

moskvich412 commented 4 years ago

The full hierarchical resource was

/gs/gsfs0/users/xxxxxxxx/PIPELINE_SCRATCH/QUEMANAGER/agentDatabase.mysqlite

Here I replaced sensitive 8 alpha characters with xxxxxxxx. It has two FLoM special symbols "_" and "." in addition to the delimiter "/".

tiian commented 4 years ago

I understand it's sensitive. Please replace the "x" with the "_" and "." Under the covers there are some complex regular expressions: I must be sure to reproduce the exact situation.

moskvich412 commented 4 years ago

The 8 x's were 8 lower case from the latin alphabet. The rest is exactly as it was. I do not think this replacement changes regular expression. Or could it?

tiian commented 4 years ago

Got it. I asked to be sure there were no special characters inside the "x"s.

moskvich412 commented 4 years ago

No special symbols behind x's. Now, I read "carefully" how resource string is interpreted and understand role of special symbols.

tiian commented 4 years ago

The good point is that the resource is classified as hierarchical and the regular expression works as expected:

2019-11-08 22:44:18.107533 [1674/0x1952930] flom_resource_hier_init: initializing resource ('/gs/gsfs0/users/xxxxxxxx/PIPELINE_SCRATCH/QUEMANAGER/agentDatabase.mysqlite') 2019-11-08 22:44:18.107547 [1674/0x1952930] flom_resource_hier_init: level 0 is 'gs' 2019-11-08 22:44:18.107552 [1674/0x1952930] flom_resource_hier_init: level 1 is 'gsfs0' 2019-11-08 22:44:18.107556 [1674/0x1952930] flom_resource_hier_init: level 2 is 'users' 2019-11-08 22:44:18.107560 [1674/0x1952930] flom_resource_hier_init: level 3 is 'xxxxxxxx' 2019-11-08 22:44:18.107565 [1674/0x1952930] flom_resource_hier_init: level 4 is 'PIPELINE_SCRATCH' 2019-11-08 22:44:18.107571 [1674/0x1952930] flom_resource_hier_init: level 5 is 'QUEMANAGER' 2019-11-08 22:44:18.107576 [1674/0x1952930] flom_resource_hier_init: level 6 is 'agentDatabase.mysqlite'

The bad point is that the leak does not happen with serial usage, as you documented above. The very bad point is that valgrind detected a leak on a temporary object that should be destroyed by the calling function 100% of the time.

It will take some time to set-up a proper test bed and even more to catch the wrong statement.

tiian commented 4 years ago

The issue can be reproduced quite easily. In one terminal, start a daemon using the parameters you reported:

valgrind --leak-check=full --trace-children=yes /usr/local/bin/flom -d -1 -- true

in another terminal execute a shell script like the following one:

!/bin/bash

Try to reproduce issue 10 memory leak

COUNTER=0 while [ $COUNTER -lt 10 ]; do echo The counter is $COUNTER let COUNTER=COUNTER+1 nohup /usr/local/bin/flom -d 0 -o 500000 -l EX -r '/gs/gsfs0/users/xxxxxxxx/PIPELINE_SCRATCH/QUEMANAGER/agentDatabase.mysqlite' -- sleep 1 & done

wait the termination of all the clients and then stop the daemon with

flom -x

It seems that 10 parallel clients are enough to exploit the memory leak you reported:

==1739== 5,004 (2,304 direct, 2,700 indirect) bytes in 36 blocks are definitely lost in loss record 158 of 165 ==1739== at 0x4C274A8: malloc (vg_replace_malloc.c:236) ==1739== by 0x5079864: g_malloc (in /lib/libglib-2.0.so.0.2400.1) ==1739== by 0x5092DA4: g_strsplit (in /lib/libglib-2.0.so.0.2400.1) ==1739== by 0x5AE4DC6: flom_resource_hier_waitings (flom_resource_hier.c:890) ==1739== by 0x5AE52A0: flom_resource_hier_clean (flom_resource_hier.c:651) ==1739== by 0x5AE565B: flom_resource_hier_inmsg (flom_resource_hier.c:461) ==1739== by 0x5ADDC43: flom_locker_loop_pollin (flom_locker.c:488) ==1739== by 0x5ADE211: flom_locker_loop (flom_locker.c:276) ==1739== by 0x5099C93: ??? (in /lib/libglib-2.0.so.0.2400.1) ==1739== by 0x53169C9: start_thread (pthread_create.c:300) ==1739== by 0x87A46FF: ???

now I have to investigate why it happens.

moskvich412 commented 4 years ago

Thank you for looking into this.

moskvich412 commented 4 years ago

Thank you for looking into this.

tiian commented 4 years ago

Commit d17147c fixes the bug. It will take some days to complete the certification process before releasing 1.4.1 If you have a chance to build from source, you can give it a chance.

tiian commented 4 years ago

Unfortunately some issues emerged running the non regression tests. They are probably not connected with the fix, but I have to understand why before releasing an official version. Stay tuned.

tiian commented 4 years ago

Release 1.4.1 is available: please test it to confirm it solves the memory leak you detected.

tiian commented 4 years ago

No feedback, closing.