Closed peterverraedt closed 1 year ago
Thanks for the report. We'll look into it.
It looks like this is over ~24 hrs. During this period, are the transfers roughly the same size or are they totally different?
Do you see memory go up on every transfer?
Our impression is that, for each individual transfer, the memory footprint of the corresponding process increases steadily over time, and the memory is freed again once the transfer has been handled. An individual transfer that completes before memory is full will therefore succeed, but if multiple transfers run at the same time, there is less memory available per transfer and it is more likely that one of them is killed.
We have ongoing transfers that are constantly syncing files, so the transfers are all roughly the same and consist of big files.
We'll do some tests to have detailed footprints of specific transfer scenarios.
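One way to capture such a footprint is to sample the resident set size of a single transfer process at a fixed interval. The following is a minimal sketch, assuming a Linux server where `/proc/<pid>/status` is readable. The function names, the sampling interval, and the sample count are illustrative and not part of the connector.

```python
import re
import time
from pathlib import Path


def parse_vm_rss_kib(status_text):
    """Return the VmRSS value (in kiB) from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def sample_rss(pid, interval_s=5.0, samples=10):
    """Yield (elapsed_seconds, rss_kib) pairs; stops early if the process exits."""
    start = time.monotonic()
    for _ in range(samples):
        try:
            text = Path(f"/proc/{pid}/status").read_text()
        except (FileNotFoundError, ProcessLookupError):
            return  # the process has exited
        rss = parse_vm_rss_kib(text)
        if rss is not None:
            yield time.monotonic() - start, rss
        time.sleep(interval_s)
```

Logging the pairs from `sample_rss(pid)` for one transfer process would show whether its memory grows steadily during the transfer and drops once the transfer ends.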
I ran some tests of the irods globus connector with 2 data sets. The first one is 1.7TB, consisting of 7226 files, and the second one is 170GB, consisting of 58 files. With the second data set I could trigger the memory error more easily, since its files are all large, ranging from 1.7GB to 6GB, so gridftp ran out of memory rapidly. The easiest way to trigger errors is to increase the setting "network use" to "aggressive" on an endpoint; this will quickly fill up all available memory (16GB) on the server:
A pmap dump of a process using up all of the memory: 3409651.log
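A dump like this can be summarized by adding up resident memory per mapping, which shows whether the growth sits in anonymous (heap) mappings or in a particular library. The sketch below assumes the log was produced with `pmap -x <pid>` (columns Address, Kbytes, RSS, Dirty, Mode, Mapping); the thread does not say which pmap options were used, and the function name is illustrative.

```python
from collections import Counter


def rss_by_mapping(pmap_text):
    """Sum the RSS column (kiB) per mapping name from `pmap -x` output."""
    totals = Counter()
    for line in pmap_text.splitlines():
        parts = line.split()
        if len(parts) < 6:
            continue  # pid line, separator, and total line have fewer fields
        try:
            int(parts[0], 16)       # address column must be hexadecimal
            rss_kib = int(parts[2])  # RSS column
        except ValueError:
            continue  # column header or unrecognized line
        # Mapping names may contain spaces, e.g. "[ anon ]".
        totals[" ".join(parts[5:])] += rss_kib
    return totals
```

Calling `rss_by_mapping(text).most_common(5)` on the contents of the log lists the five mappings holding the most resident memory.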
I'm running some tests now and looking into it.
I found a couple of memory leaks. It doesn't have anything to do with the buffers used for transfer. Those are all freed as expected.
There are a couple of data structures that are created in the iRODS calls that needed to be freed. Once I did that I verified that memory no longer grows after multiple transfers.
We'll get a fix out ASAP.
Excellent.
@JustinKyleJames - I added checkboxes and checked 4-3-0-stable. Do we plan on cherry-picking this to 4-2-stable and making another release? If not, let me know and I will remove the checkbox for 4-2-stable. In any case, please cherry-pick to main. Thanks!
@JustinKyleJames - Please close if complete
Closing
[x] 4-2-stable
We suspect a memory leak in the irods globus connector. On a server that exclusively handles globus transfers through the irods connector, we see the following graph of memory usage:
We use the latest available version of the globus connect server and the 4-3-0-stable branch of this repository.