denis-tingaikin opened 2 months ago
We tested with custom clients and endpoints that used SDK v1.13.2, where we observed the behavior described above.
Once we tried to let the nsmgrs reach the memory limit and be restarted. Surprisingly, after the nsmgr processes were OOMKilled they came up properly and did not show any significant increase in memory consumption. In this picture the restarts happened at 0:00 and 05:00.
After that we also tried to restart nsmgr manually to see if it had any effect on the increasing memory consumption. We restarted the nsmgrs in different ways in the different tests to see if the restart method matters, but we haven't seen any difference between killing the process, deleting the pod, etc.; it always flattened the memory diagram.
Here I add some more pictures we took during our tests.
Each test showed the same behavior. For some reason the restart of nsmgr completely eliminated the memory increase.
This last one shows the test where we let the system continue to run after the restart to see if the memory starts growing again some time later, but nothing suspicious happened for more than 11 hours:
Do you have any idea what could cause this behavior?
My guess is that the problem is in the diff in the nsc from v1.11.2 to main. Most likely, monitor streams are not being closed. Currently I am working on a quick reproduction; I will inform you when we have more information.
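For reference, here is a minimal sketch of what such a non-closing monitor stream could look like on the client side. It assumes the MonitorConnections API generated from networkservicemesh/api; the function and variable names are hypothetical and not taken from any actual NSC code:

```go
package monitor

import (
	"context"

	"github.com/networkservicemesh/api/pkg/api/networkservice"
	"google.golang.org/grpc"
)

// startMonitor mimics the suspected leak: each call opens a MonitorConnections
// stream bound to the long-lived main context. Nothing ever cancels that
// context except process shutdown, so if the caller opens a replacement stream
// (for example after a reconnect), the previous one keeps holding resources in
// both the client and nsmgr.
func startMonitor(mainCtx context.Context, cc *grpc.ClientConn) error {
	monitorClient := networkservice.NewMonitorConnectionClient(cc)

	// The stream inherits mainCtx; there is no per-stream cancel function.
	stream, err := monitorClient.MonitorConnections(mainCtx, &networkservice.MonitorScopeSelector{})
	if err != nil {
		return err
	}

	go func() {
		for {
			event, err := stream.Recv()
			if err != nil {
				return
			}
			_ = event // handle the event
		}
	}()
	return nil
}
```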
Would that explain why the memory graph flattens after restarting nsmgr?
Just a small addition: it seems that when we stop the test and delete the respective test namespace, nsmgr gives back the memory it leaked, which was not the case previously.
Hi @denis-tingaikin,
edit: I think I might have found a possible source of the resource leak. In our NSC the streams returned by MonitorConnections linger on, as they are using the main context... After running a test with fixed context handling for the streaming RPC, the steady increase in memory usage disappeared.
The NSC used in Meridio: https://github.com/Nordix/Meridio/blob/master/cmd/proxy/main.go#L243 (There's another NSC as well, but a memory increase is visible when using only the NSC above.)
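For illustration, a minimal sketch of the fixed context handling described above, again assuming the generated MonitorConnections API; monitorOnce and its parameters are hypothetical names, not the actual Meridio code:

```go
package monitor

import (
	"context"

	"github.com/networkservicemesh/api/pkg/api/networkservice"
	"google.golang.org/grpc"
)

// monitorOnce gives the stream its own cancellable context derived from the
// caller's context and always cancels it when the consumer stops reading, so
// an abandoned stream is torn down on both the client and nsmgr instead of
// lingering until process shutdown.
func monitorOnce(ctx context.Context, cc *grpc.ClientConn, handle func(*networkservice.ConnectionEvent)) error {
	// Per-stream context: cancelled as soon as this function returns.
	streamCtx, cancel := context.WithCancel(ctx)
	defer cancel()

	monitorClient := networkservice.NewMonitorConnectionClient(cc)
	stream, err := monitorClient.MonitorConnections(streamCtx, &networkservice.MonitorScopeSelector{})
	if err != nil {
		return err
	}

	for {
		event, err := stream.Recv()
		if err != nil {
			return err // the deferred cancel releases the stream's resources
		}
		handle(event)
	}
}
```

With this shape a reconnect loop simply calls monitorOnce again; the deferred cancel guarantees the previous stream is released before a new one is opened.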
On a plain Kind cluster I also noticed the differences Szilard reported (leaving the cluster intact, without any traffic). The nsmgrs with the highest memory usage hosted more (custom) NSEs (in both cases below, one nsmgr had 3 NSEs, one nsmgr had 1 NSE, and two nsmgrs ran without NSEs).
Meridio 1.1.1
Meridio 1.1.2
Hello @zolug,
I mentioned this problem on the last call: https://github.com/networkservicemesh/cmd-nsmgr/issues/675. The problem can actually be on the client side. At the moment we are working around it on the server side, which may not be a perfect solution. We currently plan to take a deeper look at the client-side code.
@szvincze Could we test ghcr.io/networkservicemesh/ci/cmd-nsmgr:d1e85e8 when you get time?
Note: Focus on nsmgr mem/fd consumption. I expect that it is not leaking.
We just ran a 6-hour test and focused on file descriptors. It looks very similar to our previous tests. There are a few more FDs for nsmgr after the test, but that was the case with the workarounds too. Even though the figures are similar from a memory consumption point of view, if we should also compare it to the v1.14.1 release, then I assume we should run another 60-hour test with this build.
It looks good; it seems we will already be able to drop the workarounds in the next releases. Let's keep this ticket open until the next RC.
Sometimes the memory consumption of nsmgr increases uncontrollably.
Reproduced with release v1.14.0-rc.2.