networkservicemesh / cmd-nsmgr

Network Service Manager
Apache License 2.0

Memory jump in high-load scenario with nsc/nse release/v1.11.2+ #703

Open denis-tingaikin opened 2 months ago

denis-tingaikin commented 2 months ago

Sometimes memory consumption increases uncontrollably in the nsmgr.

Reproduced with release v1.14.0-rc.2

(screenshot attached)

szvincze commented 1 month ago

We tested with custom clients and endpoints that used SDK v1.13.2, where we observed the behavior described above. One time we let the nsmgrs reach the memory limit and get restarted. Surprisingly, after the nsmgr processes were OOMKilled, they came up properly and did not show any significant increase in memory consumption. In this picture the restarts happened at 0:00 and 05:00.

Screenshot from 2024-09-23 08-37-31

After that we also tried restarting nsmgr manually to see if it had an effect on the increasing memory consumption. We restarted nsmgrs in different ways across the tests to see whether the restart method matters, but saw no difference between killing the process, deleting the pod, etc.; it always flattened the memory curve.

Here I add some more pictures we took during our tests.

nsmgr-memory-after-restart-1

Screenshot from 2024-09-23 08-36-34

Screenshot from 2024-09-23 08-35-59

Screenshot from 2024-09-23 08-35-14

Screenshot from 2024-09-23 08-34-43

Each test showed the same behavior. For some reason the restart of nsmgr completely eliminated the memory increase.

This last one shows the test where we let the system continue to run after the restart, to see if memory starts growing again some time later, but nothing suspicious happened for more than 11 hours: Screenshot from 2024-09-23 08-34-00

Do you have any idea what could cause this behavior?

denis-tingaikin commented 1 month ago

> Do you have any idea what could cause this behavior?

My guess is that the problem lies in the diff in the nsc from v1.11.2 to main. Most likely, monitor streams are not being closed. I am currently working on a quick reproduction; I will let you know when we have more information.

szvincze commented 1 month ago

> Do you have any idea what could cause this behavior?
>
> My guess is that the problem lies in the diff in the nsc from v1.11.2 to main. Most likely, monitor streams are not being closed. I am currently working on a quick reproduction; I will let you know when we have more information.

Would that explain why the memory graph flattens after restarting nsmgr?

szvincze commented 1 month ago

Just a small addition. It seems that when we stop the test and delete the respective test namespace, nsmgr gives back the memory it leaked, which was not the case previously. Screenshot from 2024-09-24 11-54-18

zolug commented 1 month ago

Hi @denis-tingaikin,

edit: I think I might have found a possible source of the resource leak. In our NSC, the streams returned by MonitorConnections linger, since they use the main context... After running a test with fixed context handling for the streaming RPC, the steady increase in memory usage disappeared.

The NSC used in Meridio: https://github.com/Nordix/Meridio/blob/master/cmd/proxy/main.go#L243 (There's another NSC as well, but a memory increase is visible when using only the NSC above.)

On a plain Kind cluster I also noticed the differences Szilard reported (by leaving the cluster intact without any traffic). The nsmgrs with the highest memory usage hosted more (custom) NSEs (in both cases below, one nsmgr had 3 NSEs, one nsmgr had 1 NSE, and two nsmgrs ran without NSEs).

Meridio 1.1.1 Screenshot from 2024-10-01 17-48-16

Meridio 1.1.2 Screenshot from 2024-10-01 13-29-58

denis-tingaikin commented 1 month ago

Hello @zolug,

I mentioned this problem (https://github.com/networkservicemesh/cmd-nsmgr/issues/675) on the last call. The problem can actually be on the client side. At the moment we are working around it on the server side, which may not be a perfect solution. The plan is to look more deeply into the client-side code.

denis-tingaikin commented 1 week ago

@szvincze Could we test ghcr.io/networkservicemesh/ci/cmd-nsmgr:d1e85e8 when you have time?

Note: Focus on nsmgr mem/fd consumption. I expect that it's not leaking.

szvincze commented 5 days ago

> @szvincze Could we test ghcr.io/networkservicemesh/ci/cmd-nsmgr:d1e85e8 when you have time?
>
> Note: Focus on nsmgr mem/fd consumption. I expect that it's not leaking.

We just ran a 6-hour test and focused on file descriptors. It looks very similar to our previous tests. There are slightly more FDs for nsmgr after the test, but that was the case with the workarounds too. Even though the figures are similar from a memory consumption point of view, if we want to compare this build to the v1.14.1 release as well, I assume we should run another 60-hour test with it.
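For reference, a simple way to sample a process's open-FD count on Linux is to list `/proc/<pid>/fd`. This sketch uses the current shell's PID as a stand-in; for nsmgr you would resolve the PID first (the `pgrep -o nsmgr` line is an assumption about the process name):

```shell
# Count open file descriptors for a process (Linux).
# For nsmgr inside its container you would use something like:
#   pid=$(pgrep -o nsmgr)
pid=$$
fd_count=$(ls /proc/"$pid"/fd | wc -l)
echo "process $pid has $fd_count open fds"
```

Sampling this periodically during a long test makes it easy to see whether FDs (and hence leaked monitor streams) grow without bound or plateau.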

FDs-before-test FDs-after-3h FDs-after-6h FDs-after-test

denis-tingaikin commented 5 days ago

It looks good; it seems we can already get rid of the workarounds in the next releases. Let's keep this ticket open till the next RC.