hazelcast / hazelcast-csharp-client

Hazelcast .NET Client
https://hazelcast.com/clients/dotnet/
Apache License 2.0

Memory Leak when doing massive lock/get/set/unlock/publish actions [API-2066] #840

Closed: z-guohui-qian closed this issue 1 year ago

z-guohui-qian commented 1 year ago

This issue happens when we perform massive numbers of Hazelcast lock/get/set/unlock/publish operations.

As you can see in the dotMemory screenshot, the heap size of the application is stable, but the unmanaged memory size keeps growing.

[Screenshot 2023-06-16 220108 (dotMemory)]

The sample project for this issue: MassiveTransfer.zip
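
For context, the hot path described above boils down to something like the following sketch; this is not the attached MassiveTransfer sample, just an illustration of the lock/get/set/unlock/publish pattern against the Hazelcast .NET 5.x API, with made-up map, topic and key names:

```csharp
using System.Threading.Tasks;
using Hazelcast;

// Minimal illustration of the lock/get/set/unlock/publish pattern described above.
// This is NOT the attached MassiveTransfer sample: map, topic and key names are
// made up, and configuration/error handling is omitted.
public static class MassiveTransferSketch
{
    public static async Task RunAsync()
    {
        var options = new HazelcastOptionsBuilder().Build();
        await using var client = await HazelcastClientFactory.StartNewClientAsync(options);

        await using var map = await client.GetMapAsync<string, string>("transfer-map");
        await using var topic = await client.GetTopicAsync<string>("transfer-events");

        for (var i = 0; ; i++)
        {
            var key = "key-" + (i % 100);

            await map.LockAsync(key);                   // lock
            try
            {
                var current = await map.GetAsync(key);  // get (null if absent)
                await map.SetAsync(key, current + "x"); // set
            }
            finally
            {
                await map.UnlockAsync(key);             // unlock
            }

            await topic.PublishAsync($"{key} updated"); // publish
        }
    }
}
```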

emreyigit commented 1 year ago

Hi @z-guohui-qian, thank you for the feedback. We'll get back to you after investigating the issue.

zpqrtbnk commented 1 year ago

Could you please indicate:

- Which version of the Hazelcast .NET client is used for this benchmark (it seems to be 5.2.2)?
- Which version of the .NET runtime you are building for (it seems to be 6.0)?
- Which version of the .NET runtime is used for running the benchmark (maybe you have 7.0 installed)?
- From your screenshot, I assume that the OS is Windows, correct?
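
(For reference, the client assembly version and the runtime that actually executes the process can be read from the process itself; a minimal sketch, assuming the Hazelcast.Net package is referenced:)

```csharp
using System;
using System.Runtime.InteropServices;
using Hazelcast;

// Print the Hazelcast .NET client assembly version and the runtime actually
// executing the process (which may differ from the TargetFramework it was built for).
var clientVersion = typeof(HazelcastOptions).Assembly.GetName().Version;
Console.WriteLine($"Hazelcast .NET client: {clientVersion}");
Console.WriteLine($"Runtime: {RuntimeInformation.FrameworkDescription}");
```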

z-guohui-qian commented 1 year ago

We tried docker images like mcr.microsoft.com/dotnet/sdk:6.0-jammy and mcr.microsoft.com/dotnet/sdk:7.0-jammy.

zpqrtbnk commented 1 year ago

Update: reproduced the situation by running the process for about 15hrs. Managed memory is totally stable, but unmanaged memory grows - very slowly (from 25MB to 32MB in 15hrs) - but it grows. Investigating.

zpqrtbnk commented 1 year ago

[image attachment]

zpqrtbnk commented 1 year ago

Notes: while running over a period of ~24hrs, the number of threads is stable, the number of process handles is stable, and managed memory is rock stable as per detailed managed heap comparisons, yet total memory (private bytes) goes up by about 5MB per 12hrs, which, although small... is not what we expect. Investigating.
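
(For the record, these counters can also be sampled from inside the process with plain System.Diagnostics calls; a sketch of the idea only, since the figures in this issue come from dotMemory, Process Explorer and WinDbg:)

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Log the counters discussed above (threads, handles, managed heap, private bytes)
// once per minute. A sketch only: the figures in this issue were observed with
// dotMemory, Process Explorer and WinDbg, not with this helper.
await SampleCountersAsync(TimeSpan.FromMinutes(1));

static async Task SampleCountersAsync(TimeSpan period)
{
    var process = Process.GetCurrentProcess();
    while (true)
    {
        process.Refresh(); // re-read the OS-level counters for this process
        Console.WriteLine(
            $"{DateTime.UtcNow:O} " +
            $"threads={process.Threads.Count} " +
            $"handles={process.HandleCount} " +
            $"managedKB={GC.GetTotalMemory(false) / 1024} " +
            $"privateKB={process.PrivateMemorySize64 / 1024}");
        await Task.Delay(period);
    }
}
```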

zpqrtbnk commented 1 year ago

Notes: while running over a period of ~48hrs, everything is stable except the total memory (private bytes)... when monitoring with dotMemory. Out of curiosity, I let an un-monitored instance of the process run for 24hrs. The Virtual Memory Private Bytes as well as the Physical Memory Working Set are, guess what, absolutely rock stable. As in, without instrumentation, there seems to be zero unmanaged memory leak.

This heavily suggests that the leak is caused by instrumentation. Will try to confirm the hypothesis.

zpqrtbnk commented 1 year ago

Notes:

Inspecting a memory dump of the process being monitored by dotMemory:

>!exts.heap -s -h 

*********************************************************************************
                                              NT HEAP STATS BELOW
*********************************************************************************
LFH Key                   : 0x5749e3ae1d5d6576
Termination on corruption : ENABLED
          Heap     Flags   Reserv  Commit  Virt   Free  List   UCR  Virt  Lock  Fast 
                            (k)     (k)    (k)     (k) length      blocks cont. heap 
-------------------------------------------------------------------------------------
000001d54fbf0000 00000002   32944  22960  32552    795   316     6    0      b   LFH
000001d54f9a0000 00008000      64      4     64      2     1     1    0      0      
000001d551630000 00001002    1472    112   1080     32     7     2    0      0   LFH
000001d5515d0000 00001002      60      8     60      2     1     1    0      0      
000001d551580000 00041002      60      8     60      5     1     1    0      0      
-------------------------------------------------------------------------------------

>!heap -l
Searching the memory for potential unreachable busy blocks.
[...]
378102 potential unreachable blocks were detected.

Inspecting another dump of that same process, taken some hours later:

>!exts.heap -s -h 

*********************************************************************************
                                              NT HEAP STATS BELOW
*********************************************************************************
LFH Key                   : 0x5749e3ae1d5d6576
Termination on corruption : ENABLED
          Heap     Flags   Reserv  Commit  Virt   Free  List   UCR  Virt  Lock  Fast 
                            (k)     (k)    (k)     (k) length      blocks cont. heap 
-------------------------------------------------------------------------------------
000001d54fbf0000 00000002   49132  35524  48740    702   321     7    0      c   LFH
000001d54f9a0000 00008000      64      4     64      2     1     1    0      0      
000001d551630000 00001002    1472    112   1080     32     7     2    0      0   LFH
000001d5515d0000 00001002      60      8     60      2     1     1    0      0      
000001d551580000 00041002      60      8     60      5     1     1    0      0      
-------------------------------------------------------------------------------------

>!heap -l
Searching the memory for potential unreachable busy blocks.
[...]
748827 potential unreachable blocks were detected.

We see that the first heap has grown (commit from 22,960 K to 35,524 K), and the number of "potential unreachable blocks" has roughly doubled (from 378,102 to 748,827).

Now, if we inspect a dump of another run of the process, not monitored by dotMemory, after 24hrs of execution:

>!exts.heap -s -h

*********************************************************************************
                                              NT HEAP STATS BELOW
*********************************************************************************
LFH Key                   : 0x3ab6ce2b1b9890c7
Termination on corruption : ENABLED
          Heap     Flags   Reserv  Commit  Virt   Free  List   UCR  Virt  Lock  Fast 
                            (k)     (k)    (k)     (k) length      blocks cont. heap 
-------------------------------------------------------------------------------------
00000200f78f0000 00000002    8568   6500   8176   1554   173     5    0      f   LFH
    External fragmentation  24 % (173 free blocks)
00000200f76f0000 00008000      64      4     64      2     1     1    0      0      
00000200f9350000 00001002    1472    112   1080     32     7     2    0      0   LFH
00000200f94b0000 00001002      60      8     60      2     1     1    0      0      
00000200f9630000 00041002      60      8     60      5     1     1    0      0      
-------------------------------------------------------------------------------------

>!heap -l
Searching the memory for potential unreachable busy blocks.
Heap 00000200f78f0000
Heap 00000200f76f0000
Heap 00000200f9350000
Heap 00000200f94b0000
Heap 00000200f9630000
Scanning VM ...
Scanning references from 18183 busy blocks (3 MBytes) ......
Entry             User              Heap              Segment               Size  PrevSize  Unused    Flags
-------------------------------------------------------------------------------------------------------------
00000200f78f9ee0  00000200f78f9ef0  00000200f78f0000  00000200f78f0000        20        50         8  busy 
[...]
00000200fbe2aaa0  00000200fbe2aab0  00000200f78f0000  00000200f9cc5200        20      -            8  LFH;busy 
00000200fbe2ab40  00000200fbe2ab50  00000200f78f0000  00000200f9cc5200        20      -            8  LFH;busy 
1553 potential unreachable blocks were detected.

The first heap is way smaller, as is the number of "potential unreachable blocks" (which is several orders of magnitude smaller). Memory usage as reported by a tool such as Sysinternals' Process Explorer shows stable values for VM Private Bytes and PM Working Set.

At that point we start dotMemory to monitor that very process.

Approximately 30 minutes later, Process Explorer shows that VM Private Bytes has risen by about 3MB and PM Working Set has noticeably increased too. The first heap size hasn't changed yet, but the number of "potential unreachable blocks" has already jumped to 8,230. Another 30 minutes later, that number is 13,861 and the heap is growing.

About 12 hours later, the first heap size has reached 32MB and it reports 267,433 blocks.

At that point we detach dotMemory from the process. WinDbg reports that "the CLR has fully detached and unloaded the profiler". A few hours later, the first heap size has not grown, and it reports 267,616 blocks. About 12 hours later, the first heap size has still not changed and it reports 267,714 blocks.

At that point everything seems to indicate that the leak is caused by dotMemory itself. Going to run a few more analyses before closing this issue.

github-actions[bot] commented 1 year ago

Internal Jira issue: API-2066

zpqrtbnk commented 1 year ago

Figuring out what exactly in dotMemory is leaking is out of the scope of this issue and highly time-consuming. I am going to close this issue as non-repro on the Hazelcast .NET Client side, as it cannot be reproduced when not monitoring with dotMemory.

Please feel free to reopen the issue with further details (context, etc.) in case you feel like it.

z-guohui-qian commented 1 year ago

Thanks for your investigation. There may be some conflict between dotMemory and the Hazelcast .NET Client. We tried monitoring our program with no Hazelcast reads/writes, and the dotMemory monitoring worked fine with no leaks. We will try some other profilers.