shayhatsor / zookeeper

Apache ZooKeeper .NET async Client
https://nuget.org/packages/ZooKeeperNetEx/
Apache License 2.0
236 stars 53 forks source link

ZK client floods the memory with Task objects on connection loss #12

Closed pranaslu closed 7 years ago

pranaslu commented 7 years ago

When I add a FW rule to disable a ZK connection, then memory usage grows to 1GB in a few seconds. When I ran a profiler I saw that memory is flooded with Task objects, that were created in AsyncManualResetEvent class. This happens in production when CPU usage in the VM is high as well. First a ZooKeeper ConnectionLoss exception is logged, and soon the process memory starts to grow very fast. It seems like the ThreadPool doesn't give a chance for the connection ping Tasks to fire because the CPU usage is high. Then a ConnectionLoss exception is logged. And then the Client starts to create the Tasks objects without any Delay.

I tested this behaviour with 3.4.8.3 and 3.4.9.1.

Here's the NUnit test fixture:

[TestFixture]
public class ZookeeperMemoryLeakTest_Manual {
    // Zookeeper must run on Linux machine, this test adds firewall rule

    private static readonly string zkSrvIp = "192.168.60.10";
    private static readonly string zkSrvUser = "root";
    private static readonly string zkSrvPassw = "Password";
    private static readonly string zkPort = "2181";
    private static readonly string fwRule = $"INPUT -p tcp --dport {zkPort} -j DROP;";
    private static readonly string zkConnStr = $"{zkSrvIp}:{zkPort}";

    [TearDown]
    public virtual void TestFixtureTearDown() {
        Console.WriteLine("Removing Firewall rule");
        SshUtils.ExecuteSshCommand(zkSrvIp, zkSrvUser, zkSrvPassw, $"sudo iptables -D {fwRule}");
    }

    [Test]
    [Explicit]
    [Description("See Ram and Processor usage. Long running.")]
    public void MemoryLeakTest() {
        var zookeeper = new ZooKeeper(zkConnStr, 15000, new ClientWatch());
        while (zookeeper.getState() != ZooKeeper.States.CONNECTED) {
            Thread.Sleep(10);
        }
        // Add FW rule
        SshUtils.ExecuteSshCommand(zkSrvIp, zkSrvUser, zkSrvPassw, $"sudo iptables -I {fwRule}");
        var totalBytesOfMemoryUsedBefore = Process.GetCurrentProcess().WorkingSet64;
        Console.WriteLine("Waiting..");
        Thread.Sleep(20000);
        Console.WriteLine("Done Waiting..");
        long totalBytesOfMemoryUsedAfter = Process.GetCurrentProcess().WorkingSet64;
        var memoryUsageDiffMB = (totalBytesOfMemoryUsedAfter - totalBytesOfMemoryUsedBefore) / 1024 / 1024;
        Console.WriteLine($"memoryUsageDiffMB: {memoryUsageDiffMB} MB");
        Assert.Less(memoryUsageDiffMB, 100);
    }

    #region Watcher for zookeeper

    private class ClientWatch : Watcher {
        public override Task process(WatchedEvent we) {
            var eventState = we.getState();
            Console.WriteLine($"State from watcher: {eventState}");
            return Task.FromResult<object>(null);
        }
    }

    #endregion
}
shayhatsor commented 7 years ago

Thanks for reporting the bug. I usually don't use Linux, would you mind running the test with debug log level and post it here? I think I understand where the problem comes from but would like to make sure.

pranaslu commented 7 years ago

Thank you, Shay, for a quick response.

I got the same situation with the simplified test:

[Test]
public void MemoryLeakTest() {
    ZooKeeper.LogLevel = TraceLevel.Verbose;
    //var zookeeper = new ZooKeeper("111.111.111.111:2181", 15000, new ClientWatch());
    var zookeeper = new ZooKeeper("127.0.0.1:2181", 15000, new ClientWatch());
    var timeOut = TimeSpan.FromSeconds(30);
    var startTime = DateTime.Now;
    while (zookeeper.getState() != ZooKeeper.States.CONNECTED) {
        Thread.Sleep(10);
        if (DateTime.Now > startTime.Add(timeOut))
        {
            Assert.Fail("Test timeout.");
        }
    }
}

I ran this test on windows without a ZK server at all. 1) When I use a connStr "111.111.111.111:2181" the memory starts to grow after 15s timeout. See the ZK log attached as well as img with memory and cpu consumption graphs. 111.111.111.111_ZK.2016-11-30-10.25.07.926Z.txt

111 111 111 111_zkclientissue

2) When I use a connStr "127.0.0.1:2181" (there's no ZM server on my localhost) the situation is a bit different, but the memory and cpu hops are still visible. ZK log attached as well as img with memory and cpu consumption graphs. 127.0.0.1_ZK.2016-11-30-10.32.53.686Z.txt

127 0 0 1_zkclientissue

shayhatsor commented 7 years ago

@pranaslu, thanks a lot for the complete analysis of the issue ! I love the screen grabs :+1: Now that I have an easy repo on windows, it'll be much easier to identify the bug.

shayhatsor commented 7 years ago

@pranaslu, this doesn't reproduce on my machine. But looking closely at your screen grabs I do see something weird happening. The yellow marks represents GC activity which in your case correlates with the memory/CPU consumption. I suspect that you're using a VM with only one core. If that's the case, check your GC settings and try to add another core. Tell me if that helps or at least changes the results.

tomasdeml commented 7 years ago

I also observed some excessive CPU consumption when testing ZooKeeper server failover. When ZK client cannot connect to any server, it seems to be stuck in some tight reconnection loop. My machine has 1 socket, 4 cores totaling 8 logical processors. However I did not observe increase of memory allocation.

pranaslu commented 7 years ago

Thanks, @tomasdeml, for confirmation of an issue.

@shayhatsor, I ran the test on my laptop with "Intel® Core™ i7-6820HQ Processor" (4 cores, 8 threads). https://ark.intel.com/products/88970/Intel-Core-i7-6820HQ-Processor-8M-Cache-up-to-3_60-GHz.

I believe it's a combination of CPU power/no-of-threads and GC that triggers the memory growth.

My OS: image .NET runtimes: image

There are differences in GC in different .NET versions: http://stackoverflow.com/questions/5643147/determining-which-garbage-collector-is-running/8416915#8416915

I updated the test to print GCSetting and .NET version:

    [Test]
    public void MemoryLeakTest() {
        Get45or451FromRegistry();
        Console.WriteLine($"IsServerGC: {GCSettings.IsServerGC}, LatencyMode: {GCSettings.LatencyMode}.");
        ZooKeeper.LogLevel = TraceLevel.Verbose;
        var zookeeper = new ZooKeeper("111.111.111.111:2181", 15000, new ClientWatch());
        //var zookeeper = new ZooKeeper("127.0.0.1:2181", 15000, new ClientWatch());
        var timeOut = TimeSpan.FromSeconds(30);
        var startTime = DateTime.Now;
        while (zookeeper.getState() != ZooKeeper.States.CONNECTED) {
            Thread.Sleep(10);
            if (DateTime.Now > startTime.Add(timeOut)) {
                Assert.Fail("Test timeout.");
            }
        }
    }

    private static void Get45or451FromRegistry() {
        using (var ndpKey = Microsoft.Win32.RegistryKey.OpenBaseKey(Microsoft.Win32.RegistryHive.LocalMachine, Microsoft.Win32.RegistryView.Registry32).OpenSubKey("SOFTWARE\\Microsoft\\NET Framework Setup\\NDP\\v4\\Full\\")) {
            int releaseKey = Convert.ToInt32(ndpKey.GetValue("Release"));
            if (true) {
                Console.WriteLine("Version: " + CheckFor45DotVersion(releaseKey));
            }
        }
    }

    // Checking the version using >= will enable forward compatibility,  
    // however you should always compile your code on newer versions of 
    // the framework to ensure your app works the same. 
    private static string CheckFor45DotVersion(int releaseKey) {
        if (releaseKey >= 393273) {
            return "4.6 RC or later";
        }
        if ((releaseKey >= 379893)) {
            return "4.5.2 or later";
        }
        if ((releaseKey >= 378675)) {
            return "4.5.1 or later";
        }
        if ((releaseKey >= 378389)) {
            return "4.5 or later";
        }
        // This line should never execute. A non-null release key should mean 
        // that 4.5 or later is installed. 
        return "No 4.5 or later version detected";
    }

Here's what it prints on my machine:

Version: 4.6 RC or later
IsServerGC: False, LatencyMode: Interactive.

Some screenshots from MemoryProfiler: image image

tomasdeml commented 7 years ago

Just wanted to note that our app is configured to use server GC (<gcServer enabled="true" />).

@pranaslu I believe it's a combination of CPU power/no-of-threads and GC that triggers the memory growth.

It may be caused by TPL scaling number of available slots for Tasks according to available processors. More processors > more capacity for running Tasks > higher CPU utilization / contention...

pranaslu commented 7 years ago

I enabled gcServer in ReSharper and got a bit different situation, the CPU is still very high, less garbage collections and memory is growing faster: image

I wander if the Network Response time affects this. Theoretical situation: Lets say there's some uncontrolled loop in ZK client that puts a lot of "Ping ZK Server" tasks and each "Ping ZK Server" task sends a request and registers a WaitHandle with Task Continuation in the ThreadPool to fire a continuation Task when the Network IO responds. If there's a bigger network response time, then more WiatHandles with Task Continuations gets into the ThreadPool stack. If CPU is better then situation gets worse, because it puts Continuations into the ThreadPool stack faster. The bigger the stack, the more work for a ThreadPool to track WaitHandles...

shayhatsor commented 7 years ago

Guys, I have just successfully reproduced the bug. The profilers didn't work correctly with the .net core dlls. So i compiled the code in a normal csproj targeting .net 4.52 and ran the profiles. Just as @pranaslu reported, the bugs are there. I guess that the GC is just masking the memory leak problem.

pranaslu commented 7 years ago

Great news, @shayhatsor . Thank you for your effort.

shayhatsor commented 7 years ago

I believe I solved the bug. It's a one liner that was hard to find. I'll roll out a new version soon. Thanks again for providing me with all the information needed for the repo.

shayhatsor commented 7 years ago

@pranaslu and @tomasdeml, did you have time to test the latest release? I'd like to make sure the bug is considered solved on your end before I close the issue, thanks.

pranaslu commented 7 years ago

@shayhatsor, this test passed! I started to run more tests with parallel connections and ZK restarts. And I'm seeing similar behavior when memory and cpu increases suddenly. Unfortunately I'm on a tight release schedule now, but once it finishes I'll post more tests that reproduce such behavior. Thanks again for your help!

pranaslu commented 7 years ago

Hi, @shayhatsor, attached is a test case and test output that cause the memory growth. ZooKeeperNetExTests.zip testOutput.zip

There's 20 threads with a zookeeper client per thread. Each thread creates nodes for 60s. If I restart the zookeeper in the middle of the test, then memory grows very fast.

I tested with v 3.4.9.2.

shayhatsor commented 7 years ago

@pranaslu, I've been busy with another project and just now got the time to go over the bug. You've provided once again a great repro, thanks for that. I'm working on a fix and will roll it out soon. I'm counting on you to give me the thumbs up after you run your rigorous tests 😉

shayhatsor commented 7 years ago

@pranaslu, please test the latest version 3.4.9.3 it's on nuget, thanks

pranaslu commented 7 years ago

@shayhatsor , I tested 3.4.8.3 vs 3.4.9.2 vs 3.4.9.3. The same test as above, but for ~5min. In one test run I simply ran 20 threads and each thread creates nodes synchronously. In another test run I ran the same test but restarted the ZK in the middle.

When ZK is not restarted memory grows similarly in all versions.

When I restart the ZK: v3.4.8.3 doesn't show significant change - connections resumes, minimal memory increase. v3.4.9.2 memory grows very fast on restart. but connections resumes later and most of memory is cleaned. v3.4.9.3 memory grows very fast on restart. connections times out.

RESULTS: NOTE: timescale differs in diagnostic tolls, but all tests ran for ~5min, except the last one because it thows timeout exception. I guess the difference comes from the high cpu usage.

v3.4.8.3, 5min, no restart:

2017-04-03T16:25:03.1363600+03:00: Created nodes count 258213 2017-04-03T16:25:03.1378647+03:00: connectionLossExceptionCnt = 0 2017-04-03T16:25:03.1378647+03:00: nodeExistsExceptionCnt = 0 2017-04-03T16:25:03.1671898+03:00: Stopping tasks 2017-04-03T16:25:03.1671898+03:00: 20 threads node creations took: 250186ms

image

v3.4.9.2, 5min, no restart:

2017-04-03T13:12:00.2656108+03:00: Created nodes count 191722 2017-04-03T13:12:00.2671150+03:00: connectionLossExceptionCnt = 0 2017-04-03T13:12:00.2671150+03:00: nodeExistsExceptionCnt = 0 2017-04-03T13:12:00.2795140+03:00: Stopping tasks 2017-04-03T13:12:00.2795140+03:00: 20 threads node creations took: 250408ms

image

v3.4.9.3, 5min, no restart:

2017-04-03T13:52:32.2595529+03:00: Created nodes count 196086 2017-04-03T13:52:32.2605553+03:00: connectionLossExceptionCnt = 0 2017-04-03T13:52:32.2605553+03:00: nodeExistsExceptionCnt = 0 2017-04-03T13:52:32.2736087+03:00: Stopping tasks 2017-04-03T13:52:32.2736087+03:00: 20 threads node creations took: 250655ms

image

v3.4.8.3, 5min, zk server restart:

2017-04-03T16:39:59.2733432+03:00: Created nodes count 239159 2017-04-03T16:39:59.2762872+03:00: connectionLossExceptionCnt = 20 2017-04-03T16:39:59.2762872+03:00: nodeExistsExceptionCnt = 58

image

v3.4.9.2, 5min, zk server restart:

image

v3.4.9.3, 5min, zk server restart:

2017-04-03T13:22:29.5448282+03:00: Created nodes count 66484 2017-04-03T13:22:29.5458398+03:00: connectionLossExceptionCnt = 64 2017-04-03T13:22:29.5458398+03:00: nodeExistsExceptionCnt = 0

image

Thanks again for your help.

shayhatsor commented 7 years ago

@pranaslu, thanks again for your continued help. I'm currently trying to, once again, figure out how to solve this issue. The most interesting discovery you've made is that version 3.4.8.3 doesn't show this behavior. I might be able to use it. I just have one minor request, if you have time, I'm trying to build the simplest test possible to reproduce these errors, I'm sure you'd be able to help with that. For example, as you've noted, it can be reproduced without multiple threads.

shayhatsor commented 7 years ago

@pranaslu, I've just run your test again and I think that there's a problem with the test itself. Correct me if I'm wrong, each thread of the test runs a tight loop that creates new ZK nodes. The growth in memory and CPU in version 3.4.9.3 seems logical since the tight loop fails with connections loss (which is the expected behavior in this case). It'd be great to have your input on this, maybe I'm missing something.

shayhatsor commented 7 years ago

@pranaslu, I'm currently closing this issue as I consider it solved. If you happen to have time to review my previous comments and think It should still be opened, please do so.

pranaslu commented 7 years ago

Hey, @shayhatsor, sorry for a delay. I agree with your comment. My last test case isn't valid. The other tests passed.

Thank you a lot for your help.