shayhatsor / zookeeper

Apache ZooKeeper .NET async Client
https://nuget.org/packages/ZooKeeperNetEx/
Apache License 2.0
236 stars 53 forks source link

CPU and memory usage increase when the app tries to re-connect to Zookeeper server #25

Closed Gomsilruk closed 6 years ago

Gomsilruk commented 6 years ago

Application process used 447 MB memory, and had 70 threads after it is reset. When I kill Zookeeper server, the app try to re-connect to Zookeeper server. At this moment, CPU usage, memory usage and thread increases. They keep increase up to 1605 MB for memory, 262 threads. CPU usage hits 100 max from time to time. Please find attached .gif in order to help your understand. Please see image below Also please kindly focus on dotnet.exe process.

image

image

This was tested on dotnet core 2.1 base and on Windows 10, CentOS 7. This issue happens on all version that supports .NET core (3.4.8.5, 3.4.9.0, 3.4.9.2, 3.4.9.3, 3.4.9.4, 3.4.10.1) Is there any way to resolve/improve this issue?

shayhatsor commented 6 years ago

Hi, please first go over #12 which uncovered some bugs which were (hopefully) fixed. In addition Also, please provide a sample code that reproduces this behaviour.

shayhatsor commented 6 years ago

I've found the bug and fixed it. will be in the next version. thanks for reporting !

j03wang commented 6 years ago

Any chance the next version might be released in the near future? I'm running into this particular issue quite a bit.

shayhatsor commented 6 years ago

can you check that the fix i've implemented solves the issue in your case ? i want to make sure, thanks!

j03wang commented 6 years ago

I applied the diff last week and am using a local build now. I'm still seeing random spikes in CPU after connection drops, so I'm not sure that the patch resolved the issue. Running in production with a low sampling frequency profiler says that there is contention with the .NET global timer queue lock (used when calling Task.Delay): image

While running the VS profiler locally, I'm seeing an increase in CPU just after reconnection: image

My retry scenario for all ZK calls catches the ConnectionLossException and awaits the SyncConnected event before continuing---I'm not sure if this makes a difference in the symptoms.

Thanks!

shayhatsor commented 6 years ago

the issue here is about high memory and CPU consumption. it seems that my fix solved the memory problem. if you put a delay in your mentioned retry code, does it reduce CPU ? i'm trying to pinpoint where the CPU spike manifests. thanks !

j03wang commented 6 years ago

My previous implementation had a Task.Delay instead of the await on the event, but I was concerned with the added contention on the timer queue. My process has several hundred independent async threads of execution all depending on calls to ZooKeeper for synchronization.

I'm not sure putting on a delay would make an improvement.

shayhatsor commented 6 years ago

the question is whether the CPU spike is causes by the retry loop you've mentioned, or the ZK client gets into a tight loop until it finds out it has been disconnected. in your case, do you have several hundred threads or async actions that use ZK for synchronization ?

j03wang commented 6 years ago

So my retry loop awaits the SyncConnected event before attempting to try again, so there shouldn't be any tight loops from my side. The await should remove the task from the thread pool, so there shouldn't be any spin waiting either.

shayhatsor commented 6 years ago

i see, thank you

shayhatsor commented 6 years ago

@j03wang, please test the latest commit. i think i (finally) solved the issue

j03wang commented 6 years ago

I applied the patch--I'm still seeing this from the VS profiler (where the two humps at 35s and 50s are ramps in CPU appear immediately after a reconnection--in my testing, I kill the local zk service, and then restart it), but I'll also test this on in prod and report back later this week.

image

shayhatsor commented 6 years ago

what was the configuration in your local test ? is it a local node using default configuration ? please provide as much information you think is relevant for a repro. thanks

j03wang commented 6 years ago

My local machine has 4 cores + HT (8 logical processors). My ZK local node setup uses the default configuration bundled with the distribution--I'm using 3.4.10. My local process runs 16 independent "workers" (async threads of execution) that make 3-4 calls to ZK approximately every 10 seconds or so. The manual testing I'm doing is just killing the ZK node, waiting anywhere between 1 to 10 seconds, and then starting the ZK node.

Depending on what's executing, there maybe a handful of ConnectionLossExceptions caught, but I don't think the number of exceptions correlates to the CPU uptick. The uptick doesn't seem to be deterministic either--sometimes the process reconnects successfully without the CPU increase, sometimes it reconnects successfully (as indicated by the SyncConnected event) but then has the increase in CPU.

shayhatsor commented 6 years ago

thanks a lot, i am now able to reproduce the issue. it seems that it's something with how the client gets disconnected from the node. for some reason the receive callback returns almost instantly with 0 bytes buffer and with no error. i will fix it this time, now it's personal 🤔

shayhatsor commented 6 years ago

I've tested the latest batch of changes. it seems that the race condition is finally solved. @j03wang, can you please confirm ?

j03wang commented 6 years ago

Yes--it seems like with the latest commit, I'm no longer seeing the symptoms. Thanks!