Closed Gomsilruk closed 6 years ago
Hi, please first go over #12, which uncovered some bugs that were (hopefully) fixed. In addition, please provide sample code that reproduces this behaviour.
I've found the bug and fixed it; it will be in the next version. thanks for reporting!
Any chance the next version might be released in the near future? I'm running into this particular issue quite a bit.
can you check that the fix i've implemented solves the issue in your case? i want to make sure, thanks!
I applied the diff last week and am using a local build now. I'm still seeing random CPU spikes after connection drops, so I'm not sure the patch resolved the issue. Profiling in production with a low sampling frequency shows contention on the .NET global timer queue lock (taken when calling Task.Delay):
While running the VS profiler locally, I'm seeing an increase in CPU just after reconnection:
My retry scenario for all ZK calls catches the ConnectionLossException and awaits the SyncConnected event before continuing; I'm not sure if this makes a difference in the symptoms.
Thanks!
the issue here is about high memory and CPU consumption. it seems that my fix solved the memory problem. if you put a delay in the retry code you mentioned, does it reduce CPU? i'm trying to pinpoint where the CPU spike manifests. thanks!
My previous implementation had a Task.Delay instead of the await on the event, but I was concerned with the added contention on the timer queue. My process has several hundred independent async threads of execution all depending on calls to ZooKeeper for synchronization.
I'm not sure adding a delay would make an improvement.
the question is whether the CPU spike is caused by the retry loop you've mentioned, or whether the ZK client gets into a tight loop until it finds out it has been disconnected. in your case, do you have several hundred threads or async actions that use ZK for synchronization?
My retry loop awaits the SyncConnected event before trying again, so there shouldn't be any tight loops on my side. The await also releases the thread back to the pool, so there shouldn't be any spin-waiting either.
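For anyone following along, the retry pattern described above can be sketched roughly like this. This is an illustrative stand-in, not the real ZooKeeperNetEx API: `FakeClient`, `waitForReconnect`, and `ConnectionLossError` are hypothetical names, and TypeScript is used only because its async/await behaves much like C#'s.

```typescript
// Illustrative sketch: FakeClient simulates a ZK-like client whose calls fail
// with ConnectionLossError while disconnected, and which exposes a promise
// that resolves on the next SyncConnected-style event.
class ConnectionLossError extends Error {
  constructor(msg: string) {
    super(msg);
    // Restore the prototype chain so `instanceof` works even on ES5 targets.
    Object.setPrototypeOf(this, ConnectionLossError.prototype);
  }
}

class FakeClient {
  private reconnected: Promise<void>;
  private signalReconnect!: () => void;

  constructor(private failuresBeforeSuccess: number) {
    // The Promise executor runs synchronously, so signalReconnect is set here.
    this.reconnected = new Promise<void>(res => (this.signalReconnect = res));
  }

  // Simulates a ZK call that fails while the connection is down.
  async getData(path: string): Promise<string> {
    if (this.failuresBeforeSuccess-- > 0) {
      // Simulate the session coming back shortly after the failure.
      setTimeout(() => this.signalReconnect(), 10);
      throw new ConnectionLossError("connection loss");
    }
    return `data@${path}`;
  }

  // Resolves on the next reconnect event (analogous to awaiting SyncConnected).
  waitForReconnect(): Promise<void> {
    return this.reconnected;
  }
}

// The retry loop: catch connection loss, park on the reconnect event, retry.
// No timer is involved, so nothing here should contend on a timer queue.
async function withRetry<T>(client: FakeClient, op: () => Promise<T>): Promise<T> {
  for (;;) {
    try {
      return await op();
    } catch (e) {
      if (!(e instanceof ConnectionLossError)) throw e;
      await client.waitForReconnect();
    }
  }
}
```

Because each caller parks on the reconnect event instead of polling with Task.Delay-style timers, the retry itself should neither spin nor add timer-queue contention, even with hundreds of concurrent workers.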
i see, thank you
@j03wang, please test the latest commit. i think i (finally) solved the issue
I applied the patch. I'm still seeing this from the VS profiler (the two humps at 35s and 50s are CPU ramps that appear immediately after a reconnection; in my testing, I kill the local ZK service and then restart it), but I'll also test this in prod and report back later this week.
what was the configuration in your local test? is it a local node using the default configuration? please provide as much information as you think is relevant for a repro. thanks
My local machine has 4 cores + HT (8 logical processors). My ZK local node setup uses the default configuration bundled with the distribution--I'm using 3.4.10. My local process runs 16 independent "workers" (async threads of execution) that make 3-4 calls to ZK approximately every 10 seconds or so. The manual testing I'm doing is just killing the ZK node, waiting anywhere between 1 to 10 seconds, and then starting the ZK node.
Depending on what's executing, there may be a handful of ConnectionLossExceptions caught, but I don't think the number of exceptions correlates with the CPU uptick. The uptick doesn't seem to be deterministic either: sometimes the process reconnects successfully without the CPU increase, and sometimes it reconnects successfully (as indicated by the SyncConnected event) but then the CPU increases.
thanks a lot, i am now able to reproduce the issue. it seems to be related to how the client gets disconnected from the node: for some reason the receive callback returns almost instantly with a 0-byte buffer and no error. i will fix it this time, now it's personal 🤔
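For context, a receive completion of 0 bytes with no error is the standard signal that the peer closed the socket. If a read loop treats it as "no data yet" and retries immediately, it spins at full CPU until something else notices the disconnect. A minimal sketch of that distinction, with a stubbed `readSome` (a hypothetical name) standing in for the socket receive callback:

```typescript
// Sketch of the failure mode: readSome returns the number of bytes read from
// the socket, where 0 means the remote end closed the connection. Handling 0
// as a disconnect (rather than looping again immediately) is what prevents
// the tight, CPU-burning loop described above.
function drain(readSome: () => number, onDisconnect: () => void): number {
  let total = 0;
  for (;;) {
    const n = readSome();
    if (n === 0) {
      // 0-byte completion with no error: the peer closed the socket.
      onDisconnect();
      return total;
    }
    total += n;
  }
}
```

If the `n === 0` branch were missing, `readSome` would keep completing instantly with 0 bytes and the loop would never yield, matching the symptoms in this thread.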
I've tested the latest batch of changes; it seems that the race condition is finally solved. @j03wang, can you please confirm?
Yes--it seems like with the latest commit, I'm no longer seeing the symptoms. Thanks!
The application process used 447 MB of memory and had 70 threads after it was reset. When I kill the ZooKeeper server, the app tries to reconnect to it, and at that moment CPU usage, memory usage, and the thread count all increase, climbing to 1605 MB of memory and 262 threads, with CPU usage hitting 100% from time to time. Please see the attached .gif and the image below to help understand the problem, and kindly focus on the dotnet.exe process.
This was tested on .NET Core 2.1 on Windows 10 and CentOS 7. The issue happens on every version that supports .NET Core (3.4.8.5, 3.4.9.0, 3.4.9.2, 3.4.9.3, 3.4.9.4, 3.4.10.1). Is there any way to resolve or improve this?