Closed: tdanylchuk closed this 6 years ago
I would think this problem could be mitigated by scaling out the config server with a couple of instances. Are your above findings based on a single instance?
Clustering the config service may help, but isn't it excessive? Each node would just be stuck and its resources wouldn't be used at all. We would have a huge Java/Tomcat overhead while useful processing time would be under 10%.
I don't understand why it would be excessive. This is one of the typical solutions for handling a lot of load. With the additional instances the app should not experience the problems you described.
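For example, a client can be pointed at several config server instances for failover; a minimal sketch, assuming a client version that accepts a comma-separated list of URIs (the hostnames are hypothetical):

```yaml
spring:
  cloud:
    config:
      # hypothetical hostnames; the client tries each config server in order
      # until one of them responds
      uri: http://config-server-1:8888,http://config-server-2:8888
```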
See also #6 (suggested alternative implementation)
We ran into a similar issue and, as suggested by Ryan, were able to scale out to get around the limitation. I wanted to share that what we noticed is not simply a latency issue: the affected cloud config server instances don't recover once they get overwhelmed with requests unless they are restarted.
Below are 2 thread dumps showing all threads waiting on a monitor held by a thread doing a native file-system I/O call that never returns.
Threads waiting for the lock
Thread holding the monitor
Environment:
And you have 250+ services hitting the config server at the same time?
I can confirm that the config service just gets stuck and cannot process any requests without a restart. And yes, there can be a case where even more services try to get properties at the same time, especially if the clients' config health check is enabled. In our case the problem occurs during a network spike between git and the config server: all requests just stand in a line waiting for the first one to fail on a connection timeout, but even in this case the retry just returns it to the line.
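To be concrete, the retry here is the standard config client retry; a sketch assuming spring-retry is on the classpath, with roughly the default values:

```yaml
spring:
  cloud:
    config:
      fail-fast: true          # throw on startup if the config server can't be reached
      retry:
        initial-interval: 1000 # ms between attempts
        multiplier: 1.1
        max-attempts: 6
```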
@tdanylchuk do you have a way to re-create the connection timeout issue? I'm not sure what you mean by your last sentence ("the retry just returns it to the line"). Can you explain a bit? I guess maybe you just mean that the retry is never successful because of some pathology in the network connection to git? Removing locks and making the retry faster wouldn't necessarily help, right?
@pankesh I don't know why you think it's anything other than latency at the end of the day. Your screenshots only demonstrate what we know: that only one request can be serviced at a time. Your assertion that the filesystem call "never returns" is hard to justify.
@dsyer yes, correct, it wouldn't be faster, since after a failure the clients will just try again and hit the same issue. I was able to solve this only by disabling all client health checks and decreasing the JGit timeout to 1 second. Unfortunately that is not the best way to solve the issue... For example, if a service is using the Cloud Bus, it should be sure that its connection to the config service is OK, so the health check should be enabled; but if a spike occurs, health will be DOWN for all instances once the retries are exhausted.
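For reference, this is roughly what that workaround looks like; a sketch, and the exact property names may differ between releases:

```yaml
# config client: stop the config-server health indicator from generating
# extra requests against the config server
health:
  config:
    enabled: false
---
# config server: shorten the JGit timeout (in seconds) so a slow git provider
# fails fast instead of holding the lock for the full default timeout
spring:
  cloud:
    config:
      server:
        git:
          timeout: 1
```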
@dsyer Agreed about the one request at a time in the current design. Regarding the screenshots: that thread analysis consists of 3 dumps collected 30 s apart, and the particular thread that is holding the monitor never returned. Our latency back from our enterprise GitHub is about 4-6 s, so there should have been no reason for the monitor to be held that long.
We know how to reproduce this in our environment, so if there is anything additional that you'd like me to collect/capture, I can provide it here. Tx!
30s isn't all that long if there are network timeouts. You don't need hundreds of clients to see this behaviour - only 2 - if the git server is unresponsive and the network is in a pathological state. Your 4s "normal" latency is really sad, and it will inevitably lead to issues if many clients at once all need to do something with that level of delay.
We are using config-server + git as storage for properties.
From time to time we face the case when the git provider becomes a turtle and the config service gets stuck serving properties to clients.
Some info regarding this:
The problem: if the number of requests (health checks plus config retrievals themselves) is huge and the response time from the git repo is not even that slow, just bigger than expected, all threads get stuck and the config service becomes unavailable because Tomcat has no free threads left. That can mark all nodes DOWN on the discovery server, so the whole system becomes unavailable.
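Every one of those requests competes for Tomcat's finite worker pool; a sketch of the relevant setting (property name and default depend on the Boot version):

```yaml
server:
  tomcat:
    # default worker pool size; once all of these threads are blocked behind
    # the git refresh, the config server stops answering anything, including
    # health checks from the discovery server
    max-threads: 200
```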
Possible solutions:
`synchronized` blocks are all around the code (at some point this could even lead to a deadlock, since it is very hard to track all the `synchronized` blocks).
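For illustration only, a simplified sketch (not the actual Spring Cloud Config code) of the pattern in question: a single `synchronized` refresh path means every request thread queues behind whichever thread is currently waiting on git, so one slow fetch can stall the whole Tomcat pool:

```java
import java.util.Properties;

// Simplified sketch of the pattern; the real EnvironmentRepository looks different.
public class GitBackedRepositorySketch {

    private final Object refreshLock = new Object();

    public Properties findProperties(String application, String profile) {
        // Every request thread has to pass through this single monitor.
        synchronized (refreshLock) {
            // If the git provider is slow (or the network is timing out), the thread
            // holding the lock blocks here, and every other Tomcat worker thread
            // piles up on the synchronized block above until none are left free.
            fetchFromGit();
            return loadFromWorkingCopy(application, profile);
        }
    }

    private void fetchFromGit() {
        // e.g. a JGit fetch/clone over the network
    }

    private Properties loadFromWorkingCopy(String application, String profile) {
        // read the property files from the local clone of the repo
        return new Properties();
    }
}
```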
So the question to the members: what is the best way to solve this, in your opinion?