Closed: tdanylchuk closed this 6 years ago
I would think this problem could be mitigated by scaling out the config server with a couple of instances. Are your above findings based on a single instance?
Clustering the config service may help, but isn't it excessive? Each node would just be stuck and its resources wouldn't be used at all. We would have a huge Java/Tomcat overhead while useful processing time would be under 10%.
I don't understand why it would be excessive. This is one of the typical solutions for handling a lot of load. With the additional instances the app should not experience the problems you described.
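For example, a client can be pointed at several config server instances for failover; a minimal sketch, assuming a client version that accepts a comma-separated list of URIs (the hostnames are hypothetical):

```yaml
spring:
  cloud:
    config:
      # hypothetical hostnames; the client tries each config server in order
      # until one of them responds
      uri: http://config-server-1:8888,http://config-server-2:8888
```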
See also #6 (suggested alternative implementation)
We ran into a similar issue and, as suggested by Ryan, were able to scale out to get around the limitation. I wanted to share that what we noticed is not simply a latency issue: the affected cloud config server instances don't recover once they get overwhelmed with requests unless they are restarted.
Below are 2 thread dumps showing all threads waiting on a monitor held by a thread doing a native file-system I/O call that never returns.
Threads waiting for the lock
Thread holding the monitor
Environment:
And you have 250+ services hitting the config server at the same time?
I can confirm that the config service just gets stuck and cannot process any requests without a restart. And yes, there can be a case where even more services try to get properties at the same time, especially if the clients' config health check is enabled. In our case the problem occurs during a network spike between git and the config server: all requests just stand in a line waiting for the first one to fail on a connection timeout, but even in this case the retry just returns it to the line.
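To be concrete, the retry here is the standard config client retry; a sketch assuming spring-retry is on the classpath, with roughly the default values:

```yaml
spring:
  cloud:
    config:
      fail-fast: true          # throw on startup if the config server can't be reached
      retry:
        initial-interval: 1000 # ms between attempts
        multiplier: 1.1
        max-attempts: 6
```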
@tdanylchuk do you have a way to re-create the connection timeout issue? I'm not sure what you mean by your last sentence ("the retry just returns it to the line"). Can you explain a bit? I guess maybe you just mean that the retry is never successful because of some pathology in the network connection to git? Removing locks and making the retry faster wouldn't necessarily help, right?
@pankesh I don't know why you think it's anything other than latency at the end of the day. Your screenshots only demonstrate what we know: that only one request can be serviced at a time. Your assertion that the filesystem call "never returns" is hard to justify.
@dsyer yes, correct, it wouldn't be faster, since after a failure the clients will just try again and hit the same issue. I was able to solve this only by disabling all client health checks and decreasing the JGit timeout to 1 second. Unfortunately that is not the best way to solve the issue... For example, if a service is using the Cloud Bus, it should be sure that its connection to the config service is OK, so the health check should be enabled; but if a spike occurs, health will be DOWN for all instances once the retries are exhausted.
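For reference, this is roughly what that workaround looks like; a sketch, and the exact property names may differ between releases:

```yaml
# config client: stop the config-server health indicator from generating
# extra requests against the config server
health:
  config:
    enabled: false
---
# config server: shorten the JGit timeout (in seconds) so a slow git provider
# fails fast instead of holding the lock for the full default timeout
spring:
  cloud:
    config:
      server:
        git:
          timeout: 1
```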
@dsyer Agreed about the one request at a time in the current design. Regarding the screenshots: that thread analysis consists of 3 dumps collected 30 s apart, and the particular thread that is holding the monitor never returned. Our latency back from our enterprise GitHub is about 4-6 s, so there should have been no reason for the monitor to be held that long.
We know how to reproduce this in our environment, so if there is anything additional that you'd like me to collect/capture, I can provide it here. Tx!
30s isn't all that long if there are network timeouts. You don't need hundreds of clients to see this behaviour - only 2 - if the git server is unresponsive and the network is in a pathological state. Your 4s "normal" latency is really sad, and it will inevitably lead to issues if many clients at once all need to do something with that level of delay.
We are using config-server + git as storage for properties.
From time to time we face the case when the git provider becomes a turtle and the config service gets stuck serving properties to clients.
Some info regarding this:
The problem: if the number of requests (health checks plus config retrievals themselves) is huge and the response time from the git repo is not even that slow, just bigger than expected, all threads get stuck and the config service becomes unavailable because Tomcat has no free threads left. That can mark all nodes DOWN on the discovery server, so the whole system becomes unavailable.
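Every one of those requests competes for Tomcat's finite worker pool; a sketch of the relevant setting (property name and default depend on the Boot version):

```yaml
server:
  tomcat:
    # default worker pool size; once all of these threads are blocked behind
    # the git refresh, the config server stops answering anything, including
    # health checks from the discovery server
    max-threads: 200
```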
Possible solutions:
`synchronized` blocks are all around the code (at some point this could even lead to a deadlock, since it is very hard to track all the `synchronized` blocks).
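For illustration only, a simplified sketch (not the actual Spring Cloud Config code) of the pattern in question: a single `synchronized` refresh path means every request thread queues behind whichever thread is currently waiting on git, so one slow fetch can stall the whole Tomcat pool:

```java
import java.util.Properties;

// Simplified sketch of the pattern; the real EnvironmentRepository looks different.
public class GitBackedRepositorySketch {

    private final Object refreshLock = new Object();

    public Properties findProperties(String application, String profile) {
        // Every request thread has to pass through this single monitor.
        synchronized (refreshLock) {
            // If the git provider is slow (or the network is timing out), the thread
            // holding the lock blocks here, and every other Tomcat worker thread
            // piles up on the synchronized block above until none are left free.
            fetchFromGit();
            return loadFromWorkingCopy(application, profile);
        }
    }

    private void fetchFromGit() {
        // e.g. a JGit fetch/clone over the network
    }

    private Properties loadFromWorkingCopy(String application, String profile) {
        // read the property files from the local clone of the repo
        return new Properties();
    }
}
```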
So the question to the members: what is the best way to solve this, in your opinion?