I have not heard anything like this before.
You say the problem is happening in SimpleHostRoutingFilter; are you not using Zuul + Service Discovery in this app, then? I am curious whether the same problem would happen if you used service discovery to route requests through Zuul. In that case SimpleHostRoutingFilter would not be used.
If you would like to specify your own connection pool manager, please let us know what you find.
Subclassing SimpleHostRoutingFilter is possible. We made newConnectionManager() and newClient() protected for that reason. You can also use the RibbonRoutingFilter by giving Ribbon a static list of servers and see if that helps (no need to use Discovery).
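For example, something along these lines would route through Ribbon without discovery (just a sketch; the backend-service name and the hosts are placeholders):

zuul:
  routes:
    backend-service:
      path: /backend-service/v1/uri/**
      serviceId: backend-service
      stripPrefix: true

backend-service:
  ribbon:
    listOfServers: host1:8080,host2:8080

ribbon:
  eureka:
    enabled: false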
Thank you very much for the responses.
@ryanjbaxter I'm not sure I understood you. I'm using service discovery (via Eureka) to build the routes from my Zuul service to my other services. I can see, however, while debugging, that the SimpleHostRoutingFilter is the one being used to forward the requests. Are you saying this should not be the case?
@spencergibb Thank you for the suggestion, I will give that a try as well and see if there are any changes in the behaviour.
Could you show us your Zuul configuration?
Sure. My configuration is very simple, i.e. I only have some route forwarding configuration:
zuul:
  routes:
    backend-service:
      path: /backend-service/v1/uri/**
      stripPrefix: true
      url: http://backend-service/v1/uri
In addition, I have an auto-configuration class that autowires the following through its constructor:
@Autowired
public ZuulAutoConfiguration(ConfigurableListableBeanFactory beanFactory, StandardEnvironment environment) {
    this.environment = environment;
    this.resolver = beanFactory.getBeanExpressionResolver();
    this.context = new BeanExpressionContext(beanFactory, null);
    // helper method that puts the active port into the property source
    environment.getPropertySources().addFirst(eurekaZonePropertySource());
}
Unfortunately, subclassing SimpleHostRoutingFilter did not have much of a positive effect. I attempted to add my own PoolingHttpClientConnectionManager with setValidateAfterInactivity(), and also tried running a Runnable in the background to perform connection cleanup (see: http://hc.apache.org/httpcomponents-client-4.3.x/tutorial/html/connmgmt.html#d5e405).
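The background cleanup I tried was along the lines of the monitor thread from the linked tutorial (a sketch only; the 5-second wake-up and the 30-second idle threshold are values I picked arbitrarily):

import java.util.concurrent.TimeUnit;

import org.apache.http.conn.HttpClientConnectionManager;

public class IdleConnectionMonitorThread extends Thread {

    private final HttpClientConnectionManager connMgr;
    private volatile boolean shutdown;

    public IdleConnectionMonitorThread(HttpClientConnectionManager connMgr) {
        this.connMgr = connMgr;
    }

    @Override
    public void run() {
        try {
            while (!shutdown) {
                synchronized (this) {
                    wait(5000);
                    // Close connections whose keep-alive / TTL has expired.
                    connMgr.closeExpiredConnections();
                    // Close connections that have been idle for more than 30 seconds.
                    connMgr.closeIdleConnections(30, TimeUnit.SECONDS);
                }
            }
        }
        catch (InterruptedException ex) {
            // Terminate the monitor thread.
        }
    }

    public void shutdown() {
        shutdown = true;
        synchronized (this) {
            notifyAll();
        }
    }
}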
In addition, I have noticed while debugging locally that the call to future.get(timeout, tunit) in PoolingHttpClientConnectionManager:276 is the last point at which a connection stuck in CLOSE_WAIT will go back to the pool as a reused connection. Afterwards, however, it gets stuck in CLOSE_WAIT all over again instead of going into FIN_WAIT and eventually being closed altogether.
Eventually the number of 'released' connections is smaller than the number of those stuck in CLOSE_WAIT during the next release cycle, and no more requests can get through.
I'll keep looking for a potential cause, but right now I'm admittedly a little confused about the nature of this issue. I would very much appreciate it if you could provide any suggestions based on the above!
EDIT: To be absolutely sure, I retried a minimal (although not exact) replica of the Zuul + Eureka + Controller setup, and I could not replicate the issue with the CLOSE_WAIT connections. This leads me to believe that my stack may have some configuration set somewhere which is influencing or exacerbating this issue. I'll continue searching...
We are also having periodic outages where our systems seem to "go off the rails". We are still looking for a root cause, which may be a network disruption in our data center...but the resulting symptoms are the same:
We have a node.js web application that is talking to a spring-cloud-sidecar which is routing rest calls to our backend services.
We are seeing "Too many open files" in our Tomcat acceptor thread on both the backend web services AND in our sidecar logs.
We also have noticed that there are a very large number of sockets that appear to be stuck in "CLOSE_WAIT" and "TIME_WAIT" states.
Based on those symptoms we have been focusing on why we have so many open sockets.
One additional item: we are configuring our Tomcat connector as follows:

server:
  port: 8080
  compression:
    enabled: true
    mime-types: application/json
    min-response-size: 2048
  tomcat:
    # This is a "standard property" from spring cloud.
    max-threads: 400
    # These four additional properties were added via a TomcatConnectorCustomizer.
    accept-count: 20
    async-timeout-ms: 10000       # AsyncTimeout on the connector.
    keepAliveTimeout: 20000       # KeepAliveTimeout on the NIO handler.
    connection-timeout-ms: 20000  # Connection timeout on the NIO handler.
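For reference, the customizer is roughly along these lines (a simplified sketch, not our exact class; it assumes Spring Boot 1.x / Tomcat 8 package names and simply mirrors the values from the YAML above):

import org.apache.catalina.connector.Connector;
import org.apache.coyote.http11.AbstractHttp11Protocol;
import org.springframework.boot.context.embedded.tomcat.TomcatConnectorCustomizer;

public class ConnectorSettingsCustomizer implements TomcatConnectorCustomizer {

    @Override
    public void customize(Connector connector) {
        // async-timeout-ms: timeout for asynchronous requests on the connector.
        connector.setAsyncTimeout(10000);
        // accept-count: backlog of connections queued when all threads are busy.
        connector.setProperty("acceptCount", "20");
        if (connector.getProtocolHandler() instanceof AbstractHttp11Protocol) {
            AbstractHttp11Protocol<?> protocol =
                    (AbstractHttp11Protocol<?>) connector.getProtocolHandler();
            // keepAliveTimeout: how long an idle keep-alive connection is held open.
            protocol.setKeepAliveTimeout(20000);
            // connection-timeout-ms: socket timeout while waiting for the request line.
            protocol.setConnectionTimeout(20000);
        }
    }
}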
@filpano what connector settings are you using in your tomcat container? Are you using compression?
@filpano Because you are setting the url property for the route, Zuul will use SimpleHostRoutingFilter and will not use the service discovery information from Eureka. If you do not define the url, then Zuul will route the requests based on the information from Eureka.
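For example, a route defined roughly like this (a sketch; it assumes your backend is registered in Eureka under the id backend-service) would go through the RibbonRoutingFilter instead:

zuul:
  routes:
    backend-service:
      path: /backend-service/v1/uri/**
      serviceId: backend-service
      stripPrefix: true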
So, one of our developers ran across this pull request, which seems like a pretty good candidate for the sockets being left in the CLOSE_WAIT state:
https://github.com/spring-cloud/spring-cloud-netflix/pull/1372
It was recently merged, but it looks like it is targeted for Dalston.
@ryanjbaxter @spencergibb Is there any chance this specific fix can also be back ported into the next Camden release? I think this is likely the source of our CLOSE_WAIT sockets in our production environment and I would prefer to have an official fix vs. having to disable those three filters and override them with filters that include this fix.
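For reference, disabling a stock filter is just a property per filter; a sketch for the route filter, assuming the zuul.<SimpleClassName>.<filterType>.disable property is available in our version:

zuul:
  SimpleHostRoutingFilter:
    route:
      disable: true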
Thanks
I don't see why it couldn't be merged back.
@spencergibb any issues?
I think I was being cautious. Any chance you could test it out @tkvangorder before we back port?
@spencergibb @ryanjbaxter
See the attached commit from Pat. He backported the change to the 1.2.6 release. We deployed this change into our staging environment and it does indeed fix the "CLOSE_WAIT" issue. We were monitoring the server's sockets in Grafana.
We shut down the sidecar at 10:02 and brought it back up (with Pat's change) at 10:10.
We do not really want to push this out to our production environment until we have an "official" release. Also, it would be nice to have you both review the changes to make sure we are not going to be impacting performance. Specifically, we want to make sure we are not going to change the behavior such that a new connection is made each time?
Thanks
That's great! I don't see anything off in the changes.
Excellent, what is the game plan for the next Camden release? Will it be bumped up to use spring-cloud-netflix 1.2.7 (and if so, I guess we can wait for that release)? Or is there a branch that we can submit a pull request against? This issue is important to us, as it continues to impact our production environment. We are currently thinking about putting the patched version into our Artifactory server and deploying that until there is a long-term fix.
Yes, the next Camden release should use 1.2.7. You can submit a PR against the 1.2.x branch if you would like. Let me know.
@ryanjbaxter I think we just need to cherry pick the change
Created a pull request, https://github.com/spring-cloud/spring-cloud-netflix/pull/1845, for resolving this issue.
@pway99 @tkvangorder I cherry-picked the commit into 1.2.x via commit 7eb6647. You should be able to use Camden.BUILD-SNAPSHOT to test out the fix.
Thanks Ryan
We have confirmed that the snapshot is working in our staging environment. Do you have any timeframe for when you expect to release Camden.SR7?
we'll talk about it tomorrow morning.
Hi Spencer, any word on SR7? Thanks
@tkvangorder we will most likely try to get the release done next week.
I have a Zuul server that proxies all my requests to auto-discovered (via Eureka) routes. This works fine most of the time. However, I have noticed some very odd behaviour that occurs sporadically and can only partially be recreated.
After making multiple simultaneous requests, for example to the swagger-ui.html page for a given API description, which loads not only the page itself but also numerous webjars and resources, some connections end up in a CLOSE_WAIT state. The 4396 process in this case is my IDE, with which I was debugging the Zuul server. When I perform another refresh of the same browser page, many of the connections are successfully closed, though some more pop up after a while. The behaviour also happens, although less frequently, when making numerous cURL requests to any given route.
I dug around in the SimpleHostRoutingFilter, which uses a PoolingHttpClientConnectionManager, and noticed something peculiar:
- The connection time-to-live is -1, i.e. infinite.
- Connections in CLOSE_WAIT get reused for establishing new connections in line 318 of PoolingHttpClientConnectionManager (something which I find extremely odd, but I am unsure if this might be a standard Java approach).
However, there are some connections that live on eternally in a CLOSE_WAIT state that I cannot get rid of. The other end of the route does not have any open connections still lying around; it is merely the Zuul side that is not successfully closing the connections in the CLOSE_WAIT state. Eventually these connections clog up the pool and I stop getting responses from my services altogether, and I have not seen Zuul clean them up even after more than a day.
What is odd as well is that the cap seems to be 50 connections although the maxPerRoute parameter is set to 20.
Is this a known issue? Is there a known workaround? I was planning to subclass/replace the SimpleHostRoutingFilter with my own and pass it a connection pool manager configured with some TTL value to see if there could be any improvements, but I thought I should first ask whether this is a known issue, seeing how the effort required is non-trivial.
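Roughly, the kind of pool configuration I have in mind for that experiment (only a sketch; the 30-second TTL, the pool sizes and the validation interval are arbitrary values, and it relies on Apache HttpClient 4.4+ APIs):

import java.util.concurrent.TimeUnit;

import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public class PooledClientFactory {

    public static CloseableHttpClient newClient() {
        // Finite TTL so pooled connections are discarded after 30 seconds
        // instead of living forever.
        PoolingHttpClientConnectionManager manager =
                new PoolingHttpClientConnectionManager(30, TimeUnit.SECONDS);
        manager.setMaxTotal(200);
        manager.setDefaultMaxPerRoute(20);
        // Re-validate connections that were idle for more than 2 seconds before reuse.
        manager.setValidateAfterInactivity(2000);

        return HttpClients.custom()
                .setConnectionManager(manager)
                // Background eviction of expired connections.
                .evictExpiredConnections()
                .build();
    }
}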