spring-cloud / spring-cloud-netflix

Integration with Netflix OSS components
http://cloud.spring.io/spring-cloud-netflix/
Apache License 2.0

Zuul leaves some connections in CLOSE_WAIT state for further reuse, but some never get reused and stay in that state forever, blocking further requests #1785

Closed filpano closed 7 years ago

filpano commented 7 years ago

I have a Zuul server that proxies all my requests to autodiscovered (via Eureka) routes.

This works fine most of the time. However, I have noticed some very odd behaviour that occurs sporadically and can only partially be reproduced.

After making multiple simultaneous requests (for example, loading the swagger-ui.html page for a given API description, which fetches not only the page itself but also numerous webjars and resources), some connections end up in a CLOSE_WAIT state.

tcp6       1      0 host:54470 host:37612 CLOSE_WAIT  user      425593599   4396/java
tcp6       1      0 host:57724 host:37612 CLOSE_WAIT  user      426384390   4396/java
tcp6       1      0 host:59402 host:52887 CLOSE_WAIT  user      425517966   4396/java
tcp6       1      0 host:59403 host:52887 CLOSE_WAIT  user      425489000   4396/java
tcp6       1      0 host:59404 host:52887 CLOSE_WAIT  user      425518687   4396/java
tcp6       1      0 host:59405 host:52887 CLOSE_WAIT  user      425469338   4396/java
tcp6       1      0 host:59406 host:52887 CLOSE_WAIT  user      425518688   4396/java
tcp6       1      0 host:59407 host:52887 CLOSE_WAIT  user      425476214   4396/java
tcp6       1      0 host:60118 host:37612 CLOSE_WAIT  user      426773630   4396/java
tcp6       1      0 host:60154 host:37612 CLOSE_WAIT  user      426810662   4396/java
tcp6       1      0 host:60155 host:37612 CLOSE_WAIT  user      426824573   4396/java
tcp6       1      0 host:60156 host:37612 CLOSE_WAIT  user      426821100   4396/java
tcp6       1      0 host:60157 host:37612 CLOSE_WAIT  user      426825547   4396/java
tcp6       1      0 host:60158 host:37612 CLOSE_WAIT  user      426820353   4396/java
tcp6       1      0 host:60159 host:37612 CLOSE_WAIT  user      426618721   4396/java
tcp6       1      0 host:60160 host:37612 CLOSE_WAIT  user      426802727   4396/java
tcp6       1      0 host:60161 host:37612 CLOSE_WAIT  user      426825548   4396/java
tcp6       1      0 host:60162 host:37612 CLOSE_WAIT  user      426824574   4396/java
tcp6       1      0 host:60163 host:37612 CLOSE_WAIT  user      426618722   4396/java
tcp6       1      0 host:60167 host:37612 CLOSE_WAIT  user      426689993   4396/java
tcp6       1      0 host:60168 host:37612 CLOSE_WAIT  user      426618745   4396/java
tcp6       1      0 host:60169 host:37612 CLOSE_WAIT  user      426796620   4396/java
tcp6       1      0 host:60170 host:37612 CLOSE_WAIT  user      426824617   4396/java
tcp6       1      0 host:60171 host:37612 CLOSE_WAIT  user      426827273   4396/java

Process 4396 in this case is my IDE, from which I was debugging the Zuul server. When I refresh the same browser page again, many of the connections are successfully closed, though more pop up after a while. The behaviour also happens, although less frequently, when making numerous cURL requests to any given route.

I dug around in SimpleHostRoutingFilter, which uses a PoolingHttpClientConnectionManager, and noticed something peculiar.

However, some connections live on indefinitely in the CLOSE_WAIT state and I cannot get rid of them. The other end of the route does not have any open connections still lying around; it is only Zuul that is not closing the connections stuck in CLOSE_WAIT.

Eventually these connections clog up the pool and I stop getting responses from my services altogether; I have not seen Zuul clean them up even after more than a day.

What is also odd is that the cap seems to be 50 connections, although the maxPerRoute parameter is set to 20.

Is this a known issue? Is there a known workaround? I was planning to subclass/replace SimpleHostRoutingFilter with my own and pass it a connection pool manager configured with a TTL value to see if that improves things, but since the effort required is non-trivial I thought I should first ask whether this is a known issue.

ryanjbaxter commented 7 years ago

I have not heard anything like this before.

You say the problem is happening in SimpleHostRoutingFilter, so are you not using Zuul + service discovery in this app? I am curious whether the same problem would happen if you used service discovery to route requests through Zuul; in that case SimpleHostRoutingFilter would not be used.

If you do end up specifying your own connection pool manager, please let us know what you find.

spencergibb commented 7 years ago

Subclassing SimpleHostRoutingFilter is possible. We made newConnectionManager() and newClient() protected for that reason. You can also use the RibbonRoutingFilter by providing Ribbon with a static list of servers (no need to use discovery) and see if that helps.
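
For illustration, here is a minimal sketch of the kind of TTL-bounded pool and client such an override could return, assuming Apache HttpClient 4.4+; the TTL, pool sizes, and eviction intervals below are placeholder values, not the filter's defaults:

import java.util.concurrent.TimeUnit;

import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public class TtlConnectionPoolSketch {

    // A pooling connection manager whose connections expire after a TTL, so a
    // pooled connection left in CLOSE_WAIT is evicted instead of being reused forever.
    public static PoolingHttpClientConnectionManager newConnectionManager() {
        PoolingHttpClientConnectionManager cm =
                new PoolingHttpClientConnectionManager(30, TimeUnit.SECONDS); // 30s TTL (example value)
        cm.setMaxTotal(200);                 // example values only
        cm.setDefaultMaxPerRoute(20);
        cm.setValidateAfterInactivity(2000); // re-check idle connections before leasing them
        return cm;
    }

    // A client built on that manager; evictExpiredConnections()/evictIdleConnections()
    // start a background sweep thread (HttpClient 4.4+).
    public static CloseableHttpClient newClient(PoolingHttpClientConnectionManager cm) {
        return HttpClients.custom()
                .setConnectionManager(cm)
                .evictExpiredConnections()
                .evictIdleConnections(30L, TimeUnit.SECONDS)
                .build();
    }
}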

filpano commented 7 years ago

Thank you very much for the responses.

@ryanjbaxter I'm not sure I understood you. I'm using service discovery (via Eureka) to build the routes from my Zuul service to my other services. However, I can see while debugging that SimpleHostRoutingFilter is the one being used to forward the requests. Are you saying this should not be the case?

@spencergibb Thank you for the suggestion, I will give that a try as well and see if there are any changes in the behaviour.

ryanjbaxter commented 7 years ago

Could you show us your Zuul configuration?

filpano commented 7 years ago

Sure. My configuration is very simple, i.e. I only have some route forwarding configuration:

zuul:
  routes:
    backend-service:
      path: /backend-service/v1/uri/**
      stripPrefix: true
      url: http://backend-service/v1/uri

In addition, I have an auto-configuration class with the following autowired constructor:

@Autowired
public ZuulAutoConfiguration(ConfigurableListableBeanFactory beanFactory, StandardEnvironment environment) {
    this.environment = environment;
    this.resolver = beanFactory.getBeanExpressionResolver();
    this.context = new BeanExpressionContext(beanFactory, null);
    environment.getPropertySources().addFirst(eurekaZonePropertySource()); // helper method that puts the active port into the property source
}

Unfortunately, subclassing SimpleHostRoutingFilter did not have much of a positive effect. I attempted to add my own PoolingHttpClientConnectionManager with setValidateAfterInactivity(), and also tried running a background Runnable that performs connection cleanup (see: http://hc.apache.org/httpcomponents-client-4.3.x/tutorial/html/connmgmt.html#d5e405).
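
For reference, the background cleanup approach from that tutorial looks roughly like the following monitor thread; the sweep interval and idle timeout here are placeholder values:

import java.util.concurrent.TimeUnit;

import org.apache.http.conn.HttpClientConnectionManager;

// Monitor thread in the style of the HttpClient tutorial linked above: it
// periodically closes expired connections and connections idle for too long.
public class IdleConnectionMonitorThread extends Thread {

    private final HttpClientConnectionManager connectionManager;
    private volatile boolean shutdown;

    public IdleConnectionMonitorThread(HttpClientConnectionManager connectionManager) {
        this.connectionManager = connectionManager;
        setDaemon(true);
    }

    @Override
    public void run() {
        try {
            while (!shutdown) {
                synchronized (this) {
                    wait(5000); // sweep every 5 seconds
                    connectionManager.closeExpiredConnections();
                    connectionManager.closeIdleConnections(30, TimeUnit.SECONDS);
                }
            }
        }
        catch (InterruptedException ex) {
            Thread.currentThread().interrupt(); // exit on interruption
        }
    }

    public void shutdown() {
        shutdown = true;
        synchronized (this) {
            notifyAll();
        }
    }
}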

In addition, I have noticed while debugging locally that the call to future.get(timeout, tunit) in PoolingHttpClientConnectionManager:276 is the last point where a connection stuck in CLOSE_WAIT goes back to the pool as a reused connection. Afterwards, however, it gets stuck in CLOSE_WAIT all over again instead of going into FIN_WAIT and eventually being closed altogether.

Eventually the number of 'released' connections is smaller than the number of those stuck in CLOSE_WAIT during the next release cycle and no more requests can get through.

I'll keep looking for a potential cause, but right now I'm admittedly a little confused about the nature of this issue. I would very much appreciate any suggestions based on the above!

EDIT:

To be absolutely sure, I re-tried with a minimal (although not exact) replica of the Zuul + Eureka + controller setup, and I could not reproduce the CLOSE_WAIT issue. This leads me to believe that my stack has some configuration set somewhere that is influencing or exacerbating the issue. I'll continue searching...

tkvangorder commented 7 years ago

We are also having periodic outages where our systems seem to "go off the rails". We are still looking for a root cause, which may be a network disruption in our data center...but the resulting symptoms are the same:

We have a node.js web application that is talking to a spring-cloud-sidecar which is routing rest calls to our backend services.

We are seeing "Too many open files" errors in our Tomcat acceptor thread on both the backend web services AND in our sidecar logs.

We have also noticed a very large number of sockets that appear to be stuck in the CLOSE_WAIT and TIME_WAIT states.

Based on those symptoms we have been focusing on why we have so many open sockets.

One additional item: we are configuring our Tomcat connector as follows:

server:
  port: 8080
  compression:
    enabled: true
    mime-types: application/json
    min-response-size: 2048
  tomcat:
    # This is a "standard property" from spring cloud.
    max-threads: 400
    # These four additional properties were added via a TomcatConnectorCustomizer.
    accept-count: 20
    async-timeout-ms: 10000 #AsyncTimeout on the connector.
    keepAliveTimeout: 20000 # KeepAliveTimeout on the nioHandler.
    connection-timeout-ms: 20000 # Connection Timeout on the nioHandler
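
For context, here is a minimal sketch of how such connector properties can be applied through a TomcatConnectorCustomizer, assuming Spring Boot 1.x packages; the property names are standard Tomcat connector attributes and the values mirror the configuration above:

import org.apache.catalina.connector.Connector;
import org.springframework.boot.context.embedded.tomcat.TomcatConnectorCustomizer;
import org.springframework.boot.context.embedded.tomcat.TomcatEmbeddedServletContainerFactory;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class TomcatConnectorConfig {

    @Bean
    public TomcatEmbeddedServletContainerFactory tomcatFactory() {
        TomcatEmbeddedServletContainerFactory factory = new TomcatEmbeddedServletContainerFactory();
        factory.addConnectorCustomizers(new TomcatConnectorCustomizer() {
            @Override
            public void customize(Connector connector) {
                // Generic Tomcat connector attributes, matching the YAML values above.
                connector.setProperty("acceptCount", "20");
                connector.setProperty("keepAliveTimeout", "20000");
                connector.setProperty("connectionTimeout", "20000");
                connector.setAsyncTimeout(10000); // async timeout in milliseconds
            }
        });
        return factory;
    }
}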

tkvangorder commented 7 years ago

@filpano what connector settings are you using in your tomcat container? Are you using compression?

ryanjbaxter commented 7 years ago

@filpano because you are setting the url property for the route, Zuul will use SimpleHostRoutingFilter rather than the service discovery information from Eureka. If you do not define the url, then Zuul will route the requests based on the information from Eureka.

tkvangorder commented 7 years ago

So, one of our developers ran across this pull request, which seems like a pretty good candidate for the cause of the sockets being left in the CLOSE_WAIT state:

https://github.com/spring-cloud/spring-cloud-netflix/pull/1372

It was recently merged, but it looks like it is targeted for Dalston.

@ryanjbaxter @spencergibb Is there any chance this specific fix can also be backported into the next Camden release? I think this is likely the source of the CLOSE_WAIT sockets in our production environment, and I would prefer an official fix over having to disable those three filters and override them with filters that include the fix.

Thanks

ryanjbaxter commented 7 years ago

I don't see why it couldn't be merged back.

@spencergibb any issues?

spencergibb commented 7 years ago

I think I was being cautious. Any chance you could test it out @tkvangorder before we backport?

tkvangorder commented 7 years ago

@spencergibb @ryanjbaxter

See the attached commit from Pat. He backported the change to the 1.2.6 release. We deployed this change into our staging environment and it does indeed fix the CLOSE_WAIT issue. We were monitoring the server's sockets in Grafana.

We shut down the sidecar at 10:02 and brought it back up (with Pat's change) at 10:10.

(Grafana screenshots attached: close_wait and socket_stats)

We do not really want to push this out to our production environment until we have an "official" release. Also, it would be nice to have you both review the changes to make sure we are not going to impact performance. Specifically, we want to make sure the behavior will not change such that a new connection is made for each request.

Thanks

ryanjbaxter commented 7 years ago

That's great! I don't see anything off in the changes.

tkvangorder commented 7 years ago

Excellent. What is the game plan for the next Camden release? Will it be bumped up to use spring-cloud-netflix 1.2.7 (and if so, I guess we can wait for that release)? Or is there a branch we can submit a pull request against? This issue is important to us, as it continues to impact our production environment. We are currently thinking about putting the patched version into our Artifactory server and deploying that until there is a long-term fix.

ryanjbaxter commented 7 years ago

Yes, the next Camden release should use 1.2.7. You can submit a PR against the 1.2.x branch if you would like. Let me know.

spencergibb commented 7 years ago

@ryanjbaxter I think we just need to cherry pick the change

pway99 commented 7 years ago

Created a pull request https://github.com/spring-cloud/spring-cloud-netflix/pull/1845 to resolve this issue.

ryanjbaxter commented 7 years ago

@pway99 @tkvangorder I cherry-picked the commit into 1.2.x via commit 7eb6647. You should be able to use a Camden.BUILD-SNAPSHOT to test out the fix.

pway99 commented 7 years ago

Thanks Ryan

tkvangorder commented 7 years ago

We have confirmed that the snapshot is working in our staging environment. Do you have any timeframe for when you expect to release Camden.SR7?

spencergibb commented 7 years ago

we'll talk about it tomorrow morning.

tkvangorder commented 7 years ago

Hi Spencer, any word on SR7? Thanks

ryanjbaxter commented 7 years ago

@tkvangorder we will most likely try to get the release done next week.