rackerlabs / repose

The powerful, programmable, API Middleware Platform
http://www.openrepose.org/

Repose Threads are getting choked when the downstream is latent #1712

Open rgplvr opened 7 years ago

rgplvr commented 7 years ago

Hi All, we have observed that the Repose HTTP connection threads get choked while the downstreams are latent, and Repose eventually reaches a non-responsive state. Has anyone faced a similar situation? Is there any plan to move the Repose downstream connections to async I/O so that Repose won't run out of threads due to threads waiting on network I/O? PS: we have reduced the timeout in the HttpPool config, but since some services are expected to be latent, this does not solve the issue for us. Please suggest.

Regards, Raj

wdschei commented 7 years ago

@rajagopalvreghunath ,

I'm not sure I completely understand what you're having trouble with. Repose does NOT process connections to the Origin Service in an asynchronous manner. That is to say, each servicing thread waits for the return from the Origin Service call before it can process another request. If you run out of available connections in the pool, then you might see this behavior. If that is what is going on, then you can increase the size of the pool if you have enough resources available.
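For reference, the pool size Bill mentions is controlled by the `http.conn-manager.max-per-route` and `http.conn-manager.max-total` attributes of the HTTP connection pool config (the same attributes shown later in this thread); a sketch, with purely illustrative values:

```xml
<!-- Illustrative values only; size these to your own traffic and resources. -->
<pool id="default"
      default="true"
      http.conn-manager.max-total="20000"
      http.conn-manager.max-per-route="800"/>
```

Raising `max-per-route` helps when a single latent downstream route is exhausting its share of the pool; raising `max-total` helps when overall concurrency across all routes is the limit.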

Can you give us a little more information as to what you are experiencing when you observe this behavior?

Kindest regards, Bill

UPDATE: Edited to correct this response.

rgplvr commented 7 years ago

Hi Bill , We are using the Repose with the following configs ie

  1. The default pool is 400 per route, 10,000 max.
  2. The number of connections made to Repose is 146k.
  3. The origin service call volume depends heavily on the time, but it is roughly 30% per downstream, and there are multiple services.
  4. Yes, we were able to access the origin service when this issue was observed.
  5. We are using Mesos, and the health checks are getting 503s, so the containers are moved out of rotation.

Also, we are using Repose version 7.3.3.2. Does this version handle connections in an asynchronous manner, or should we upgrade to get that?

Regards, Raj

wdschei commented 7 years ago

Raj,

The per-route and max pool settings seem OK.

How many simultaneous connections are being made to any single Repose instance? Are the per-route and max limits being reached on any single Repose instance at any given time? If so, does that instance recover and begin servicing requests again?

Since you are using the containerized flavor, would I be correct in assuming that you are not doing any rate limiting with Repose? Is the Mesos healthcheck just a call through Repose to an endpoint on the origin service? When you observe the behavior with the 503 responses, is the Repose container fully up? I ask this because Repose will respond with a 503 until it is ready to process requests.

If you could send us the full sanitized configs (e.g. no passwords or other sensitive data), any logs that were captured, and a basic diagram of what/how you are testing to ReposeCore AT Rackspace.com we would be happy to take a closer look.

Kindest Regards, Bill

rgplvr commented 7 years ago

Hi Bill, the configs are mostly for the custom filters we have written, so they won't give you much info.

<pool id="default"
      default="true"
      chunked-encoding="false"
      http.conn-manager.max-total="10000"
      http.conn-manager.max-per-route="400"
      http.socket.timeout="2000"
      http.socket.buffer-size="8192"
      http.connection.timeout="2000"
      http.connection.max-line-length="8192"
      http.connection.max-header-count="100"
      http.connection.max-status-line-garbage="100"
      http.tcp.nodelay="true"
      keepalive.timeout="0"/>

This is the HttpPool config.

And about the questions

  1. We are not using any rate limiting, since these are containers.
  2. The Mesos health check fails in the situations when the threads choke, so the container is eventually taken out of rotation. The containers do not recover because of this.

Please let me know if Repose uses async I/O for downstreams, since per the code it looks as though, while it is making pooled HTTP calls, they are not async. In short, the code looks to me like the threads wait in I/O until the origin service responds to Repose; is this true? (PS: We are using 7.3.3.2.)

Regards, Raj

wdschei commented 7 years ago

Raj,

I misspoke on my original response. We do NOT process requests to the Origin Service in an Asynchronous manner. We do use Akka calls from filters when making requests to other services, but that is used mainly for caching. The servicing thread still waits for the internal request to return before continuing to process the original request. I have updated/edited my original response to reflect this.

Repose uses the model of one thread per incoming request. If more simultaneous connections are requested than there are threads in the pools, then Repose will not service new requests. This is the way JEE containers work and we have bound Repose to that model in order to make it easier for custom filters to be written. That said, there is no plan at this time to support asynchronous connections to the Origin Service.
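The thread-per-request model Bill describes can be illustrated with a minimal, self-contained sketch (not Repose's actual code): each request holds one worker thread for the full duration of a blocking "origin call", so a pool of N workers serves at most N requests at once, and request N+1 must wait.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadPerRequestSketch {
    // Simulate `requests` incoming requests against `workers` threads, where
    // each "origin call" blocks for latencyMillis; returns elapsed wall time.
    static long runRequests(int workers, int requests, long latencyMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        CountDownLatch done = new CountDownLatch(requests);
        long start = System.nanoTime();
        for (int i = 0; i < requests; i++) {
            pool.submit(() -> {
                try {
                    Thread.sleep(latencyMillis); // stand-in for blocking network I/O
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            });
        }
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        pool.shutdown();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // 4 requests, 2 threads, 100 ms "origin latency": the second pair of
        // requests must wait for a free thread, so the run takes roughly
        // two latency periods rather than one.
        System.out.println(runRequests(2, 4, 100) + " ms");
    }
}
```

This is why a latent origin service starves the pool: the threads are not doing work, they are parked in network I/O, but they are still unavailable to new requests.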

Repose was not originally designed to be used in 12 Factor environments, but we are beginning to migrate the internals to at least support it. Some of the features that are forthcoming are a health check endpoint and the ability to use a single Repose instance as the Remote Datastore for other Repose instances. This capability will allow Rate Limiting to be used even in dynamically maintained containerized environments like Mesos and OpenShift.

So to answer the original question: when the currently configured Repose instances begin to choke on too many simultaneous connections due to latency in the Origin Service, either the size of the default connection pool or the number of Repose instances needs to be increased to support the load. The route chosen here would depend on the resources available in your containerized environment. Since you are not currently relying on any of the Distributed Datastore capabilities, as they wouldn't work in a 12 Factor environment anyway, horizontally scaling the Repose layer should be fairly trivial if increasing the pools is not an option.
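A rough way to size either the pool or the instance count is Little's Law: average concurrency equals arrival rate times average latency. The numbers below are illustrative assumptions, not figures from this thread:

```java
public class PoolMath {
    // Little's Law: average connections held = arrival rate x average latency.
    static double concurrentConnections(double requestsPerSecond, double latencySeconds) {
        return requestsPerSecond * latencySeconds;
    }

    public static void main(String[] args) {
        // With the 2 s socket timeout configured above acting as a worst-case
        // latency, a hypothetical 500 req/s pins about 1000 connections at
        // once, which is well past a 400 per-route limit.
        System.out.println(concurrentConnections(500, 2.0)); // 1000.0
    }
}
```

Dividing that concurrency figure by the per-route limit gives a back-of-the-envelope minimum for how many Repose instances (or how much larger a pool) the load requires.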

Kindest Regards, Bill