spring-cloud / spring-cloud-gateway

An API Gateway built on Spring Framework and Spring Boot providing routing and more.
http://cloud.spring.io
Apache License 2.0
4.52k stars 3.32k forks

Redis Rate Limiter works abnormally when multiple instances of gateway #1029

Closed zzyymaggie closed 5 years ago

zzyymaggie commented 5 years ago

I can reproduce issue #351. config as below:

spring:
  cloud:
    gateway:
      routes:
        - id: api-c
          uri: lb://hello-service
          predicates:
            - Path=/api-c/**
          filters:
            - name: RequestRateLimiter
              args:
                key-resolver: '#{@remoteAddrKeyResolver}'
                redis-rate-limiter.replenishRate: 8
                redis-rate-limiter.burstCapacity: 10
  redis:
    host: localhost
    port: 6379
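For context, the Redis rate limiter implements a token bucket: each key holds up to burstCapacity tokens, and replenishRate tokens are added per second. Below is a minimal in-memory sketch of those semantics in plain Java, for illustration only; the gateway's real logic lives in request_rate_limiter.lua and runs inside Redis, and all names here are made up for the example.

```java
// Minimal token-bucket sketch mirroring the gateway's rate-limiter semantics
// (replenishRate = tokens added per second, burstCapacity = bucket size).
// Plain-Java stand-in for illustration; not the actual Lua implementation.
public class TokenBucketSketch {
    final double replenishRate;    // tokens added per second
    final double burstCapacity;    // maximum tokens in the bucket
    double tokens;
    double lastRefillSeconds;

    TokenBucketSketch(double replenishRate, double burstCapacity, double nowSeconds) {
        this.replenishRate = replenishRate;
        this.burstCapacity = burstCapacity;
        this.tokens = burstCapacity;        // bucket starts full
        this.lastRefillSeconds = nowSeconds;
    }

    // Returns true if a request at time nowSeconds is allowed.
    boolean tryAcquire(double nowSeconds) {
        double elapsed = nowSeconds - lastRefillSeconds;
        tokens = Math.min(burstCapacity, tokens + elapsed * replenishRate);
        lastRefillSeconds = nowSeconds;
        if (tokens >= 1) {
            tokens -= 1;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        TokenBucketSketch bucket = new TokenBucketSketch(8, 10, 0.0);
        int allowed = 0;
        for (int i = 0; i < 30; i++) {
            if (bucket.tryAcquire(0.0)) allowed++;  // instantaneous burst of 30
        }
        System.out.println(allowed);  // 10: only the burst capacity is available
    }
}
```

With requests spread over a full second instead of an instantaneous burst, a few more would pass as tokens refill, which matters for interpreting the test results below.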

case 1:
1. I deployed one instance with server.port=1107.
2. I sent 30 requests in 1s:

for (( i = 0; i < 30; i=(i+1))); do
curl -I http://localhost:1107/api-c/
done

3. The result is as expected: RedisRateLimiter works well. Requests 11 through 30 responded 429, and the value of "X-RateLimit-Remaining" decreased from 9 to 0.

case 2:
1. I deployed two instances, with server.port=1107 and server.port=1108.
2. I sent 30 requests to each of the two instances in 1s:

for (( i = 0; i < 30; i=(i+1))); do
curl -I http://localhost:1107/api-c/
curl -I http://localhost:1108/api-c/
done

3. The result is not as expected: RedisRateLimiter does not work well. Only a few requests responded 429. The value of "X-RateLimit-Remaining" does not decrease; it stays at 8 or 9, as below:

HTTP/1.1 200 OK
transfer-encoding: chunked
X-RateLimit-Remaining: 8
X-RateLimit-Burst-Capacity: 10
X-RateLimit-Replenish-Rate: 8
Content-type: text/html;charset=utf-8
Content-Length: 0

HTTP/1.1 200 OK
transfer-encoding: chunked
X-RateLimit-Remaining: 9
X-RateLimit-Burst-Capacity: 10
X-RateLimit-Replenish-Rate: 8
Content-type: text/html;charset=utf-8
Content-Length: 0

HTTP/1.1 200 OK
transfer-encoding: chunked
X-RateLimit-Remaining: 8
X-RateLimit-Burst-Capacity: 10
X-RateLimit-Replenish-Rate: 8
Content-type: text/html;charset=utf-8
Content-Length: 0
ryanjbaxter commented 5 years ago

I am not sure that #{@remoteAddrKeyResolver} is actually unique enough when you have two instances running on the same machine. It might not include the port as you would expect.

spring-projects-issues commented 5 years ago

If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 7 days this issue will be closed.

zzyymaggie commented 5 years ago

Thanks for your reply. I'm looking forward to your research. RemoteAddrKeyResolver.java is below.

import org.springframework.cloud.gateway.filter.ratelimit.KeyResolver;
import org.springframework.web.server.ServerWebExchange;
import reactor.core.publisher.Mono;

public class RemoteAddrKeyResolver implements KeyResolver {

    public static final String BEAN_NAME = "remoteAddrKeyResolver";

    @Override
    public Mono<String> resolve(ServerWebExchange exchange) {
        // Note: getRemoteAddress() can be null (e.g. in some test setups),
        // which would throw a NullPointerException here.
        return Mono.just(exchange.getRequest().getRemoteAddress().getAddress().getHostAddress());
    }

}
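As a side note, the SpEL expression '#{@remoteAddrKeyResolver}' in the configuration above resolves a bean by name, so the resolver has to be registered under exactly that name. A typical configuration fragment (illustrative; class name is made up) would look like:

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Registers the resolver under the bean name referenced by the SpEL
// expression '#{@remoteAddrKeyResolver}' in application.yml.
@Configuration
public class RateLimiterConfig {

    @Bean(RemoteAddrKeyResolver.BEAN_NAME)
    public RemoteAddrKeyResolver remoteAddrKeyResolver() {
        return new RemoteAddrKeyResolver();
    }
}
```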

The unique key is the client's IP. All requests came from one client (for example, its IP is 1.2.3.4), and I confirmed it in Redis:

client1(1.2.3.4) ---> server1 <--> redis1
client1(1.2.3.4) ---> server2 <--> redis1

ryanjbaxter commented 5 years ago

So you have two instances of the gateway deployed, both pointing to the same Redis instance? Is there a load balancer sitting in front of the gateway instances?

zzyymaggie commented 5 years ago

So you have two instances of the gateway deployed both pointing to the same redis instance?
[zzyymaggie] Yes, the 1107 and 1108 instances.

Is there a load balancer sitting in front of the gateway instances?
[zzyymaggie] No. I also tried adding an nginx in front of the two gateway instances; the result was the same as before.

ryanjbaxter commented 5 years ago

Can you provide a complete, minimal, verifiable sample that reproduces the problem? It should be available as a GitHub (or similar) project or attached to this issue as a zip file.

zzyymaggie commented 5 years ago

I have updated the demo on my GitHub: springcloud-gateway-demo. Please refer to it. Thanks a lot.

ryanjbaxter commented 5 years ago

I would imagine there is some level of race condition here, with both gateways trying to access and update information in Redis. As one server is handling a request, another request can come into the other server at around the same time, before the first server has updated the information in Redis. Hence you might see things like this:

HTTP/1.1 404 Not Found
transfer-encoding: chunked
X-RateLimit-Remaining: 9
X-RateLimit-Burst-Capacity: 10
X-RateLimit-Replenish-Rate: 8
Content-Type: application/json;charset=UTF-8
Date: Tue, 25 Jun 2019 23:40:46 GMT
Content-Length: 0

HTTP/1.1 404 Not Found
transfer-encoding: chunked
X-RateLimit-Remaining: 9
X-RateLimit-Burst-Capacity: 10
X-RateLimit-Replenish-Rate: 8
Content-Type: application/json;charset=UTF-8
Date: Tue, 25 Jun 2019 23:40:46 GMT
Content-Length: 0
zzyymaggie commented 5 years ago

I would imagine there is some level of race condition here on both gateways trying to access and update information in redis. As one server is handling a request another can come into the other server around the same time before the first updates the information in redis, hence you might see stuff like this

Yes, I think so. Perhaps we could distinguish different clients in request_rate_limiter.lua to avoid the race condition. Looking forward to your reply.

TYsewyn commented 5 years ago

FWIW, it's indeed a race condition. The script retrieves the stored number of available requests, does some computation, and then updates the result, which is error-prone with multiple clients. Redis supports the INCR command to increment a value in the database and return the newly incremented value. The docs have some examples, and the rate limiter pattern is one of them.

I'll try to come up with a usable alternative using this command in order to have some better - not flawless - handling of multiple threads/instances.
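The INCR-based pattern from the Redis docs is a fixed-window counter: increment a per-key, per-window counter and reject once it exceeds the limit. Below is a plain-Java sketch with a HashMap standing in for Redis; the key format and the in-memory store are assumptions for illustration, not the gateway's implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Fixed-window counter in the style of the INCR rate-limiting pattern from
// the Redis docs. A HashMap stands in for Redis here; a real implementation
// would issue INCR and EXPIRE on a key like "rate:<clientIp>:<epochSecond>".
public class FixedWindowSketch {
    private final int limitPerSecond;
    private final Map<String, Integer> counters = new HashMap<>();

    FixedWindowSketch(int limitPerSecond) {
        this.limitPerSecond = limitPerSecond;
    }

    // Returns true if the request for clientKey at epochSecond is allowed.
    boolean tryAcquire(String clientKey, long epochSecond) {
        String windowKey = clientKey + ":" + epochSecond;       // one counter per second
        int count = counters.merge(windowKey, 1, Integer::sum); // Redis: INCR windowKey
        return count <= limitPerSecond;                         // Redis: EXPIRE handles cleanup
    }

    public static void main(String[] args) {
        FixedWindowSketch limiter = new FixedWindowSketch(10);
        int allowed = 0;
        for (int i = 0; i < 30; i++) {
            if (limiter.tryAcquire("1.2.3.4", 0L)) allowed++;   // 30 requests in one window
        }
        System.out.println(allowed);  // 10
    }
}
```

Since INCR is atomic on the Redis server, no read-modify-write race is possible across gateway instances; the trade-off versus the token bucket is that a fixed window allows up to 2x the limit across a window boundary.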

zzyymaggie commented 5 years ago

@TYsewyn Thanks for your research. Looking forward to your new solution.

TYsewyn commented 5 years ago

Ok, so I stand corrected. After some research and discussions with others, this is not a race condition. The Lua script is executed on the server atomically, meaning that all commands from all other connections/clients are on hold until the complete Lua script has been executed.

The problem resides in the provided bash script. It waits until you get a response from the downstream service - time the rate limiter also uses to catch up - so you're not really stress/load testing it.

I cloned your repository and ran some tests myself using Gatling (could be any other load testing tool).

The results I got with a burst capacity of 10 and a replenish rate of 8 RPS (i.e. 1 additional request every 125ms):

You also need to take latency and execution time into account. Consider the following setup: Browser --> nginx --> SCG. If you're testing this flow with the configuration above (burst of 10, replenish rate of 8 RPS) and there is 50ms between each request, the 11th call might also succeed. That's because the delay between 2 requests will be 50ms + the browser -> nginx latency + nginx execution time (e.g. load balancing) + the nginx -> SCG latency + SCG execution time (e.g. filters that might be applied before the rate limiter).

The current implementation of the rate limiter is based on the number of seconds between 2 calls, not the number of milliseconds. This also has an impact: although the difference between 2 calls might only be 70ms, if the number of seconds since the epoch changes, replenishment will kick in and you might see a different result than you would expect.
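To make the second-granularity effect concrete, here is a small plain-Java sketch (illustrative; the method is made up, not the gateway's code) of how replenishment computed from whole epoch seconds treats two calls only 70ms apart that straddle a second boundary as one full second apart:

```java
// Illustrates second-granularity replenishment: elapsed time between two
// calls is measured in whole epoch seconds, so calls 70ms apart that cross
// a second boundary are treated as one full second apart.
public class SecondGranularitySketch {

    // Tokens replenished between two calls, mirroring a limiter that keys
    // its arithmetic off epoch seconds rather than milliseconds.
    static long replenished(long previousMillis, long nowMillis, long replenishRate) {
        long elapsedSeconds = nowMillis / 1000 - previousMillis / 1000; // truncated to seconds
        return elapsedSeconds * replenishRate;
    }

    public static void main(String[] args) {
        // Two calls 70ms apart, but crossing the 1-second boundary:
        System.out.println(replenished(980, 1050, 8));  // 8 tokens added
        // Two calls 900ms apart within the same second:
        System.out.println(replenished(0, 900, 8));     // 0 tokens added
    }
}
```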

I hope this clarifies the behaviour that you've seen. If you still have questions about this, don't hesitate to reach out! :)

zzyymaggie commented 5 years ago

@TYsewyn Thanks for your detailed research and description. I now understand the root cause; I verified again and found the rate limit was correct. Thanks again.

for (( i = 0; i < 30; i=(i+1))); do
curl -I http://localhost:1107/api-c/ &
curl -I http://localhost:1108/api-c/ &
done