bdecoste opened this issue 1 year ago
Maybe an option would be to actually configure the timeout value.
What is the current impact of this behavior?
A small percentage of requests hit rate-limiting failures. This has caused the user to set denyOnFail: false
so that these failed requests are allowed.
What environments was this experienced in?
3 tiers of production environments
What changes (new config, new services, different type of request, different frequency in request) caused this to occur and what data was gathered when triaging it in their env?
The errors started with an upgrade to GE 1.11, which is when we added the Golang Context (i.e. timeout) logic to rate-limiting.
Do we have sample steps to reproduce this using their configuration that can be included?
No, the user's prod config involves thousands of CRs and a significant number of clients, services, etc.
Have you been able to recreate this?
I will work on a test case to recreate this, but I believe that a high enough volume of rate-limited traffic on its own will trigger these timeouts.
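For reference, a minimal, hypothetical setup for such a test case (not the user's actual configuration) could be a single generic_key limit attached to a test route and then driven with high-concurrency traffic, so that every request goes through the Redis pipeline:

```yaml
# Hypothetical repro config, assuming the set-style RateLimitConfig API;
# names and namespaces are illustrative and would be adjusted for the test environment.
apiVersion: ratelimit.solo.io/v1alpha1
kind: RateLimitConfig
metadata:
  name: repro-limit
  namespace: gloo-system
spec:
  raw:
    descriptors:
    - key: generic_key
      value: repro-counter
      rateLimit:
        requestsPerUnit: 10   # deliberately low so most of the load is rate-limited
        unit: SECOND
    rateLimits:
    - actions:
      - genericKey:
          descriptorValue: repro-counter
```

Referencing something like this from a test route and sending concurrent load should exercise the same ShouldRateLimit -> Redis pipeline path that produces the context canceled errors.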
The timeout that you're referring to (which is causing the golang context deadline): is your expectation that this is a timeout occurring between the proxy and the rate limiting service?
Do we have any metrics from the 1.6 installation around how long rate limiting decisions took? And can we get the comparative metrics from the 1.11 installation?
What is the desired behavior for the small number of requests that are currently timing out? Is it just to ensure that, even though they take longer, they successfully complete?
The error seems to indicate that the golang context deadline is expiring because the request to redis is taking too long.
The user isn't running 1.6 anymore and has 1.13 in all of their envs, but the throughput has certainly increased in that time.
Yes, to your third question. That was the behavior they saw in 1.6.
So in terms of timeouts, there are a few at play here, and most of them, I believe, can be configured. I'll admit that the discoverability of these could be improved, so I'll clarify them here:
First, the request timeout between the proxy and the rate-limit service. This is configured at the HTTP filter level and can be defined globally, in the Settings resource (https://docs.solo.io/gloo-edge/latest/reference/api/github.com/solo-io/gloo/projects/gloo/api/v1/enterprise/options/ratelimit/ratelimit.proto.sk/#settings), or on an individual Gateway. Either way, there is a requestTimeout field which can be configured. The default timeout for Gloo Edge is 100ms (https://github.com/solo-io/gloo/blob/main/projects/gloo/pkg/plugins/ratelimit/plugin.go#L45).
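As a sketch (field names per the linked Settings proto; worth verifying against the installed Gloo Edge version), raising this timeout globally would look roughly like:

```yaml
apiVersion: gloo.solo.io/v1
kind: Settings
metadata:
  name: default
  namespace: gloo-system
spec:
  ratelimitServer:
    ratelimitServerRef:
      name: rate-limit
      namespace: gloo-system
    requestTimeout: "0.5s"   # i.e. 500ms, raised from the 100ms default
    denyOnFail: false        # fail open: allow requests when the rate-limit call errors or times out
```

The same options can also be set on an individual Gateway instead of globally, as noted above.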
Second, the Redis connection pool. It has a number of parameters which can be configured via environment variables (https://github.com/solo-io/rate-limiter/blob/master/pkg/cache/redis/driver_impl.go#L36), and these default to 10 seconds if not configured.
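For illustration only (the variable names below are placeholders, not the actual ones; those are defined in the linked driver_impl.go), such tuning would be applied as environment variables on the rate-limit container:

```yaml
# Illustrative fragment of the rate-limit Deployment's container spec.
# NOTE: the env var names below are placeholders; the real names are defined in
# the linked driver_impl.go and default to 10 seconds when unset.
containers:
- name: rate-limit
  env:
  - name: REDIS_DIAL_TIMEOUT    # placeholder name
    value: "5s"
  - name: REDIS_READ_TIMEOUT    # placeholder name
    value: "5s"
  - name: REDIS_WRITE_TIMEOUT   # placeholder name
    value: "5s"
```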
The most notable change in our RateLimitService was using the go-redis client internally instead of relying on the radix client. This change was made because go-redis is more actively maintained and offers a wider set of functionality (e.g., TLS support).
There is a chance that this internal library shift caused a latency impact on the request path, but we have a set of benchmark tests designed to catch that. Another potential reason for this behavioral change is https://github.com/mediocregopher/radix/issues/201: it seems that the request timeout could be silently ignored with the previous library. It could just be that we are now correctly hitting timeouts (and surfacing them as errors), whereas previously requests failed due to the same timeout but the error wasn't presented as visibly in the RateLimitService.
Given that modifying the failure mode configuration resolved the behavior, it may be worth investigating how frequently that case is hit. There are metrics exported by the filter (https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/rate_limit_filter#statistics), and one of them is a failure_mode_allowed counter.
This issue has been marked as stale because of no activity in the last 180 days. It will be closed in the next 180 days unless it is tagged "no stalebot" or other activity occurs.
Gloo Edge Version
1.11.x
Kubernetes Version
No response
Describe the bug
With 1.11 we see the rate-limit server error with Golang context canceled and the sharding sidecar error with remote close. This did not occur with 1.6. Example rate-limit server log entry:

{"level":"error","ts":1668692482.5380838,"logger":"rate-limiter","msg":"finished unary call with code Unknown","version":"undefined","grpc.start_time":"2022-11-17T13:41:22Z","grpc.request.deadline":"2022-11-17T13:41:22Z","system":"grpc","span.kind":"server","grpc.service":"envoy.service.ratelimit.v3.RateLimitService","grpc.method":"ShouldRateLimit","error":"Could not execute Redis pipeline: context canceled","errorVerbose":"Could not execute Redis pipeline\n\tgrpc.(*Server).handleStream:/go/pkg/mod/google.golang.org/grpc@v1.46.0/server.go:1620\n\tgrpc.(*Server).processUnaryRPC:/go/pkg/mod/google.golang.org/grpc@v1.46.0/server.go:1283\n\tv3._RateLimitService_ShouldRateLimit_Handler:/go/pkg/mod/github.com/envoyproxy/go-control-plane@v0.10.2-0.20220325020618-49ff273808a1/envoy/service/ratelimit/v3/rls.pb.go:987\n\tgo-grpc-middleware.ChainUnaryServer.func1:/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:34\n\tgo-grpc-middleware.ChainUnaryServer.func1.1.1:/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\n\tzap.UnaryServerInterceptor.func1:/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/logging/zap/server_interceptors.go:31\n\tgo-grpc-middleware.ChainUnaryServer.func1.1.1:/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\n\thealthchecker.GrpcUnaryServerHealthCheckerInterceptor.func1:/go/pkg/mod/github.com/solo-io/go-utils@v0.21.26/healthchecker/grpc.go:69\n\tv3._RateLimitService_ShouldRateLimit_Handler.func1:/go/pkg/mod/github.com/envoyproxy/go-control-plane@v0.10.2-0.20220325020618-49ff273808a1/envoy/service/ratelimit/v3/rls.pb.go:985\n\tservice.(*service).ShouldRateLimit:/go/pkg/mod/github.com/solo-io/rate-limiter@v0.6.2/pkg/service/ratelimit.go:210\n\tservice.(*service).shouldRateLimitWorker:/go/pkg/mod/github.com/solo-io/rate-limiter@v0.6.2/pkg/service/ratelimit.go:169\n\tredis.(*rateLimitCacheImpl).DoLimit:/go/pkg/mod/github.com/solo-io/rate-limiter@v0.6.2/pkg/cache/redis/cache_impl.go:134\n\tredis.pipelineFetch:/go/pkg/mod/github.com/solo-io/rate-limiter@v0.6.2/pkg/cache/redis/cache_impl.go:46\n\tredis.(*connectionImpl).Pop:/go/pkg/mod/github.com/solo-io/rate-limiter@v0.6.2/pkg/cache/redis/driver_impl.go:134\ncontext canceled","grpc.code":"Unknown","grpc.time_ms":137.706,"stacktrace":"github.com/grpc-ecosystem/go-grpc-middleware/logging/zap.UnaryServerInterceptor.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/logging/zap/server_interceptors.go:39\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:34\ngithub.com/envoyproxy/go-control-plane/envoy/service/ratelimit/v3._RateLimitService_ShouldRateLimit_Handler\n\t/go/pkg/mod/github.com/envoyproxy/go-control-plane@v0.10.2-0.20220325020618-49ff273808a1/envoy/service/ratelimit/v3/rls.pb.go:987\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.46.0/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.46.0/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.46.0/server.go:922"}
Steps to reproduce the bug
Send concurrent load to rate-limit server
Expected Behavior
No errors, as in 1.6, with the same number of rate-limit and Redis replicas
Additional Context
Suspected cause: https://github.com/solo-io/rate-limiter/pull/113/files#diff-5027b485ec5ecd2570e59688d3afb5266ab3c697c6ce931cda8c4467cc03e7b1R7