solo-io / gloo

The Cloud-Native API Gateway and AI Gateway
https://docs.solo.io/
Apache License 2.0

Rate-Limiting errors with Golang context canceled and sharding sidecar errors with remote close #7445

Open bdecoste opened 1 year ago

bdecoste commented 1 year ago

Gloo Edge Version

1.11.x

Kubernetes Version

No response

Describe the bug

With 1.11 we see rate-limit server errors with Golang context canceled and sharding sidecar errors with remote close. This did not occur with 1.6.

{"level":"error","ts":1668692482.5380838,"logger":"rate-limiter","msg":"finished unary call with code Unknown","version":"undefined","grpc.start_time":"2022-11-17T13:41:22Z","grpc.request.deadline":"2022-11-17T13:41:22Z","system":"grpc","span.kind":"server","grpc.service":"envoy.service.ratelimit.v3.RateLimitService","grpc.method":"ShouldRateLimit","error":"Could not execute Redis pipeline: context canceled","errorVerbose":"Could not execute Redis pipeline\n\tgrpc.(*Server).handleStream:/go/pkg/mod/google.golang.org/grpc@v1.46.0/server.go:1620\n\tgrpc.(*Server).processUnaryRPC:/go/pkg/mod/google.golang.org/grpc@v1.46.0/server.go:1283\n\tv3._RateLimitService_ShouldRateLimit_Handler:/go/pkg/mod/github.com/envoyproxy/go-control-plane@v0.10.2-0.20220325020618-49ff273808a1/envoy/service/ratelimit/v3/rls.pb.go:987\n\tgo-grpc-middleware.ChainUnaryServer.func1:/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:34\n\tgo-grpc-middleware.ChainUnaryServer.func1.1.1:/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\n\tzap.UnaryServerInterceptor.func1:/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/logging/zap/server_interceptors.go:31\n\tgo-grpc-middleware.ChainUnaryServer.func1.1.1:/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\n\thealthchecker.GrpcUnaryServerHealthCheckerInterceptor.func1:/go/pkg/mod/github.com/solo-io/go-utils@v0.21.26/healthchecker/grpc.go:69\n\tv3._RateLimitService_ShouldRateLimit_Handler.func1:/go/pkg/mod/github.com/envoyproxy/go-control-plane@v0.10.2-0.20220325020618-49ff273808a1/envoy/service/ratelimit/v3/rls.pb.go:985\n\tservice.(*service).ShouldRateLimit:/go/pkg/mod/github.com/solo-io/rate-limiter@v0.6.2/pkg/service/ratelimit.go:210\n\tservice.(*service).shouldRateLimitWorker:/go/pkg/mod/github.com/solo-io/rate-limiter@v0.6.2/pkg/service/ratelimit.go:169\n\tredis.(*rateLimitCacheImpl).DoLimit:/go/pkg/mod/github.com/solo-io/rate-limiter@v0.6.2/pkg/cache/redis/cache_impl.go:134\n\tredis.pipelineFetch:/go/pkg/mod/github.com/solo-io/rate-limiter@v0.6.2/pkg/cache/redis/cache_impl.go:46\n\tredis.(*connectionImpl).Pop:/go/pkg/mod/github.com/solo-io/rate-limiter@v0.6.2/pkg/cache/redis/driver_impl.go:134\ncontext canceled","grpc.code":"Unknown","grpc.time_ms":137.706,"stacktrace":"github.com/grpc-ecosystem/go-grpc-middleware/logging/zap.UnaryServerInterceptor.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/logging/zap/server_interceptors.go:39\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:34\ngithub.com/envoyproxy/go-control-plane/envoy/service/ratelimit/v3._RateLimitService_ShouldRateLimit_Handler\n\t/go/pkg/mod/github.com/envoyproxy/go-control-plane@v0.10.2-0.20220325020618-49ff273808a1/envoy/service/ratelimit/v3/rls.pb.go:987\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.46.0/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.46.0/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.46.0/server.go:922"}

[2022-11-17 13:40:50.910][15][debug][connection] [external/envoy/source/common/network/connection_impl.cc:648] [C1095] remote close
[2022-11-17 13:40:50.910][15][debug][connection] [external/envoy/source/common/network/connection_impl.cc:249] [C1095] closing socket: 0
[2022-11-17 13:40:50.910][15][debug][conn_handler] [external/envoy/source/server/active_stream_listener_base.cc:120] [C1095] adding to cleanup list

Steps to reproduce the bug

Send concurrent load to rate-limit server

Expected Behavior

No errors as in 1.6 with the same number of rate-limit and redis replicas

Additional Context

Suspected cause: https://github.com/solo-io/rate-limiter/pull/113/files#diff-5027b485ec5ecd2570e59688d3afb5266ab3c697c6ce931cda8c4467cc03e7b1R7

SantoDE commented 1 year ago

Maybe an option would be to actually configure the timeout value.

bdecoste commented 1 year ago

What is the current impact of this behavior?

Small % of rate-limiting failures. This has caused the user to set denyOnFail: false so that these failed requests are allowed.

What environments was this experienced in?

3 tiers of production environments

What changes (new config, new services, different type of request, different frequency in request) caused this to occur and what data was gathered when triaging it in their env?

The errors started with an upgrade to GE 1.11, which is when we added the Golang Context (i.e. timeout) logic to rate-limiting.

Do we have sample steps to reproduce this using their configuration that can be included?

No, the user's prod config involves thousands of CRs and a significant number of clients, services, etc.

Have you been able to recreate this?

I'll work on a test case to recreate this, but I believe that just a high volume of traffic being rate-limited will trigger these timeouts.
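
As a starting point for that test case, here is a minimal load-generation sketch (not the user's setup): the target URL, worker count, and per-worker request count are placeholders, and the route is assumed to already have a rate limit applied so that concurrent traffic exercises the rate-limit service.

package main

import (
	"fmt"
	"net/http"
	"sync"
	"sync/atomic"
)

func main() {
	const (
		target    = "http://gateway-proxy.example.com/rate-limited-route" // placeholder URL
		workers   = 200                                                   // placeholder concurrency
		perWorker = 100                                                   // placeholder request count per worker
	)

	var ok, limited, failed int64
	var wg sync.WaitGroup

	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				resp, err := http.Get(target)
				if err != nil {
					atomic.AddInt64(&failed, 1)
					continue
				}
				switch {
				case resp.StatusCode == http.StatusTooManyRequests:
					atomic.AddInt64(&limited, 1) // expected rate-limit denials
				case resp.StatusCode >= 500:
					// With the default deny-on-failure behavior, rate-limit service
					// errors/timeouts surface to the client as 5xx responses.
					atomic.AddInt64(&failed, 1)
				default:
					atomic.AddInt64(&ok, 1)
				}
				resp.Body.Close()
			}
		}()
	}
	wg.Wait()

	fmt.Printf("ok=%d rate-limited=%d failed=%d\n", ok, limited, failed)
}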

sam-heilbron commented 1 year ago

Regarding the timeout that you're referring to (which is causing the golang context deadline): is your expectation that it is a timeout occurring between the proxy and the rate limiting service?

Do we have any metrics from the 1.6 installation around how long rate limiting decisions took? And then can we have the comparative metrics from the 1.11 installation?

What is the desired behavior for the small number of requests that are currently timing out? Is it just to ensure that, even though they take longer to complete, they successfully complete?

bdecoste commented 1 year ago

The error seems to indicate that the golang context deadline is expiring because the request to redis is taking too long.
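
For illustration, a minimal sketch (not the rate-limiter's actual code) of how this error can surface, assuming a go-redis v8 client (the client the rate limiter uses as of these versions, per the discussion below) and a Redis endpoint at localhost:6379: if the per-request context is canceled, for example because the caller's timeout fired and the gRPC stream was torn down, or its deadline expires before the pipelined commands finish, Exec returns the context error, which the rate limiter wraps as "Could not execute Redis pipeline".

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/go-redis/redis/v8"
)

func main() {
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	defer rdb.Close()

	// Simulate the caller giving up ~100ms into the request.
	ctx, cancel := context.WithCancel(context.Background())
	time.AfterFunc(100*time.Millisecond, cancel)

	pipe := rdb.Pipeline()
	incr := pipe.IncrBy(ctx, "ratelimit:example-key", 1) // hypothetical key
	pipe.Expire(ctx, "ratelimit:example-key", time.Minute)

	if _, err := pipe.Exec(ctx); err != nil {
		// Under load, or with a slow/remote Redis, this prints "context canceled".
		fmt.Println("Could not execute Redis pipeline:", err)
		return
	}
	fmt.Println("current hits:", incr.Val())
}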

The user isn't running 1.6 anymore and has 1.13 in all of their envs, but the throughput has certainly increased in that time.

Yes, to your third question. That was the behavior they saw in 1.6.

sam-heilbron commented 1 year ago

So in terms of timeouts, there are a few at play here, and most of them, I believe, can be configured. I'll admit that the discoverability of these could be improved, so I'll clarify them here:

Timeouts

Envoy -> Rate Limit Service

This is configured at the HTTP filter level. This configuration can be defined globally, in the Settings resource (https://docs.solo.io/gloo-edge/latest/reference/api/github.com/solo-io/gloo/projects/gloo/api/v1/enterprise/options/ratelimit/ratelimit.proto.sk/#settings), or on an individual Gateway. Either way, there is a requestTimeout which can be configured. The default timeout for Gloo Edge is 100ms (https://github.com/solo-io/gloo/blob/main/projects/gloo/pkg/plugins/ratelimit/plugin.go#L45).

Rate Limit Service -> Redis

The Redis connection pool has a number of parameters which can be configured via environment variables (https://github.com/solo-io/rate-limiter/blob/master/pkg/cache/redis/driver_impl.go#L36) and default to 10 seconds if not configured.
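
For illustration, a sketch of the kind of pool parameters being referred to, expressed directly as go-redis v8 options. In the rate-limiter these values come from the environment variables in the linked driver_impl.go (the variable names and exact defaults live there); the address and the 10-second values below are placeholders mirroring the defaults mentioned above.

package main

import (
	"time"

	"github.com/go-redis/redis/v8"
)

func main() {
	// Placeholder address; pool values mirror the ~10s defaults discussed above.
	rdb := redis.NewClient(&redis.Options{
		Addr:         "redis.gloo-system.svc.cluster.local:6379",
		PoolSize:     10,               // maximum connections in the pool
		PoolTimeout:  10 * time.Second, // how long to wait for a free pooled connection
		DialTimeout:  10 * time.Second, // connection establishment timeout
		ReadTimeout:  10 * time.Second, // per-command read timeout
		WriteTimeout: 10 * time.Second, // per-command write timeout
	})
	defer rdb.Close()
}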

1.6 -> 1.11 Changes

The most notable change in our RateLimitService was switching to the go-redis client internally, instead of relying on the radix client. This change was made because go-redis is more actively maintained and offers a wider set of functionality (e.g. TLS support).

There is a chance that this internal library shift caused a latency impact on the request path, but we have a set of benchmark tests designed to catch that. Another potential reason for this behavioral change is https://github.com/mediocregopher/radix/issues/201: it seems that the request timeout could be ignored by the previous library. It could just be that we are now correctly hitting timeouts (and surfacing them as errors), whereas previously requests failed due to the same timeout but the error wasn't presented as visibly in the RateLimitService.

Observability

Given that modifying the failure mode configuration resolved the behavior, it may be worth investigating how frequently that case is hit. There are metrics exported by the filter (https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/rate_limit_filter#statistics), and one of them is the failure_mode_allowed counter.
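
As a hedged helper sketch for watching that counter: the snippet below assumes the proxy's Envoy admin interface has been port-forwarded to localhost (Gloo Edge's gateway-proxy admin port is typically 19000; verify for your install) and simply filters the plain-text /stats output for rate-limit counters such as failure_mode_allowed.

package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// Assumes: kubectl port-forward to the gateway-proxy admin port beforehand.
	resp, err := http.Get("http://localhost:19000/stats")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.Contains(line, "ratelimit.") {
			fmt.Println(line) // e.g. cluster.<name>.ratelimit.failure_mode_allowed: <count>
		}
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
}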

github-actions[bot] commented 5 months ago

This issue has been marked as stale because of no activity in the last 180 days. It will be closed in the next 180 days unless it is tagged "no stalebot" or other activity occurs.