knative-extensions / net-kourier

Purpose-built Knative Ingress implementation using just Envoy with no additional CRDs
Apache License 2.0

Envoy circuit breakers hit with HTTP/2 #839

Closed: mbrancato closed this issue 1 week ago

mbrancato commented 2 years ago

I use a custom client to send requests to Knative services through a Kourier gateway. It has a slow-start algorithm that gradually scales up concurrent requests until a configured limit is reached, or until error rates get too high. Under normal conditions with HTTP/1.1, this can easily scale to 10k concurrent requests with just a few Kourier pods (although HTTP/1.1 is more fragile since it can hit socket limits). However, if I switch that same config to HTTP/2, once the concurrency hits approximately 1000 (maybe slightly lower), Kourier / Envoy responds with:

HTTP/2.0 503 Service Unavailable
Content-Length: 81
Content-Type: text/plain
Date: Tue, 10 May 2022 01:58:16 GMT
Server: envoy
X-Envoy-Overloaded: true

upstream connect error or disconnect/reset before headers. reset reason: overflow

If I reduce the per-client concurrency but increase the number of clients sending requests, the same result happens once the total concurrent requests go above 1000. This might depend on the number of upstream service pods; I was testing with just 2.

Is there a way to tune this setting in the Kourier config? Ideally, if requests per socket connection need to be limited, Kourier would spawn more upstream connections once that limit is reached.

Edit: to clarify, the upstream Service is not changing here, only how the client connects. The upstream service port is defined as a named port (h2c).
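The ramp-up described above can be approximated with a sketch along the following lines (hypothetical and simplified, not the actual client; the URL, step size, and ceiling are placeholders): it raises the number of in-flight requests step by step and counts the 503 overflow responses shown above.

    // Hypothetical repro sketch, not the actual client: ramp up concurrent
    // requests against a Knative service behind Kourier and count the 503
    // "overflow" responses shown above. URL and limits are placeholders.
    package main

    import (
        "fmt"
        "net/http"
        "sync"
    )

    func main() {
        const target = "https://myapp.foo.example.com/" // placeholder URL
        client := &http.Client{} // HTTP/2 is negotiated via ALPN over TLS

        for concurrency := 100; concurrency <= 2000; concurrency += 100 {
            var mu sync.Mutex
            overloaded := 0
            var wg sync.WaitGroup
            for i := 0; i < concurrency; i++ {
                wg.Add(1)
                go func() {
                    defer wg.Done()
                    resp, err := client.Get(target)
                    if err != nil {
                        return
                    }
                    defer resp.Body.Close()
                    if resp.StatusCode == http.StatusServiceUnavailable {
                        mu.Lock()
                        overloaded++
                        mu.Unlock()
                    }
                }()
            }
            wg.Wait()
            fmt.Printf("concurrency=%d overloaded=%d\n", concurrency, overloaded)
        }
    }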

mbrancato commented 2 years ago

As a follow-up, this mainly shows up as increments of the upstream_rq_pending_overflow counter.

foo/myapp-00006.upstream_rq_pending_overflow: 17345

There are also some upstream_cx_overflow and upstream_rq_pending_overflow values > 0 on HTTP/1.1 services, but they are fairly small.

The dynamic clusters are getting default circuit breaker values.

% curl -s localhost:9000/clusters | grep myapp
foo/myapp-00006::observability_name::foo/myapp-00006
foo/myapp-00006::default_priority::max_connections::1024
foo/myapp-00006::default_priority::max_pending_requests::1024
foo/myapp-00006::default_priority::max_requests::1024
foo/myapp-00006::default_priority::max_retries::3
foo/myapp-00006::high_priority::max_connections::1024
foo/myapp-00006::high_priority::max_pending_requests::1024
foo/myapp-00006::high_priority::max_requests::1024
foo/myapp-00006::high_priority::max_retries::3
foo/myapp-00006::added_via_api::true
...

And looking at where new clusters appear to be created via the API, there is no circuit breaker config: https://github.com/knative-sandbox/net-kourier/blob/969ae748c33768049c72b10ae9fb0cd0235fe6ff/pkg/envoy/api/cluster.go#L39-L52

As my testing has shown, requests from multiple clients still trigger this overflow, so I can't simply create some transport pool on the client side to spread the load. I do think each individual Kourier instance should be able to handle more than 1024 concurrent requests to a service.
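For reference, the kind of client-side transport pool mentioned above would look roughly like the sketch below (hypothetical names, and it assumes the gateway is reached over TLS so HTTP/2 is negotiated via ALPN). It spreads streams across several TCP connections, but because Envoy counts these circuit breaker thresholds per upstream cluster rather than per downstream connection, it does not avoid the overflow.

    // Hypothetical sketch of a client-side transport pool: each pooled client
    // owns its own Transport and therefore its own connection pool, so
    // concurrent HTTP/2 streams are spread over several TCP connections.
    package main

    import (
        "net/http"
        "sync/atomic"
    )

    type transportPool struct {
        clients []*http.Client
        next    atomic.Uint64
    }

    func newTransportPool(size int) *transportPool {
        p := &transportPool{}
        for i := 0; i < size; i++ {
            p.clients = append(p.clients, &http.Client{
                // ForceAttemptHTTP2 documents the intent; with no custom
                // dialers or TLS config it is enabled by default anyway.
                Transport: &http.Transport{ForceAttemptHTTP2: true},
            })
        }
        return p
    }

    // Do round-robins requests across the pooled clients.
    func (p *transportPool) Do(req *http.Request) (*http.Response, error) {
        c := p.clients[p.next.Add(1)%uint64(len(p.clients))]
        return c.Do(req)
    }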

github-actions[bot] commented 2 years ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

mbrancato commented 1 year ago

/remove-lifecycle stale

mbrancato commented 1 year ago

/reopen

knative-prow[bot] commented 1 year ago

@mbrancato: Reopened this issue.

In response to [this](https://github.com/knative-sandbox/net-kourier/issues/839#issuecomment-1256361482):

> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

mbrancato commented 1 year ago

I'm currently working around this with the following hack, but since it is not configurable it requires a custom build on my part:

        // Workaround applied where the dynamic cluster is built (pkg/envoy/api/cluster.go).
        // envoyclusterv3 / envoycorev3 are the go-control-plane cluster/v3 and core/v3
        // config packages, and wrappers is the protobuf wrappers package.
        // HTTP/2 expanded max defaults: keep max_connections at Envoy's default of 1024,
        // but raise max_pending_requests and max_requests to 1024 * 1024 for both
        // priorities so a single Kourier pod can carry more than 1024 concurrent streams.
        cluster.CircuitBreakers = &envoyclusterv3.CircuitBreakers{
            Thresholds: []*envoyclusterv3.CircuitBreakers_Thresholds{
                {
                    Priority:           envoycorev3.RoutingPriority_DEFAULT,
                    MaxConnections:     &wrappers.UInt32Value{Value: 1024},
                    MaxPendingRequests: &wrappers.UInt32Value{Value: 1024 * 1024},
                    MaxRequests:        &wrappers.UInt32Value{Value: 1024 * 1024},
                    MaxRetries:         nil,
                    RetryBudget:        nil,
                    TrackRemaining:     false,
                    MaxConnectionPools: nil,
                },
                {
                    Priority:           envoycorev3.RoutingPriority_HIGH,
                    MaxConnections:     &wrappers.UInt32Value{Value: 1024},
                    MaxPendingRequests: &wrappers.UInt32Value{Value: 1024 * 1024},
                    MaxRequests:        &wrappers.UInt32Value{Value: 1024 * 1024},
                    MaxRetries:         nil,
                    RetryBudget:        nil,
                    TrackRemaining:     false,
                    MaxConnectionPools: nil,
                },
            },
        }
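The thresholds above keep max_connections at Envoy's default but raise max_pending_requests and max_requests to 1048576, the per-cluster limits that concurrent HTTP/2 streams count against. To avoid the custom build, the same values could in principle come from configuration; the sketch below assumes a hypothetical KOURIER_CLUSTER_MAX_REQUESTS environment variable, which is not an existing Kourier setting.

    // Hypothetical sketch only: derive the request thresholds from an assumed
    // KOURIER_CLUSTER_MAX_REQUESTS environment variable instead of hard-coding
    // them. os and strconv are standard library; wrappers is the same protobuf
    // wrappers alias used above. The thresholds in cluster.CircuitBreakers
    // would then be built with clusterMaxRequests() instead of literals.
    func clusterMaxRequests() *wrappers.UInt32Value {
        if v := os.Getenv("KOURIER_CLUSTER_MAX_REQUESTS"); v != "" {
            if n, err := strconv.ParseUint(v, 10, 32); err == nil {
                return &wrappers.UInt32Value{Value: uint32(n)}
            }
        }
        // Fall back to Envoy's default threshold.
        return &wrappers.UInt32Value{Value: 1024}
    }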

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

knative-prow-robot commented 1 year ago

This issue or pull request is stale because it has been open for 90 days with no activity.

/lifecycle stale

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

mbrancato commented 1 year ago

/remove-lifecycle stale

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

mbrancato commented 11 months ago

/reopen

knative-prow[bot] commented 11 months ago

@mbrancato: Reopened this issue.

In response to [this](https://github.com/knative-extensions/net-kourier/issues/839#issuecomment-1752164776):

> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

github-actions[bot] commented 8 months ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

mbrancato commented 8 months ago

/reopen

mbrancato commented 8 months ago

/remove-lifecycle stale

github-actions[bot] commented 5 months ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

mbrancato commented 4 months ago

/remove-lifecycle stale

github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.