Open skyrocknroll opened 5 years ago
Hey there, We wanted to check in on this request since it has been inactive for at least 60 days. If you think this is still an important issue in the latest version of Consul or its documentation please reply with a comment here which will cause it to stay open for investigation. If there is still no activity on this issue for 30 more days, we will go ahead and close it.
Feel free to check out the community forum as well! Thank you!
Hi,
we set this timeout using a service router:
Kind = "service-router"
Name = "myservice"
Routes = [
{
Destination {
Service = ""
RetryOnConnectFailure = true
RequestTimeout = "120s"
}
}
]
which can then be seeing in the envoy configuration:
"cluster": "myservice.default.dc1.internal.1af17f86-a6e7-28b2-3176-5a22dfc90678.consul",
"timeout": "120s",
"retry_policy": {
"retry_on": "connect-failure"
}
}
However, envoy sets this header:
x-envoy-expected-rq-timeout-ms: 15000
(see https://www.envoyproxy.io/docs/envoy/v1.11.1/configuration/http_filters/router_filter.html?highlight=timeout#x-envoy-expected-rq-timeout-ms)
which seems to cause the upstream envoy to give up after 15 seconds.
We couldn't find a way to increase the value of this header ( or disable it ) via consul which would leave us with the configured timeout in the service router. As a result we sporadically get a 504 Gateway Timeout for some long running requests.
At this point we are not sure if there is way to manage that header and we missed it or if there is still no way to manage it, in which case it would be a very helpful feature.
We are using consul 1.6.1 with envoy 1.11.1
If you could shed some light on this that would help us a lot Thanks!
I have the same issue and i think this is because envoy proxy on the local app dont set a specific timeout.
Your envoy proxy in front of the target app set the default timeout for all requests it sends to his local app.
To avoid this problem, the only solution i found was to override the public listener with envoy_public_listener_json
config key
Hi @rrondeau , many thanks for the hint. Would it be possible to get an example snippet of the configuration you used? Thanks!
i faced the same issue. I end up to build consul from source code with my fix. hope you found it useful
diff --git a/agent/xds/listeners.go b/agent/xds/listeners.go
index b44cd996c..67805ec29 100644
--- a/agent/xds/listeners.go
+++ b/agent/xds/listeners.go
@@ -9,6 +9,7 @@ import (
"regexp"
"strconv"
"strings"
+ "time"
envoy "github.com/envoyproxy/go-control-plane/envoy/api/v2"
envoyauth "github.com/envoyproxy/go-control-plane/envoy/api/v2/auth"
@@ -727,6 +728,7 @@ func makeHTTPFilter(
ClusterSpecifier: &envoyroute.RouteAction_Cluster{
Cluster: cluster,
},
+ Timeout: addrTime(5*time.Minute),
},
},
}
@@ -844,3 +846,7 @@ func makeCommonTLSContext(cfgSnap *proxycfg.ConfigSnapshot) *envoyauth.CommonTls
},
}
}
+
+func addrTime(t time.Duration) *time.Duration {
+ return &t
+}
diff --git a/agent/xds/routes.go b/agent/xds/routes.go
index 520545bdc..9cadaa2b1 100644
--- a/agent/xds/routes.go
+++ b/agent/xds/routes.go
@@ -4,6 +4,7 @@ import (
"errors"
"fmt"
"strings"
+ "time"
"github.com/gogo/protobuf/proto"
@@ -317,6 +318,7 @@ func makeRouteActionForSingleCluster(targetID string, chain *structs.CompiledDis
ClusterSpecifier: &envoyroute.RouteAction_Cluster{
Cluster: clusterName,
},
+ Timeout: addrTime(5*time.Minute),
},
}
}
@@ -353,6 +355,7 @@ func makeRouteActionForSplitter(splits []*structs.DiscoverySplit, chain *structs
TotalWeight: makeUint32Value(10000), // scaled up 100%
},
},
+ Timeout: addrTime(5*time.Minute),
},
}, nil
}
@msuarezd, I believe the issue you're describing is related to https://github.com/envoyproxy/envoy/issues/7358 which was fixed in Envoy version 1.12.0 with https://github.com/envoyproxy/envoy/pull/8051. If that's correct, the correct fix would be for Consul to also allow configuring respect_expected_rq_timeout in the envoy.router
HTTP filter config on the destination sidecars.
Has there been any progress on this issue? I'm extremely surprised to find there's no way to easily change request timeouts when using Consul Connect, this seems pretty critical?
Is setting RequestTimeout in a Service Resolver configuration the expected way to configure this? And it just doesn't work because of the respect_expected_rq_timeout issue mentioned above?
I'm not super familiar with Envoy itself and the documentation around the Envoy escape hatch configuration is hard to follow. Are these options the expected way to configure a request timeout? If so are there any examples of how to do this? Particularly when using the consul-k8s injector.
Wew, this has been a bit of a rabbit hole but for anyone else who runs into this issue...
The quickest fix is to change the destination service protocol to tcp
, this obviously means you can't use any of the Connect L7 features but at least your application level timeouts will work.
edit: this does seem to have changed all the upstream listeners to TCP proxies even though only 1 of the services is TCP and the rest are HTTP, a different bug maybe?
Setting a service-router
configuration with a RequestTimeout
should work.
It injects a timeout
configuration into the envoy route configuration as @msuarezd pointed out.
This had no effect for me, couldn't find a timeout parameter in the config dump on either envoy instance.
Reasonably sure this is me doing something wrong but I didn't dig too hard given the respect_expected_rq_timeout
issue means this doesn't work properly.
I had a look at enabling respect_expected_rq_timeout
in the envoy.router
config but that's not an easy fix either.
The go-control-plane
dependency doesn't support that configuration until v0.9.1
but v0.9.0
is a major breaking change and looks like it would need quite a few changes in the Consul codebase to support, or some hacking around to force that parameter.
Hi,
Same problem here, 15 seconds is too short for some of our workload.
Moreover, we are using Connect L7 feature so we can not set endpoint to tcp
.
I am really interested by @rrondeau solution. Did you have any example ?
I suppose that you override the 'listener_filters_timeout' value ? Is it possible to let all the other keys unset ?
Yours faithfully, LCDP
Today I struggled into this issue and these are my conclusions at this moment with Consul 1.9.1 and Envoy 1.16.0:
The RequestTimeout
at service-router config is not enough (is required so other complimentary workarounds works).
Then I tried the change suggested in
@msuarezd, I believe the issue you're describing is related to envoyproxy/envoy#7358 which was fixed in Envoy version 1.12.0 with envoyproxy/envoy#8051. If that's correct, the correct fix would be for Consul to also allow configuring respect_expected_rq_timeout in the
envoy.router
HTTP filter config on the destination sidecars.
So I recompiled Consul with:
HttpFilters: []*envoyhttp.HttpFilter{
{
Name: "envoy.router",
ConfigType: &envoyhttp.HttpFilter_Config{Config: &pbstruct.Struct{
Fields: map[string]*pbstruct.Value{
"respect_expected_rq_timeout": {
Kind: &pbstruct.Value_BoolValue{
BoolValue: true,
},
},
},
}},
},
},
so all envoy.router entries now have the respect_expected_rq_timeout
set to true
(they appear when dumping envoy config) but it stills timing out at 15s even when the service-router has the timeout at 120s
i faced the same issue. I end up to build consul from source code with my fix. hope you found it useful
diff --git a/agent/xds/listeners.go b/agent/xds/listeners.go index b44cd996c..67805ec29 100644 --- a/agent/xds/listeners.go +++ b/agent/xds/listeners.go @@ -9,6 +9,7 @@ import ( "regexp" "strconv" "strings" + "time" envoy "github.com/envoyproxy/go-control-plane/envoy/api/v2" envoyauth "github.com/envoyproxy/go-control-plane/envoy/api/v2/auth" @@ -727,6 +728,7 @@ func makeHTTPFilter( ClusterSpecifier: &envoyroute.RouteAction_Cluster{ Cluster: cluster, }, + Timeout: addrTime(5*time.Minute), }, }, } @@ -844,3 +846,7 @@ func makeCommonTLSContext(cfgSnap *proxycfg.ConfigSnapshot) *envoyauth.CommonTls }, } } + +func addrTime(t time.Duration) *time.Duration { + return &t +}
diff --git a/agent/xds/routes.go b/agent/xds/routes.go index 520545bdc..9cadaa2b1 100644 --- a/agent/xds/routes.go +++ b/agent/xds/routes.go @@ -4,6 +4,7 @@ import ( "errors" "fmt" "strings" + "time" "github.com/gogo/protobuf/proto" @@ -317,6 +318,7 @@ func makeRouteActionForSingleCluster(targetID string, chain *structs.CompiledDis ClusterSpecifier: &envoyroute.RouteAction_Cluster{ Cluster: clusterName, }, + Timeout: addrTime(5*time.Minute), }, } } @@ -353,6 +355,7 @@ func makeRouteActionForSplitter(splits []*structs.DiscoverySplit, chain *structs TotalWeight: makeUint32Value(10000), // scaled up 100% }, }, + Timeout: addrTime(5*time.Minute), }, }, nil }
but changed to work with latest consul changes:
diff --git a/agent/xds/listeners.go b/agent/xds/listeners.go
index dcf8f76d2..c7986e564 100644
--- a/agent/xds/listeners.go
+++ b/agent/xds/listeners.go
@@ -24,6 +24,7 @@ import (
"github.com/golang/protobuf/proto"
pbtypes "github.com/golang/protobuf/ptypes"
"github.com/golang/protobuf/ptypes/any"
+ "github.com/golang/protobuf/ptypes/duration"
pbstruct "github.com/golang/protobuf/ptypes/struct"
"github.com/golang/protobuf/ptypes/wrappers"
"github.com/hashicorp/consul/agent/connect"
@@ -1213,6 +1214,7 @@ func makeHTTPFilter(opts listenerFilterOpts) (*envoylistener.Filter, error) {
ClusterSpecifier: &envoyroute.RouteAction_Cluster{
Cluster: opts.cluster,
},
+ Timeout: &duration.Duration{Seconds: 120},
},
},
}
the problem is that it hardcodes a value for all the services, imposible to configure for each upstream service...
I suspect the way to go would be using the respect_expected_rq_timeout
but I couldn't make it work
Hey folks - we ran into this today too. I spent some time on it this afternoon.
As y'all have discovered, there's two places the Envoy route request_timeout
is tripping:
Locally, I have the changes which add local_request_timeout_ms
to proxy configurations, similar to how local_connect_timeout_ms
can be configured for the local app/public listener. In theory this also will let the same configuration be set at the proxy-defaults
level. Both 0s
(timeouts disabled, for streaming services), and the original behaviour (unspecified, will result in Envoy using its default) are supported. I'll push the branch once I've done a little more testing, and open a pull request.
Update: PR is here - https://github.com/hashicorp/consul/pull/9554
After addressing that, I sat down to see what we could do about the timeout on the other end, which is configurable today only via a service-router
configuration entry. I started to build in support for request_timeout_ms
along side connect_timeout_ms
in proxy upstreams because we don't have a service-router defined for each service, but then noticed this in the docs:
Note: The connection timeout for a service should ideally be configured via the connect_timeout field of a service-resolver config entry for the upstream destination service. from: https://www.consul.io/docs/connect/proxies/envoy#connect_timeout_ms
I'm not sure of HashiCorp's motivation behind that move or change is, but my assumption is that it's probably to centralise the configuration of connection timeouts, and elevate them "above" service upstream definitions, which left me wondering what to do about an upstream request_timeout
setting. The way I see it, there are a couple of options:
service-router ServiceRouteDestination.RequestTimeout
The work is already done to allow the timeout to be configurable here, and I believe the intent behind putting it in a service resolver is to allow per HTTP path customisations of timeouts, etc. The downside is that this would mean that any time you wanted to change the default Envoy timeout value of 15s, you would need to define a service-router
for your service.
request_timeout_ms
for Proxy UpstreamsPer the note from the docs above, and I could do with some guidance or directional advice from our friends at HashiCorp here, this doesn't seem to be the suggested place for configuring this type of thing going forward.
If my misunderstanding about this is incorrect, and a configuration under proxy upstreams for the request timeout is the right place, then great!
Maybe an option here is to explicitly set request_timeout = 0
for the upstream route configuration: you can already override it to set an explicit request timeout using a service-router, and with local_app's request timeout now being configurable, is there any downside to this?
I was about to pick this up next, but I have the same questions: where would you expect to configure this? Are there any places that you would reasonably expect this option to be enabled? Do we not care as much if we're able to tweak the timeouts above?
Appreciate thoughts and feedback! 🙂
For those that are interested, the pull request which adds support for local_request_timeout_ms
can be found here: https://github.com/hashicorp/consul/pull/9554 (updated in the above comment)
If you're after a complete change set that also disables the request timeout on the egress Envoy instances/upstream routes completely (but still allows overrides via service-router entries) because you want to control timeouts at the local_app
layer, then we have a custom fork of v1.9.1 we're currently testing which you can find here: https://github.com/bigcommerce/consul/compare/v1.9.1...v1.9.1-bc -- from limited testing we've done so far, this seems to be working fine.
I'd very much love feedback on my earlier comment for how we can get this functionality into Consul Connect, as I'd consider it critical for a lot of production deployments.
Hi @chrisboulton I think this could fit in per-upstream configuration, and the new UpstreamConfig in service-defaults: https://www.consul.io/docs/connect/config-entries/service-defaults#upstreamconfig
Regarding the note about connect timeout, there are two points there. First is that yes we generally want to move away from per-instance configuration. However, if the discovery chain isn't being configured, I can see the benefit of allowing this flag to be set elsewhere. Now that we have centralized upstream config in service-defaults it wouldn't be exclusively in individual proxy registrations.
An issue that concerns me is the proliferation of places where one setting can be configured, and then it becoming unclear which one takes precedence.
Do nothing, rely on
service-router ServiceRouteDestination.RequestTimeout
https://docs.github.com/github/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntaxThe work is already done to allow the timeout to be configurable here, and I believe the intent behind putting it in a service resolver is to allow per HTTP path customisations of timeouts, etc. The downside is that this would mean that any time you wanted to change the default Envoy timeout value of 15s, you would need to define a
service-router
for your service.
@chrisboulton I tried this solution but that doesn't work with Transparent proxy mode enabled. Is there a workaround while using Transparent proxy mode?
Feature Description
Provide a way to configure upstream listener to configure timeout.
Use Case(s)
https://www.envoyproxy.io/docs/envoy/latest/api-v2/api/v2/route/route.proto#route-routeaction The envoy default value is 15s second which is too high or too low depends on the use case.