
Connect sidecar proxy upstream config connect_timeout_ms does not work #11603

Open tmiroslav opened 2 years ago

tmiroslav commented 2 years ago


It is failing to set the connection timeout in the upstream config per the docs here

I am facing timeouts every time the upstream service that my service connects to does not respond within 5s. This is too low for my use case and I want to extend this timeout to 10s. But after setting connect_timeout_ms in the upstream config, my service still times out after 5s. So my change to the upstream proxy config is not being applied.

Reproduction Steps

This is my service definition file:

service {
  name = "cpanel"
  port = 4000
  connect {
    sidecar_service {
      proxy {
        upstreams = [
          {
            destination_name = "gateway-internal-api"
            local_bind_port  = 10000
            config {
              connect_timeout_ms = 10000
            }
          },
        ]
      }
    }
  }
}

Consul info for both Client and Server

Consul v1.8.3, Envoy 1.14.2

Client info:

agent:
    check_monitors = 0
    check_ttls = 0
    checks = 3
    services = 2
build:
    prerelease = 
    revision = a9322b9c
    version = 1.8.3
consul:
    acl = enabled
    known_servers = 3
    server = false
runtime:
    arch = amd64
    cpu_count = 8
    goroutines = 102
    max_procs = 8
    os = linux
    version = go1.14.7
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 57
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 16982
    members = 53
    query_queue = 0
    query_time = 1

Server info:

agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
build:
    prerelease = 
    revision = a9322b9c
    version = 1.8.3
consul:
    acl = enabled
    bootstrap = false
    known_datacenters = 1
    leader = false
    leader_addr = 10.228.45.65:8300
    server = true
raft:
    applied_index = 21010459
    commit_index = 21010459
    fsm_pending = 0
    last_contact = 4.249427ms
    last_log_index = 21010459
    last_log_term = 1728
    last_snapshot_index = 20997099
    last_snapshot_term = 1728
    latest_configuration = [{Suffrage:Voter ID:1aceb41a-9309-720d-0548-703bf300a940 Address:10.228.44.218:8300} {Suffrage:Voter ID:918add50-a22b-a418-82ae-2d8d2fe5465e Address:10.228.45.65:8300} {Suffrage:Voter ID:89847217-65f3-ed14-f1a8-244c44996eb8 Address:10.228.46.88:8300}]
    latest_configuration_index = 0
    num_peers = 2
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Follower
    term = 1728
runtime:
    arch = amd64
    cpu_count = 4
    goroutines = 479
    max_procs = 4
    os = linux
    version = go1.14.7
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 57
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 16982
    members = 53
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 135
    members = 3
    query_queue = 0
    query_time = 1

Operating system and Environment details

$ more /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

Amier3 commented 2 years ago

Hey @tmiroslav ,

I believe that for the timeout to work, it has to be set both at the proxy level (local_connect_timeout_ms in the gateway-internal-api service definition) and in the upstream config (what you have set above) -- I suspect that may be the issue. If you have already set that value in the gateway-internal-api service definition and it's still not working, could you provide that file so we can look into this further?
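To make the suggestion concrete, a minimal sketch of where that proxy-level setting would live (the service name and port are copied from this thread; this is illustrative of the placement, not a verified fix):

 service {
   name = "gateway-internal-api"
   port = 8787
   connect {
     sidecar_service {
       proxy {
         config {
           # Time allowed for the sidecar to connect to the local app,
           # in milliseconds (illustrative value)
           local_connect_timeout_ms = 10000
         }
       }
     }
   }
 }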

tmiroslav commented 2 years ago

Hi @Amier3

Thank you! I am going to test this and will get back to you as soon as I have results.

BR, Miroslav

tmiroslav commented 2 years ago

Hi @Amier3

It's no better after adding local_connect_timeout_ms to the gateway-internal-api service definition. I am running Consul in VMs. I have service cpanel talking to service gateway-internal-api. I already pasted the cpanel service definition above. This is gateway-internal-api, after I added the proxy config parameter you suggested:

service {
  name = "gateway-internal-api"
  port = 8787
  connect {
    sidecar_service {
      proxy {
        config {
          local_connect_timeout_ms = 10000
        }
        upstreams = [
          {
            destination_name = "exhibitor"
            local_bind_port  = 10025
          },
          {
            destination_name = "redis"
            local_bind_port  = 10060
          },
        ]
      }
    }
  }
}

Still, I am getting log entries like: user_id=92a70b0f-a3da-43a9-a176-f95b475117 client_ip=77.46.205.166 operation_name=getPoolStatus [error] GET http://localhost:10000/api/v1/servers/ -> error: :timeout (5001.237 ms)

There I can see that it still times out after 5s! Is this a bug, or am I still missing something in the configuration?

Thank you!

tmiroslav commented 2 years ago

Hi @Amier3

Any advice on what I should do next? Should I maybe upgrade Consul/Envoy to make this work? We are facing this issue in production, and it's really urgent to find a way past the 5s timeouts!

Thanks! Miroslav

Amier3 commented 2 years ago

Hey @tmiroslav

Apologies for the delayed response! After looking into this issue a bit more, I realized that I'd need to pull in some of the engineering team to help figure out how to fix this and whether an upgrade is required. With the holidays quickly approaching, it was hard to find the bandwidth to dig deep into this in December.

Are you still experiencing this issue and did you end up upgrading to try to fix it?

Amier3 commented 2 years ago

@tmiroslav

Also, it'd help a lot if you could provide us with an Envoy config dump using curl localhost:19000/config_dump
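As a quick sanity check on whether the timeout override actually reached Envoy, a sketch of filtering that dump for each cluster's effective connect_timeout (this assumes jq is installed; the field names reflect the v2 admin API served by Envoy 1.14 and may differ on newer versions):

 # Show each dynamic cluster's name and its effective connect_timeout.
 curl -s localhost:19000/config_dump \
   | jq '.configs[]
         | select(."@type" | test("ClustersConfigDump"))
         | .dynamic_active_clusters[].cluster
         | {name, connect_timeout}'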

chrisboulton commented 2 years ago

@tmiroslav maybe have a look at my comment here: https://github.com/hashicorp/consul/issues/6382#issuecomment-758318964 - I suspect the section on Upstream Request Timeouts is what you're running into, as it's not something that has been addressed yet. I proposed a couple of options in that issue (including one you can use today with a service-router, and another which disables the upstream timeouts entirely, which we currently do with a custom build of Consul).
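For context on the service-router option mentioned above, a minimal sketch of a routing config entry that raises the per-request timeout for the upstream (the service name is taken from this thread; a service-router requires the destination service's protocol to be set to http, e.g. via a service-defaults entry, so treat this as illustrative rather than a verified fix):

 Kind = "service-router"
 Name = "gateway-internal-api"
 Routes = [
   {
     Match {
       HTTP {
         PathPrefix = "/"
       }
     }
     Destination {
       # Illustrative value: allow upstream requests up to 10s before timing out
       RequestTimeout = "10s"
     }
   }
 ]

This can be applied with consul config write <file> once the protocol prerequisite is in place.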