jacksontj / promxy

An aggregating proxy to enable HA prometheus
MIT License

http: TLS handshake error from <promxy IP>:port: EOF #638

Open winhvu opened 4 months ago

winhvu commented 4 months ago

We have a deployment as below:


Here is the http_client configuration we pass to Promxy:

  http_client:
    dial_timeout: 10s
    tls_config:
      ca_file: <path_to_CA>
      cert_file: <path_to_cert>
      key_file: <path_to_key>
      insecure_skip_verify: false

When I send multiple PromQL queries to Promxy in parallel with a curl command like this:

seq 1 200 | xargs -n1 -P10 curl --cert tls.crt --key tls.key --cacert ca.crt "https://promxy-endpoint:9091/api/v1/query?query=up"

I get many TLS error messages of the form "http: TLS handshake error from <promxy IP>:port: EOF" in our reverse proxy's logs. This does not happen when the queries are sent sequentially.

I have decoded the certificates on both sides, client and server, and they are all valid.

Checking the Promxy logs, there are no error messages; the logs show the queries returning a successful status code (200).

I would like to know whether Promxy supports parallel queries.

I have tested with query.max-concurrency=20 and with the default, query.max-concurrency=-1; neither helps.

winhvu commented 4 months ago

When I limit the concurrency to 1 by setting query.max-concurrency=1, the TLS error messages no longer appear.

winhvu commented 4 months ago

Also, there is no issue at all if we send queries in parallel directly to the reverse proxy, bypassing the Promxy pod.

winhvu commented 4 months ago

I have tried some other setups to narrow down what could be causing the problem:

1) Use a static server group rather than the dynamic one, to check whether the issue is caused by target discovery.

2) Use one target server rather than two in the static group, to check whether there is a race condition when Promxy deals with multiple targets.

3) Increase timeout and dial_timeout, in case the defaults are so short that Promxy terminates the connection before the TLS handshake is done.

In all three cases the TLS error messages still show up in the logs.

winhvu commented 4 months ago

@jacksontj Do you have any feedback or comments on this issue? Do you think there is a race condition in Promxy?

jacksontj commented 4 months ago

First off, thanks for reaching out!

I did some initial digging, but your configuration seems incomplete (maybe just not included in the issue?). Specifically, it's missing the scheme configuration, which would make all the requests downstream from promxy use http instead of https.

So in my local testing I have promxy -> nginx (with TLS) -> demo.robustperception.io:9090

And I was able to get data working correctly and use a variation on your curl to test parallel usage:

seq 1 200 | xargs -n1 -P10 curl -k "https://localhost:8082/api/v1/query?query=up"

I have used promxy in front of HTTPS downstreams before without issue, so I don't expect you'll run into issues (other than the config, which is a bit odd because the prometheus scrape_config is a bit odd).

Hopefully that helps?

winhvu commented 4 months ago

Thanks @jacksontj for the reply.

Yes, we do have scheme in the promxy configuration:

  - job_name: 'prometheus-pods'
    # anti-affinity for merging values in timeseries between hosts in the server_group
    anti_affinity: 15s
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - testing-ns
    # configures the protocol scheme used for requests. Defaults to http
    scheme: https
    # options for promxy's HTTP client when talking to hosts in server_groups
    http_client:
      # dial_timeout controls how long promxy will wait for a connection to the downstream
      dial_timeout: 10s
      tls_config:
        ca_file: /run/secrets/trusted-root-cert/ca.crt
        cert_file: /run/secrets/prometheus-client-cert/tls.crt
        key_file: /run/secrets/prometheus-client-cert/tls.key
        insecure_skip_verify: false
    relabel_configs: []

The scheme http displayed in the log is misleading. I have enabled trace-level logging in promxy to see which scheme and which Prometheus endpoints promxy communicates with, and both are correct.

Regarding "I have used promxy in front of HTTPs downstreams before without issue":

The issue does not always show up when traffic towards promxy is low. It happens more frequently when we add more traffic, for example by running the same curl command above from multiple terminals (I ran it on three terminals in parallel).

winhvu commented 4 months ago

Hi @jacksontj

Have you had a chance to reproduce the issue the way I described above?