hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Consul does not display errors for failed HTTPS `checks` #21372

Open EugenKon opened 1 week ago

EugenKon commented 1 week ago

Nomad version

Nomad v1.8.0
BuildDate 2024-05-28T17:38:17Z
Revision 28b82e4b2259fae5a62e2ed47395334bea5a24c4

Operating system and Environment details

$ uname -a
Linux ip-172-31-2-172 6.5.0-1020-aws #20~22.04.1-Ubuntu SMP Wed May  1 16:10:50 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Issue

If the service check is configured to use HTTPS without tls_server_name:

      service {
        tags = ["wi-nginx","https"]
        name = "wi-nginx"
        port = "https"

        provider = "consul"
        check {
          name            = "wi-nginx_https_health"
          type            = "http"
          protocol        = "https"
          # tls_server_name = "${var.project_name}.example.com"
          method          = "GET"
          path            = "/_.gif"
          interval        = "10s"
          timeout         = "2s"

          check_restart {
            limit = 3
            grace = "90s"
            ignore_warnings = false
          }
        }
      }

and the certificate served by the Nginx site has no SAN entry for wi-nginx.service.internal, then there is no way to see why the check fails.
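To illustrate why a certificate without a matching SAN entry is rejected, here is a minimal sketch of dNSName matching. It is a simplification written for this issue, not the code Consul or any TLS library actually runs. The function name `san_covers` and the SAN list (`myproject.example.com`, `*.example.com`) are hypothetical. It assumes a wildcard is allowed only as the entire leftmost label and matches exactly one label.

```python
def san_covers(hostname, san_dns_names):
    """Return True if any SAN dNSName entry matches hostname (simplified)."""
    host = hostname.lower().rstrip(".").split(".")
    for name in san_dns_names:
        pattern = name.lower().rstrip(".").split(".")
        # A wildcard never spans several labels, so the label counts must agree.
        if len(pattern) != len(host):
            continue
        if pattern[0] == "*":
            # Only accept wildcards that sit above at least two fixed labels.
            if len(pattern) > 2 and pattern[1:] == host[1:]:
                return True
        elif pattern == host:
            return True
    return False


# Hypothetical SAN list for the certificate served by Nginx.
cert_sans = ["myproject.example.com", "*.example.com"]

print(san_covers("wi-nginx.service.internal", cert_sans))  # False
print(san_covers("myproject.example.com", cert_sans))      # True
print(san_covers("172.31.2.172", cert_sans))               # False
```

A raw IP address such as 172.31.2.172 never matches a dNSName entry; real verifiers compare it against iPAddress SAN entries instead, which this sketch does not model.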

Reproduction steps

See the config above.

Expected Result

Nomad should log somewhere if the check fails.

Actual Result

I cannot find any error messages about the HTTPS check.


It is also strange to see the raw IP address when tls_server_name was configured.

Meanwhile, the same request actually fails when made by hand:

$ wget https://172.31.2.172:443/_.gif
Connecting to 172.31.2.172:443 (172.31.2.172:443)
081B06BD517D0000:error:0A000086:SSL routines:tls_post_process_server_certificate:certificate verify failed:ssl/statem/statem_clnt.c:1889:
ssl_client: SSL_connect
wget: error getting response: Connection reset by peer

Proposition

Nomad should distinguish between an error that prevented the check from running and a check that ran and failed. In my case the check did not even run (the server was never reached, although it functions well). From inside the nginx container I ran:

$ wget https://localhost/_.gif
Connecting to localhost ([::1]:443)
08BBE122FC7B0000:error:0A000086:SSL routines:tls_post_process_server_certificate:certificate verify failed:ssl/statem/statem_clnt.c:1889:
ssl_client: SSL_connect

This shows that the request failed, yet Nginx reports no errors in its /var/log/nginx/error.log file.
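The distinction proposed above could look roughly like the following sketch. It is not how Consul or Nomad implement health checks; it only uses the Python standard library to show the idea. The names `probe` and `classify_failure` are hypothetical. "passing" and "critical" mirror Consul's check status names, while "error" is an assumed extra category for checks that never got an HTTP response.

```python
import socket
import ssl
import urllib.error
import urllib.request


def classify_failure(exc):
    """Map an exception from an HTTPS probe to (category, detail)."""
    # HTTPError means the server answered, so the check itself ran and failed.
    if isinstance(exc, urllib.error.HTTPError):
        return ("critical", f"HTTP {exc.code}")
    # urllib wraps transport problems in URLError; unwrap the underlying cause.
    cause = exc.reason if isinstance(exc, urllib.error.URLError) else exc
    if isinstance(cause, ssl.SSLCertVerificationError):
        return ("error", f"TLS certificate verification failed: {cause}")
    if isinstance(cause, ssl.SSLError):
        return ("error", f"TLS handshake failed: {cause}")
    if isinstance(cause, socket.timeout):
        return ("error", "timed out")
    return ("error", f"connection failed: {cause}")


def probe(url, timeout=2.0):
    """Run one HTTPS check and report whether it passed, failed, or never ran."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return ("passing", f"HTTP {resp.status}")
    except OSError as exc:  # URLError and ssl.SSLError are OSError subclasses
        return classify_failure(exc)
```

With this split, a call such as `probe("https://172.31.2.172:443/_.gif")` against the setup described here would be expected to land in the "error" category with a certificate verification message, rather than being indistinguishable from an unhealthy response.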

tgross commented 1 week ago

@EugenKon I'm not really sure it'd be meaningful to report TLS errors for http health checks. But in any case this is clearly an issue with Consul. Consul is sending the health check requests, recording them in state, and reporting them in its UI. Nomad just registers the check. Moving this issue to the Consul repo.