hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

AuthorizeServerConn failed certificate validation #11207

Closed Lord-Y closed 2 years ago

Lord-Y commented 2 years ago

Hi team,

I am running a Consul server cluster of 3 VM nodes. My Consul version was 1.9.8, and I decided to upgrade the cluster to 1.10.3; here is what I saw in the logs:

from=10.0.104.13:57854 operation="raft RPC"
2021-10-02T13:53:24.695Z [WARN]  agent.server.raft: Election timeout reached, restarting election
2021-10-02T13:53:24.696Z [INFO]  agent.server.raft: entering candidate state: node="Node at 10.0.104.12:8300 [Candidate]" term=168
2021-10-02T13:53:24.701Z [DEBUG] agent.server.raft: votes: needed=2
2021-10-02T13:53:24.701Z [DEBUG] agent.server.raft: vote granted: from=2afb25d5-8234-e9fc-dcef-babf75741651 term=168 tally=1
2021-10-02T13:53:24.707Z [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter 24a6c35e-9bab-7bae-9a31-ae94400ac1ef 10.0.104.13:8300}" error=EOF
2021-10-02T13:53:24.708Z [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter 3840bd0b-0980-8c1e-47c7-73bfef3bb5b9 10.0.104.10:8300}" error=EOF
2021-10-02T13:53:24.803Z [WARN]  agent.server.rpc: AuthorizeServerConn failed certificate validation for certificate with a SAN.DNSName of server.euw1.consul: %!w(<nil>): from=10.0.104.10:48992 operation="raft RPC"
2021-10-02T13:53:24.913Z [WARN]  agent.server.rpc: AuthorizeServerConn failed certificate validation for certificate with a SAN.DNSName of server.euw1.consul: %!w(<nil>): from=10.0.104.13:57858 operation="raft RPC"
2021-10-02T13:53:26.045Z [WARN]  agent.server.rpc: AuthorizeServerConn failed certificate validation for certificate with a SAN.DNSName of server.euw1.consul: %!w(<nil>): from=10.0.104.13:57862 operation="raft RPC"
2021-10-02T13:53:26.083Z [WARN]  agent.server.rpc: AuthorizeServerConn failed certificate validation for certificate with a SAN.DNSName of server.euw1.consul: %!w(<nil>): from=10.0.104.10:48998 operation="raft RPC"

Before making the upgrade I shut them all down and backed up the data, just in case, as it's not a production cluster. After googling and not being able to make things work, I reverted my install from the old backup. I then decided to upgrade from 1.9.8 to 1.9.10 and unfortunately got the same errors!

Reproduction Steps

My consul config is:

{
    "acl": {
        "default_policy": "deny",
        "down_policy": "extend-cache",
        "enable_token_persistence": true,
        "enabled": true,
        "tokens": {
            "agent": "xxxx",
            "default": "xxxx",
            "master": "xxxx"
        }
    },
    "advertise_addr": "x.x.x.x",
    "auto_encrypt": {
        "allow_tls": true
    },
    "bind_addr": "0.0.0.0",
    "bootstrap_expect": 3,
    "ca_file": "/etc/consul.d/.ssl/euw1-consul-agent-ca.pem",
    "cert_file": "/etc/consul.d/.ssl/euw1-server-consul.pem",
    "client_addr": "0.0.0.0",
    "connect": {
        "enabled": true
    },
    "data_dir": "/data/consul",
    "datacenter": "euw1",
    "disable_update_check": true,
    "enable_central_service_config": true,
    "encrypt": "9a896L5GCtRFfLkyyUfLsxOjIbX4X/2b6rx9ipz74c0=",
    "key_file": "/etc/consul.d/.ssl/euw1-server-consul-key.pem",
    "log_level": "DEBUG",
    "node_name": "x.x.x.x",
    "performance": {
        "raft_multiplier": 1
    },
    "ports": {
        "grpc": 8502,
        "https": 8501
    },
    "retry_join": [
        "x.x.x.x",
        "x.x.x.x",
        "x.x.x.x"
    ],
    "server": true,
    "telemetry": {
        "prometheus_retention_time": "30s"
    },
    "ui_config": {
        "enabled": true,
        "metrics_provider": "prometheus"
    },
    "verify_incoming": false,
    "verify_outgoing": true,
    "verify_server_hostname": true
}

Consul info Server

Server info

```
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
build:
    prerelease =
    revision = 34a6478b
    version = 1.9.10
consul:
    acl = legacy
    bootstrap = false
    known_datacenters = 1
    leader = false
    leader_addr =
    server = true
raft:
    applied_index = 32774
    commit_index = 0
    fsm_pending = 0
    last_contact = never
    last_log_index = 45589
    last_log_term = 251
    last_snapshot_index = 32774
    last_snapshot_term = 116
    latest_configuration = [{Suffrage:Voter ID:2afb25d5-8234-e9fc-dcef-babf75741651 Address:10.0.104.12:8300} {Suffrage:Voter ID:24a6c35e-9bab-7bae-9a31-ae94400ac1ef Address:10.0.104.13:8300} {Suffrage:Voter ID:3840bd0b-0980-8c1e-47c7-73bfef3bb5b9 Address:10.0.104.10:8300}]
    latest_configuration_index = 0
    num_peers = 2
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Candidate
    term = 993
runtime:
    arch = amd64
    cpu_count = 1
    goroutines = 128
    max_procs = 1
    os = linux
    version = go1.15.15
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 34
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 3530
    members = 10
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 47
    members = 3
    query_queue = 0
    query_time = 1
```

Log fragments

==> Starting Consul agent...
           Version: '1.9.10'
           Node ID: '2afb25d5-8234-e9fc-dcef-babf75741651'
         Node name: 'x.x.x.x'
        Datacenter: 'euw1' (Segment: '<all>')
            Server: true (Bootstrap: false)
       Client Addr: [0.0.0.0] (HTTP: 8500, HTTPS: 8501, gRPC: 8502, DNS: 8600)
      Cluster Addr: 10.0.104.12 (LAN: 8301, WAN: 8302)
           Encrypt: Gossip: true, TLS-Outgoing: true, TLS-Incoming: false, Auto-Encrypt-TLS: true

Operating system and Environment details

Debian 10

Can someone tell me what's going on, as the cluster was down after both upgrades? I don't have any other meaningful logs to provide, but feel free to ask for more if necessary. The firewall ports opened between the 3 servers are 8300-8302 (tcp), 8501-8502 (tcp), and 8301-8302 (udp).

dnephin commented 2 years ago

Thank you for raising this issue! It sounds like the problem is related to the new validation that was added in #10925 and #10964. I just noticed there is a problem with how we capture the error in that second PR, which is why the error message is nil instead of more helpful text.

The problem is most likely the TLS certificates you are using on the other servers. It could be the DNSName is not correct and doesn't match server.euw1.consul.

How did you create the certificates for the Consul servers? I'd suggest looking at the certs you have configured on each Server (they are configured using cert_file) to make sure they all have the expected DNSName, and that they all come from the CA configured using ca_file or ca_path.
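One quick way to sanity-check that each server certificate was signed by the configured CA (paths taken from the ca_file and cert_file values in the config above; adjust to your own) could be:

# verify that the server certificate chains back to the configured CA
openssl verify -CAfile /etc/consul.d/.ssl/euw1-consul-agent-ca.pem \
  /etc/consul.d/.ssl/euw1-server-consul.pem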

Lord-Y commented 2 years ago

@dnephin Thanks for your reply. Here is how I generated my certificates last year:

ENV=dev
PROJECT=xxx-yyy
WORKLOAD=xyz

EXTERNAL_DOMAIN=consul.xyz.yyy.${ENV}.mydomain.com
consul tls ca create -days 100000 -name-constraint \
  -additional-name-constraint=${EXTERNAL_DOMAIN} \
  -additional-name-constraint=${WORKLOAD}-consul-euw1b-lin01-${ENV}.europe-west1-b.c.${PROJECT}.internal \
  -additional-name-constraint=${WORKLOAD}-consul-euw1b-lin02-${ENV}.europe-west1-b.c.${PROJECT}.internal \
  -additional-name-constraint=${WORKLOAD}-consul-euw1b-lin03-${ENV}.europe-west1-b.c.${PROJECT}.internal \
  -additional-name-constraint=${WORKLOAD}-consul-euw1b-lin04-${ENV}.europe-west1-b.c.${PROJECT}.internal \
  -additional-name-constraint=${WORKLOAD}-consul-euw1b-lin05-${ENV}.europe-west1-b.c.${PROJECT}.internal \
  -additional-name-constraint=${WORKLOAD}-consul-euw1b-lin06-${ENV}.europe-west1-b.c.${PROJECT}.internal \
  -additional-name-constraint=${WORKLOAD}-consul-euw1b-lin06-${ENV}.europe-west1-b.c.${PROJECT}.internal \
  -additional-name-constraint=${WORKLOAD}-consul-euw1c-lin01-${ENV}.europe-west1-c.c.${PROJECT}.internal \
  -additional-name-constraint=${WORKLOAD}-consul-euw1c-lin02-${ENV}.europe-west1-c.c.${PROJECT}.internal \
  -additional-name-constraint=${WORKLOAD}-consul-euw1c-lin03-${ENV}.europe-west1-c.c.${PROJECT}.internal \
  -additional-name-constraint=${WORKLOAD}-consul-euw1c-lin04-${ENV}.europe-west1-c.c.${PROJECT}.internal \
  -additional-name-constraint=${WORKLOAD}-consul-euw1c-lin05-${ENV}.europe-west1-c.c.${PROJECT}.internal \
  -additional-name-constraint=${WORKLOAD}-consul-euw1c-lin06-${ENV}.europe-west1-c.c.${PROJECT}.internal \
  -additional-name-constraint=${WORKLOAD}-consul-euw1c-lin07-${ENV}.europe-west1-c.c.${PROJECT}.internal \
  -additional-name-constraint=${WORKLOAD}-consul-euw1d-lin01-${ENV}.europe-west1-d.c.${PROJECT}.internal \
  -additional-name-constraint=${WORKLOAD}-consul-euw1d-lin02-${ENV}.europe-west1-d.c.${PROJECT}.internal \
  -additional-name-constraint=${WORKLOAD}-consul-euw1d-lin03-${ENV}.europe-west1-d.c.${PROJECT}.internal \
  -additional-name-constraint=${WORKLOAD}-consul-euw1d-lin04-${ENV}.europe-west1-d.c.${PROJECT}.internal \
  -additional-name-constraint=${WORKLOAD}-consul-euw1d-lin05-${ENV}.europe-west1-d.c.${PROJECT}.internal \
  -additional-name-constraint=${WORKLOAD}-consul-euw1d-lin06-${ENV}.europe-west1-d.c.${PROJECT}.internal \
  -additional-name-constraint=${WORKLOAD}-consul-euw1d-lin07-${ENV}.europe-west1-d.c.${PROJECT}.internal

# for 1st server
consul tls cert create -server -days 100000 -dc euw1 -additional-dnsname=${EXTERNAL_DOMAIN} -additional-dnsname=${WORKLOAD}-consul-euw1b-lin01-${ENV}.europe-west1-b.c.${PROJECT}.internal

# for 2nd server
consul tls cert create -server -days 100000 -dc euw1 -additional-dnsname=${EXTERNAL_DOMAIN} -additional-dnsname=${WORKLOAD}-consul-euw1c-lin01-${ENV}.europe-west1-c.c.${PROJECT}.internal

# for 3rd server
consul tls cert create -server -days 100000 -dc euw1 -additional-dnsname=${EXTERNAL_DOMAIN} -additional-dnsname=${WORKLOAD}-consul-euw1d-lin01-${ENV}.europe-west1-d.c.${PROJECT}.internal

Here is the content of each certificate:

# server 1
openssl x509 -in euw1-server-consul.pem -text -noout
Certificate:
    Data:
.....
.....
.....
.....
.....
            X509v3 Subject Alternative Name: 
                DNS:consul.xyz.yyy.dev.mydomain.com, DNS:xyz-consul-euw1b-lin01-dev.europe-west1-b.c.xxx-yyy.internal, DNS:server.euw1.consul, DNS:localhost, IP Address:127.0.0.1
    Signature Algorithm: ecdsa-with-SHA256
.....
.....
.....

# server 2
openssl x509 -in euw1-server-consul.pem -text -noout
Certificate:
    Data:
.....
.....
.....
.....
.....
            X509v3 Subject Alternative Name: 
                DNS:consul.xyz.yyy.dev.mydomain.com, DNS:xyz-consul-euw1c-lin01-dev.europe-west1-c.c.xxx-yyy.internal, DNS:server.euw1.consul, DNS:localhost, IP Address:127.0.0.1
    Signature Algorithm: ecdsa-with-SHA256
.....
.....
.....

# server 3
openssl x509 -in euw1-server-consul.pem -text -noout
Certificate:
    Data:
.....
.....
.....
.....
.....
            X509v3 Subject Alternative Name: 
                DNS:consul.xyz.yyy.dev.mydomain.com, DNS:xyz-consul-euw1d-lin01-dev.europe-west1-d.c.xxx-yyy.internal, DNS:server.euw1.consul, DNS:localhost, IP Address:127.0.0.1
    Signature Algorithm: ecdsa-with-SHA256
.....
.....
.....

My SAN entries include the node name, so I believe the cert verification should be able to match.

wesleyeleazar commented 2 years ago

I'm experiencing the same error with any release whose changelog includes "consider presented intermediates during server connection tls handshake". This includes 1.10.3 and 1.9.10:

2021-10-04T17:18:10.476Z [WARN]  agent.server.rpc: AuthorizeServerConn failed certificate validation for certificate with a SAN.DNSName of server.playground.plgconsul: %!w(<nil>): from=10.201.20.22:39484 operation="raft RPC"

A similar error occurs on the releases whose changelog entries mention KNOWN ISSUES. These include 1.10.2 and 1.9.9:

2021-10-04T17:14:36.318Z [WARN]  agent.server.rpc: a TLS certificate with a CommonName of server.playground.plgconsul is required for this operation: from=10.201.20.22:39140 operation="raft RPC"

In our case, the server certificates were issued directly by a (Vault) root CA, so we didn't expect any problems with intermediate CA certificates used to sign server certificates. The SAN entries served by our servers look like this:

       Subject: CN = server.playground.plgconsul

            X509v3 Subject Alternative Name: 
                DNS:server.playground.plgconsul, IP Address:127.0.0.1

I can always fall back to the last known working version, which is 1.9.8. When that binary is put back in place, raft syncs up right away and the server joins the cluster.

dnephin commented 2 years ago

Thank you for the detailed reports. It sounds like the DNSName is correct in both of these cases, but something else about the certificate seems to be causing a problem.

#11210 should fix the missing error text, which would help debug this problem. For releases 1.10.2 and 1.9.9, the error should be in the debug logs with a message of "AuthorizeServerConn failed certificate validation". Normally debug logs are turned off, but you should be able to see the message using consul monitor -log-level debug.
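For example, something along these lines should surface just those messages:

consul monitor -log-level debug | grep "AuthorizeServerConn failed certificate validation"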

If either of you is able to try with 1.10.2 or 1.9.9 again and find that log message, it should help identify the source of the problem.

wesleyeleazar commented 2 years ago

I ran consul monitor -log-level debug directly on the upgraded server (1.9.9); it's trying to start its own election:

2021-10-04T18:30:11.243Z [DEBUG] agent.http: Request finished: method=GET url=/v1/agent/services from=127.0.0.1:51366 latency=162.457µs
2021-10-04T18:30:11.243Z [DEBUG] agent.http: Request finished: method=GET url=/v1/agent/checks from=127.0.0.1:51366 latency=115.428µs
2021-10-04T18:30:13.020Z [DEBUG] agent: Check status updated: check=_nomad-check-73f7361be4d3ffd00dbe0214ebcc33b279f9baa4 status=passing
2021-10-04T18:30:14.070Z [DEBUG] agent: Check status updated: check=_nomad-check-b601bb092df00ce526aed5d4ec66fc0885116be4 status=passing
2021-10-04T18:30:14.670Z [WARN]  agent.server.raft: Election timeout reached, restarting election
2021-10-04T18:30:14.670Z [INFO]  agent.server.raft: entering candidate state: node="Node at 10.201.20.20:8300 [Candidate]" term=1845
2021-10-04T18:30:14.676Z [DEBUG] agent.server.raft: votes: needed=2
2021-10-04T18:30:14.676Z [DEBUG] agent.server.raft: vote granted: from=f2175ca6-74c0-7b0a-01ca-e1af1e89a1c0 term=1845 tally=1
2021-10-04T18:30:16.926Z [DEBUG] agent: Check status updated: check=_nomad-check-ed687cd3e1411e1db0e2a4152cae62a8baa68b80 status=passing
2021-10-04T18:30:17.476Z [WARN]  agent.server.rpc: a TLS certificate with a CommonName of server.playground.plgconsul is required for this operation: from=10.201.20.22:45488 operation="raft RPC"
2021-10-04T18:30:17.788Z [WARN]  agent.server.rpc: a TLS certificate with a CommonName of server.playground.plgconsul is required for this operation: from=10.201.20.22:45490 operation="raft RPC"
2021-10-04T18:30:20.092Z [DEBUG] agent.server: Cannot upgrade to new ACLs: leaderMode=3 mode=2 found=true leader=
2021-10-04T18:30:21.754Z [DEBUG] agent.server.memberlist.lan: memberlist: Stream connection from=10.201.20.21:46676
2021-10-04T18:30:23.026Z [DEBUG] agent: Check status updated: check=_nomad-check-73f7361be4d3ffd00dbe0214ebcc33b279f9baa4 status=passing
2021-10-04T18:30:24.070Z [DEBUG] agent: Check status updated: check=_nomad-check-b601bb092df00ce526aed5d4ec66fc0885116be4 status=passing
2021-10-04T18:30:24.524Z [WARN]  agent.server.raft: Election timeout reached, restarting election
2021-10-04T18:30:24.524Z [INFO]  agent.server.raft: entering candidate state: node="Node at 10.201.20.20:8300 [Candidate]" term=1846
2021-10-04T18:30:24.530Z [DEBUG] agent.server.raft: votes: needed=2
2021-10-04T18:30:24.530Z [DEBUG] agent.server.raft: vote granted: from=f2175ca6-74c0-7b0a-01ca-e1af1e89a1c0 term=1846 tally=1
2021-10-04T18:30:25.055Z [DEBUG] agent.server.memberlist.lan: memberlist: Stream connection from=10.201.20.45:47134
2021-10-04T18:30:25.861Z [ERROR] agent: Coordinate update error: error="No cluster leader"
2021-10-04T18:30:25.929Z [DEBUG] agent.server.memberlist.lan: memberlist: Stream connection from=10.201.20.44:38484
2021-10-04T18:30:26.926Z [DEBUG] agent: Check status updated: check=_nomad-check-ed687cd3e1411e1db0e2a4152cae62a8baa68b80 status=passing
2021-10-04T18:30:28.034Z [WARN]  agent.server.rpc: a TLS certificate with a CommonName of server.playground.plgconsul is required for this operation: from=10.201.20.22:45498 operation="raft RPC"
2021-10-04T18:30:28.278Z [WARN]  agent.server.rpc: a TLS certificate with a CommonName of server.playground.plgconsul is required for this operation: from=10.201.20.22:45500 operation="raft RPC"
2021-10-04T18:30:32.127Z [WARN]  agent.server.raft: Election timeout reached, restarting election
2021-10-04T18:30:32.127Z [INFO]  agent.server.raft: entering candidate state: node="Node at 10.201.20.20:8300 [Candidate]" term=1847
2021-10-04T18:30:32.134Z [DEBUG] agent.server.raft: votes: needed=2
2021-10-04T18:30:32.134Z [DEBUG] agent.server.raft: vote granted: from=f2175ca6-74c0-7b0a-01ca-e1af1e89a1c0 term=1847 tally=1
2021-10-04T18:30:33.032Z [DEBUG] agent: Check status updated: check=_nomad-check-73f7361be4d3ffd00dbe0214ebcc33b279f9baa4 status=passing
2021-10-04T18:30:34.071Z [DEBUG] agent: Check status updated: check=_nomad-check-b601bb092df00ce526aed5d4ec66fc0885116be4 status=passing
2021-10-04T18:30:34.644Z [DEBUG] agent.server.memberlist.wan: memberlist: Initiating push/pull sync with: nmdserver-plg-02.playground 10.201.20.21:8302
2021-10-04T18:30:34.972Z [DEBUG] agent.server.memberlist.lan: memberlist: Initiating push/pull sync with: nmdclient-naeast-a-plg-t5dn 10.201.20.45:8301
2021-10-04T18:30:36.927Z [DEBUG] agent: Check status updated: check=_nomad-check-ed687cd3e1411e1db0e2a4152cae62a8baa68b80 status=passing
2021-10-04T18:30:37.240Z [WARN]  agent.server.raft: Election timeout reached, restarting election
2021-10-04T18:30:37.240Z [INFO]  agent.server.raft: entering candidate state: node="Node at 10.201.20.20:8300 [Candidate]" term=1848
2021-10-04T18:30:37.247Z [DEBUG] agent.server.raft: votes: needed=2
2021-10-04T18:30:37.247Z [DEBUG] agent.server.raft: vote granted: from=f2175ca6-74c0-7b0a-01ca-e1af1e89a1c0 term=1848 tally=1
2021-10-04T18:30:38.280Z [WARN]  agent.server.rpc: a TLS certificate with a CommonName of server.playground.plgconsul is required for this operation: from=10.201.20.22:45518 operation="raft RPC"
2021-10-04T18:30:39.315Z [WARN]  agent.server.rpc: a TLS certificate with a CommonName of server.playground.plgconsul is required for this operation: from=10.201.20.22:45520 operation="raft RPC"
2021-10-04T18:30:40.756Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No cluster leader"
2021-10-04T18:30:41.244Z [DEBUG] agent.http: Request finished: method=GET url=/v1/agent/services from=127.0.0.1:51366 latency=153.128µs
2021-10-04T18:30:41.245Z [DEBUG] agent.http: Request finished: method=GET url=/v1/agent/checks from=127.0.0.1:51366 latency=114.584µs
2021-10-04T18:30:43.038Z [DEBUG] agent: Check status updated: check=_nomad-check-73f7361be4d3ffd00dbe0214ebcc33b279f9baa4 status=passing
2021-10-04T18:30:43.088Z [WARN]  agent.server.raft: Election timeout reached, restarting election
2021-10-04T18:30:43.088Z [INFO]  agent.server.raft: entering candidate state: node="Node at 10.201.20.20:8300 [Candidate]" term=1849
2021-10-04T18:30:43.102Z [DEBUG] agent.server.raft: votes: needed=2
2021-10-04T18:30:43.102Z [DEBUG] agent.server.raft: vote granted: from=f2175ca6-74c0-7b0a-01ca-e1af1e89a1c0 term=1849 tally=1

wesleyeleazar commented 2 years ago

Logs in the case of 1.9.10:

2021-10-04T18:42:52.291Z [WARN]  agent.server.rpc: AuthorizeServerConn failed certificate validation for certificate with a SAN.DNSName of server.playground.plgconsul: %!w(<nil>): from=10.201.20.22:46596 operation="raft RPC"
2021-10-04T18:42:52.743Z [DEBUG] agent.server: Cannot upgrade to new ACLs: leaderMode=3 mode=2 found=true leader=
2021-10-04T18:42:53.258Z [DEBUG] agent: Check status updated: check=_nomad-check-b601bb092df00ce526aed5d4ec66fc0885116be4 status=passing
2021-10-04T18:42:54.219Z [DEBUG] agent: Check status updated: check=_nomad-check-ed687cd3e1411e1db0e2a4152cae62a8baa68b80 status=passing
2021-10-04T18:42:55.121Z [DEBUG] agent.server.memberlist.lan: memberlist: Stream connection from=10.201.20.45:48014
2021-10-04T18:42:55.142Z [DEBUG] agent: Check status updated: check=_nomad-check-73f7361be4d3ffd00dbe0214ebcc33b279f9baa4 status=passing
2021-10-04T18:42:56.432Z [WARN]  agent.server.rpc: AuthorizeServerConn failed certificate validation for certificate with a SAN.DNSName of server.playground.plgconsul: %!w(<nil>): from=10.201.20.22:46600 operation="raft RPC"
2021-10-04T18:42:56.594Z [WARN]  agent.server.raft: Election timeout reached, restarting election
2021-10-04T18:42:56.594Z [INFO]  agent.server.raft: entering candidate state: node="Node at 10.201.20.20:8300 [Candidate]" term=1944
2021-10-04T18:42:56.601Z [DEBUG] agent.server.raft: votes: needed=2
2021-10-04T18:42:56.601Z [DEBUG] agent.server.raft: vote granted: from=f2175ca6-74c0-7b0a-01ca-e1af1e89a1c0 term=1944 tally=1
2021-10-04T18:42:59.401Z [DEBUG] agent.server.memberlist.lan: memberlist: Stream connection from=10.201.20.22:55750
2021-10-04T18:43:02.827Z [WARN]  agent.server.raft: Election timeout reached, restarting election
2021-10-04T18:43:02.827Z [INFO]  agent.server.raft: entering candidate state: node="Node at 10.201.20.20:8300 [Candidate]" term=1945
2021-10-04T18:43:02.834Z [DEBUG] agent.server.raft: votes: needed=2
2021-10-04T18:43:02.834Z [DEBUG] agent.server.raft: vote granted: from=f2175ca6-74c0-7b0a-01ca-e1af1e89a1c0 term=1945 tally=1
2021-10-04T18:43:03.258Z [WARN]  agent.server.rpc: AuthorizeServerConn failed certificate validation for certificate with a SAN.DNSName of server.playground.plgconsul: %!w(<nil>): from=10.201.20.22:46612 operation="raft RPC"
2021-10-04T18:43:03.258Z [DEBUG] agent: Check status updated: check=_nomad-check-b601bb092df00ce526aed5d4ec66fc0885116be4 status=passing
2021-10-04T18:43:04.219Z [DEBUG] agent: Check status updated: check=_nomad-check-ed687cd3e1411e1db0e2a4152cae62a8baa68b80 status=passing
2021-10-04T18:43:05.147Z [DEBUG] agent: Check status updated: check=_nomad-check-73f7361be4d3ffd00dbe0214ebcc33b279f9baa4 status=passing
2021-10-04T18:43:05.543Z [DEBUG] agent.server: Cannot upgrade to new ACLs: leaderMode=3 mode=2 found=true leader=
2021-10-04T18:43:06.678Z [WARN]  agent.server.rpc: AuthorizeServerConn failed certificate validation for certificate with a SAN.DNSName of server.playground.plgconsul: %!w(<nil>): from=10.201.20.22:46622 operation="raft RPC"
2021-10-04T18:43:08.191Z [WARN]  agent.server.raft: Election timeout reached, restarting election
2021-10-04T18:43:08.191Z [INFO]  agent.server.raft: entering candidate state: node="Node at 10.201.20.20:8300 [Candidate]" term=1946
2021-10-04T18:43:08.197Z [DEBUG] agent.server.raft: votes: needed=2
2021-10-04T18:43:08.197Z [DEBUG] agent.server.raft: vote granted: from=f2175ca6-74c0-7b0a-01ca-e1af1e89a1c0 term=1946 tally=1
2021-10-04T18:43:11.315Z [DEBUG] agent.http: Request finished: method=GET url=/v1/agent/services from=127.0.0.1:52288 latency=154.638µs
2021-10-04T18:43:11.316Z [DEBUG] agent.http: Request finished: method=GET url=/v1/agent/checks from=127.0.0.1:52288 latency=141.052µs
2021-10-04T18:43:12.273Z [ERROR] agent: Coordinate update error: error="No cluster leader"
2021-10-04T18:43:13.259Z [DEBUG] agent: Check status updated: check=_nomad-check-b601bb092df00ce526aed5d4ec66fc0885116be4 status=passing
2021-10-04T18:43:14.220Z [DEBUG] agent: Check status updated: check=_nomad-check-ed687cd3e1411e1db0e2a4152cae62a8baa68b80 status=passing
2021-10-04T18:43:14.371Z [WARN]  agent.server.rpc: AuthorizeServerConn failed certificate validation for certificate with a SAN.DNSName of server.playground.plgconsul: %!w(<nil>): from=10.201.20.22:46630 operation="raft RPC"
2021-10-04T18:43:15.153Z [DEBUG] agent: Check status updated: check=_nomad-check-73f7361be4d3ffd00dbe0214ebcc33b279f9baa4 status=passing
2021-10-04T18:43:15.478Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No cluster leader"
2021-10-04T18:43:16.923Z [WARN]  agent.server.rpc: AuthorizeServerConn failed certificate validation for certificate with a SAN.DNSName of server.playground.plgconsul: %!w(<nil>): from=10.201.20.22:46640 operation="raft RPC"
2021-10-04T18:43:17.660Z [WARN]  agent.server.raft: Election timeout reached, restarting election
2021-10-04T18:43:17.660Z [INFO]  agent.server.raft: entering candidate state: node="Node at 10.201.20.20:8300 [Candidate]" term=1947
2021-10-04T18:43:17.666Z [DEBUG] agent.server.raft: votes: needed=2
2021-10-04T18:43:17.666Z [DEBUG] agent.server.raft: vote granted: from=f2175ca6-74c0-7b0a-01ca-e1af1e89a1c0 term=1947 tally=1
2021-10-04T18:43:23.213Z [WARN]  agent.server.raft: Election timeout reached, restarting election
2021-10-04T18:43:23.213Z [INFO]  agent.server.raft: entering candidate state: node="Node at 10.201.20.20:8300 [Candidate]" term=1948
2021-10-04T18:43:23.222Z [DEBUG] agent.server.raft: votes: needed=2
2021-10-04T18:43:23.222Z [DEBUG] agent.server.raft: vote granted: from=f2175ca6-74c0-7b0a-01ca-e1af1e89a1c0 term=1948 tally=1
2021-10-04T18:43:23.259Z [DEBUG] agent: Check status updated: check=_nomad-check-b601bb092df00ce526aed5d4ec66fc0885116be4 status=passing
2021-10-04T18:43:24.220Z [DEBUG] agent: Check status updated: check=_nomad-check-ed687cd3e1411e1db0e2a4152cae62a8baa68b80 status=passing

gulavanir commented 2 years ago

I am facing the same issue after upgrading to 1.10.3. Consul certificates are generated using vault pki and consul-template. This setup was working on 1.9.2 but doesn't work with 1.10.3.

DNS in certificates generated: DNS:consul.service.consul, DNS:localhost, DNS:server.dev-setup.consul, IP Address:127.0.0.1, IP Address:192.168.50.91

Error in consul: agent.server.rpc: AuthorizeServerConn failed certificate validation for certificate with a SAN.DNSName of server.dev-setup.consul: %!w(<nil>): from=192.168.50.93:35059 operation="raft RPC"

dnephin commented 2 years ago

tl;dr: if my findings below are correct, setting verify_incoming_rpc=true in the agent config should resolve the problem, but we'll still want to patch this so that others don't run into it.

Thanks for the logs. Unfortunately I don't see the DEBUG message I was expecting in that output, which is strange.

Some things to check using openssl x509 -in ./server.pem -text -noout, where server.pem is the file your Consul configuration points to in the cert_file field (an example invocation follows the list):

  1. The X509v3 Subject Alternative Name should include the name from the error message (everyone seems to have confirmed this so far, but mentioning it anyway)
  2. The X509v3 Extended Key Usage should include TLS Web Client Authentication (it can also include server auth, but client auth is the one that is required because it's acting as a client in this communication)
  3. The verify_incoming, verify_incoming_rpc, and verify_server_hostname settings should be the same on all Servers.
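For example, a command along these lines (using the cert_file path from the config in the original post; substitute your own paths) shows the relevant sections at a glance:

# print the SAN and Extended Key Usage sections of the configured server cert
openssl x509 -in /etc/consul.d/.ssl/euw1-server-consul.pem -noout -text \
  | grep -A1 -E 'Subject Alternative Name|Extended Key Usage'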

Writing this list up, I just noticed in the original post that verify_incoming is set to false. For other people reporting this problem: do you also have verify_incoming set to false with verify_server_hostname set to true?

Looking at the code again, I suspect this is the problem. The authorization loops over VerifiedChains, but when verify_incoming is set to false, VerifiedChains is not populated, according to the docs for ConnectionState.VerifiedChains. So we skip the Verify check and immediately return an error.

Assuming this is the problem, we have a couple of options for fixing it. We could skip this authorization entirely if verify_incoming=false, but that would effectively ignore the verify_server_hostname option, which does not sound good.

I think as a workaround you could set verify_incoming_rpc=true in your config, and that should fix the problem for now. If you're able to try this out, please let us know if this fixes the problem for you.
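For reference, here is a minimal sketch of how the TLS-related fields from the config in the original post might look with that workaround applied (verify_incoming_rpc is the only addition; the other values are unchanged):

    "verify_incoming": false,
    "verify_incoming_rpc": true,
    "verify_outgoing": true,
    "verify_server_hostname": true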

Maybe setting verify_server_hostname should require verify_incoming or verify_incoming_rpc to be set to true; otherwise the agent should fail to start with an error?

Lord-Y commented 2 years ago

@dnephin I had to set verify_incoming to false because the Consul clients in my k8s cluster were having issues joining the servers, as explained here: https://discuss.hashicorp.com/t/consul-tls-verification-from-k8s-cluster-to-consul-server/11883/9?u=lord-y . Anyway, for this part, I'll try to enable it again later.

For the upgrade part, by adding verify_incoming_rpc=true I was able to upgrade to Consul 1.9.10 and then 1.10.3 successfully!

I'll keep you updated on the k8s part tomorrow.

wesleyeleazar commented 2 years ago

That also worked for an upgrade from 1.9.8 to 1.9.9. Setting verify_incoming_rpc = true allowed the server to join the raft cluster, as confirmed by consistent last_log_index values across the servers.

Settings on all the servers: verify_incoming = false, verify_server_hostname = true

@dnephin the three conditions were, and still are, satisfied in our setup. Also, thank you.

Lord-Y commented 2 years ago

I also followed the client-side config here https://learn.hashicorp.com/tutorials/consul/tls-encryption-secure#configure-the-clients, but @wesleyeleazar, are you not having any issues with your sidecar proxies?

Here is my Consul log:

2021-10-07T13:54:07.332Z [WARN] agent: Check socket connection failed: check=service:my-service-sidecar-proxy error="dial tcp x.x.x.x:20000: connect: connection refused"
2021-10-07T13:54:07.332Z [WARN] agent: Check is now critical: check=service:my-service-sidecar-proxy

Envoy log:

[2021-10-07 14:28:04.427][454][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] StreamAggregatedResources gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: connection termination

Lord-Y commented 2 years ago

Alright, my services on my VMs are now working. I switched from:

"ports": { "grpc": 8502, "https": 8501 }

to:

"ports": { "grpc": 8502 }

On the k8s side everything is working.

Thanks guys.

wesleyeleazar commented 2 years ago

@Lord-Y, I'll test the sidecar proxies soon; I just had to prepare the cluster for it first.

dnephin commented 2 years ago

Thank you for confirming the workaround! I'll PR a fix for the config.

wesleyeleazar commented 2 years ago

@Lord-Y: to follow up, the sidecar proxies I've tested so far have been happy and healthy. To clarify: yes indeed, they were reporting the same log messages as yours:

2021-10-07T21:38:34.175Z [WARN]  agent: Check socket connection failed: check=service:_nomad-task-5d5e6d59-0d72-039b-0fd6-d3af2bd5172c-group-cache-redis-cache-db-sidecar-proxy:1 error="dial tcp 10.201.20.215:24424: connect: connection refused"
2021-10-07T21:38:34.175Z [WARN]  agent: Check is now critical: check=service:_nomad-task-5d5e6d59-0d72-039b-0fd6-d3af2bd5172c-group-cache-redis-cache-db-sidecar-proxy:1

but after about 2 minutes the checks cleared. I'm not sure why it took that long, but I doubt it is related to this issue.