Vault with etcd backend storage with multiple addresses uses only the first server's ServerName for all endpoints, generating bad cert TLS error on etcd side

phanama commented 6 years ago

Environment:

Vault Version: Vault v0.10.0 ('5dd7f25f5c4b541f2da62d70075b6f82771a650d')
Etcd Version: v3.3.3
Operating System/Architecture: Ubuntu 16.04/x86

Vault Config File:

storage "etcd" {
  address = "https://etcd-0.example.com:2379,https://etcd-1.example.com:2379,https://etcd-2.example.com:2379"
  etcd_api = "v3"
  ha_enabled = "true"
  path = "vault/"
  tls_ca_file = "/etcd/rootCA.crt"
  tls_cert_file = "/etcd/client.crt"
  tls_key_file = "/etcd/client.key"  
}

listener "tcp" {
  address     = "127.0.0.1:8200"
  tls_disable = "true"
}

api_addr = "http://127.0.0.1:8200"
cluster_addr = "http://127.0.0.1:8201"

Startup Log Output:

==> Vault server configuration:

             Api Address: http://127.0.0.1:8200
                     Cgo: disabled
         Cluster Address: https://127.0.0.1:8201
              Listener 1: tcp (addr: "127.0.0.1:8200", cluster address: "127.0.0.1:8201", tls: "disabled")
               Log Level: debug
                   Mlock: supported: true, enabled: true
                 Storage: etcd (HA available)
                 Version: Vault v0.10.0
             Version Sha: 5dd7f25f5c4b541f2da62d70075b6f82771a650d

==> Vault server started! Log data will stream in below:

2018-04-13T16:15:52.200+0700 [DEBUG] storage.cache: creating LRU cache: size=0
2018-04-13T16:15:52.208+0700 [DEBUG] cluster listener addresses synthesized: cluster_addresses=[127.0.0.1:8201]

Expected Behavior:

All etcd endpoints accept the connection without throwing a TLS error.

Actual Behavior:

The last two etcd endpoints throw a TLS error stating "bad certificate" and reject the connections. I suspect this is because Vault sends requests to them using the first server's (etcd-0) server name.

Sample etcd error logs (from etcd-1; etcd-2 showed the same error):

2018-04-13 08:01:50.267438 I | embed: rejected connection from "10.10.145.64:32946" (error "remote error: tls: bad certificate", ServerName "etcd-0.example.com")
2018-04-13 08:05:30.852145 I | embed: rejected connection from "10.10.145.64:33136" (error "remote error: tls: bad certificate", ServerName "etcd-0.example.com")
2018-04-13 08:05:30.932451 I | embed: rejected connection from "10.10.145.64:33144" (error "remote error: tls: bad certificate", ServerName "etcd-0.example.com")

Steps to Reproduce:

1. Create an etcd cluster of three endpoints with client TLS enabled.
2. Configure Vault with the etcd storage backend and multiple endpoints.
3. Start up the Vault server.
4. Check the etcd logs for the error.

jefferai commented 6 years ago

@xiang90

pconcepcion commented 6 years ago

I'm having the same issue. I've tested changing the order of the servers in the address configuration option, and the servers that report the error change accordingly.

So if I use:

storage "etcd" {
  address  = "https://etcd2.example.com:2379,https://etcd1.example.com:2379,https://etcd3.example.com:2379"
...
}

I get the error (error "remote error: tls: bad certificate", ServerName "etcd2.example.com") on etcd1.example.com and etcd3.example.com, but if I use:

storage "etcd" {
  address  = "https://etcd1.example.com:2379,https://etcd2.example.com:2379,https://etcd3.example.com:2379"
...
}

I get the error (error "remote error: tls: bad certificate", ServerName "etcd1.example.com") on etcd2.example.com and etcd3.example.com.

tdwyer commented 6 years ago

I have the exact same problem. Also, the Vault servers only talk to the first etcd server in the address list and do not fail over to the other etcd servers.

tdwyer commented 6 years ago

I think the HA problem and the TLS errors are symptoms of the same error.

See my comments on the related issue...

https://github.com/hashicorp/vault/issues/4961

Setting etcd_api = "v2" in the Vault Server config solves the problem.

vfauth commented 6 years ago

Hello. I have exactly the same issue. Is a fix in the works?

tdwyer commented 5 years ago

This problem still exists in Vault 1.0.0-beta1

edganiukov commented 5 years ago

We experienced the same issue: Vault uses only the first address in the list of etcd endpoints. When I stop that etcd instance, vault status fails. And when two Vault instances have different first etcd endpoints, both instances become master. This is a huge issue for us.

raoofm commented 5 years ago

@gyuho @hexfusion @philips Is etcd team still interested/committed in maintaining this as promised here or more specifically https://github.com/hashicorp/vault/pull/2168#issuecomment-266090432 ?

Or is it now up to the community to drive it and is given up?

hexfusion commented 5 years ago

Is etcd team still interested/committed in maintaining this as promised here or more specifically #2168 (comment) ? Or is it now up to the community to drive it and is given up?

@raoofm we are not giving up on anything, but we simply don't have the manpower to cover all of these bases. As you are in the trenches here, bringing these problems to our attention is helpful. But we also need more cycles, so anything you can do to be a bridge with that would be great. This issue seems to be a misconfigured TLS SAN.

@yudiandreanp what is the output of

openssl x509 -in /etcd/client.crt -text -noout
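
For a scripted check, the SAN entries that the openssl command prints can also be read with Go's standard library. This is a minimal sketch; the function name `sanEntries` is made up for illustration, and the default path is simply the tls_cert_file value from the config at the top of this issue. The same function can be pointed at a server certificate, assuming you have it as a PEM file.

```go
package main

import (
	"crypto/x509"
	"encoding/pem"
	"errors"
	"fmt"
	"os"
)

// sanEntries (hypothetical helper) parses the first PEM block in pemBytes
// as a certificate and returns its DNS and IP subject alternative names.
func sanEntries(pemBytes []byte) ([]string, error) {
	block, _ := pem.Decode(pemBytes)
	if block == nil || block.Type != "CERTIFICATE" {
		return nil, errors.New("no PEM certificate found")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		return nil, err
	}
	names := append([]string{}, cert.DNSNames...)
	for _, ip := range cert.IPAddresses {
		names = append(names, ip.String())
	}
	return names, nil
}

func main() {
	// Default taken from the Vault config in this issue; override via argv.
	path := "/etcd/client.crt"
	if len(os.Args) > 1 {
		path = os.Args[1]
	}
	data, err := os.ReadFile(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read certificate:", err)
		return
	}
	names, err := sanEntries(data)
	if err != nil {
		fmt.Fprintln(os.Stderr, "parse certificate:", err)
		return
	}
	for _, n := range names {
		fmt.Println(n)
	}
}
```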

raoofm commented 5 years ago

@hexfusion awesome, thanks, that makes sense. I'll pitch in where I can; I just wanted to get a sense of where it's heading.

jsok commented 5 years ago

@hexfusion there's a more comprehensive description of the issue in general over at https://github.com/etcd-io/etcd/issues/9949 where I think the attention should be focused.

hexfusion commented 5 years ago

@jsok at a high level, we are working on improving the client balancer for 3.4, but basically clientv3 needs to handle this situation better in its balancer logic, so that if an endpoint is not available it goes to the next.

2018-04-13 08:01:50.267438 I | embed: rejected connection from "10.10.145.64:32946" (error "remote error: tls: bad certificate", ServerName "etcd-0.example.com")

I believe, though, that regardless of the general balancer issue, these errors are literal: they are telling you that you should have etcd-0.example.com in your TLS SAN and you do not. So I believe you have two separate issues here, and I would like to review the output of the openssl command above.

https://github.com/etcd-io/etcd/blob/ae25c5e1320f731a2ffaafbf756aca8b0a94dfab/Documentation/op-guide/security.md#notes-for-tls-authentication

jsok commented 5 years ago

I'm fairly certain the issue here is that the first endpoint that the client balancer hits determines the expected ServerName for all other endpoints, which doesn't make sense.

You shouldn't expect every peer to have every other peer's FQDN and/or IP in its SAN. Yes, they will have a common subset of SANs (e.g. for SRV discovery), but the lists will not be identical.
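
To illustrate the behaviour argued for here, the sketch below derives a separate ServerName for each endpoint from its own URL instead of reusing the first one. This is only an illustration of the idea, not how clientv3 or Vault is implemented; `perEndpointTLS` is a hypothetical helper, and the address string is the one from the config at the top of this issue.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/url"
	"strings"
)

// perEndpointTLS (hypothetical helper) splits a comma-separated address list
// and returns one TLS config per endpoint, each with ServerName set to that
// endpoint's own host rather than the first endpoint's host.
func perEndpointTLS(addresses string, base *tls.Config) (map[string]*tls.Config, error) {
	if base == nil {
		base = &tls.Config{}
	}
	out := make(map[string]*tls.Config)
	for _, ep := range strings.Split(addresses, ",") {
		ep = strings.TrimSpace(ep)
		u, err := url.Parse(ep)
		if err != nil {
			return nil, fmt.Errorf("parse %q: %w", ep, err)
		}
		if u.Hostname() == "" {
			return nil, fmt.Errorf("endpoint %q has no host", ep)
		}
		cfg := base.Clone() // shallow copy so endpoints do not share ServerName
		cfg.ServerName = u.Hostname()
		out[ep] = cfg
	}
	return out, nil
}

func main() {
	addrs := "https://etcd-0.example.com:2379,https://etcd-1.example.com:2379,https://etcd-2.example.com:2379"
	cfgs, err := perEndpointTLS(addrs, nil)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for ep, cfg := range cfgs {
		fmt.Println(ep, "->", cfg.ServerName)
	}
}
```

With a per-endpoint ServerName, each etcd server only needs its own name in its certificate's SAN list, which is the setup described above.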

prudnitskiy commented 5 years ago

Is there any news on this case? I have exactly the same problem with Vault 1.0.2 (latest from the site) and etcd 3.3.8.

I tried regenerating Vault's client certificate with all hostnames and IPs in the SAN, but had no luck with it.

jefferai commented 5 years ago

@prudnitskiy The etcd team has disappeared and I don't believe anyone in the community has worked on a fix, so no updates currently.

jsok commented 5 years ago

@prudnitskiy I've outlined the workarounds in the etcd issue. It's not pretty but does work for the time being.

hashicorp/vault #4349