hashicorp / terraform-aws-vault

A Terraform Module for how to run Vault on AWS using Terraform and Packer
Apache License 2.0

Vault service isn't registered in consul. UI not available via vault.service.consul #228

Closed queglay closed 3 years ago

queglay commented 3 years ago

I've been having trouble accessing the UI via https://vault.service.consul/ui in a private subnet.

I may be wrong, but I believe the HA examples and the private subnet example do not register the Vault service with Consul. So unless you are using an ELB, you won't have the vault.service.consul FQDN available to reach the Vault UI.

I am very new to Vault, but I'd imagine this would be problematic for others trying to use a redirect for a private cluster, OIDC, or use the dnsmasq scripts.

I tried to search for where this configuration block for the service registration would be specified in the repo, but couldn't find it:

service_registration "consul"

https://www.vaultproject.io/docs/configuration/service-registration/consul

Is this being specified anywhere or should it be by default? Or is there anywhere in the repo that the vault service is getting registered with consul and I'm missing it?

I also opened a thread on HashiCorp Discuss yesterday, before I realised this might extend to other functions in this repository as well: https://discuss.hashicorp.com/t/why-might-a-consul-client-not-be-able-to-access-vault-ui-at-https-vault-service-consul-ui/17660

Thanks if anyone can provide any clues!

queglay commented 3 years ago

Although I probably shouldn't do it this way, I tested appending a service_registration stanza to the config generated by the run-vault script, and so far so good: the dig command returns an answer, and the web browser resolves the address. I'm not sure whether more is required to eliminate the HTTPS certificate warnings, though; I'd like to figure that out.

    vault_storage_backend=$(cat <<EOF
$consul_storage_type "consul" {
  address = "127.0.0.1:8500"
  path    = "vault/"
  scheme  = "http"
  service = "vault"
}
# HA settings
cluster_addr  = "https://$instance_ip_address:$cluster_port"
api_addr      = "$api_addr"

service_registration "consul" {
  address = "127.0.0.1:8500"
  service = "vault"
  scheme  = "http"
}
EOF
)
brikis98 commented 3 years ago

This is probably the same cause as https://github.com/hashicorp/terraform-aws-vault/issues/223.

brikis98 commented 3 years ago

IIRC, using Consul as a backend and specifying a service name should result in Vault being registered with Consul: https://github.com/hashicorp/terraform-aws-vault/blob/master/modules/run-vault/run-vault#L323. But perhaps some behavior changed to break that?

brikis98 commented 3 years ago

The service registration docs even say:

When Consul is configured as the storage backend, Vault implicitly uses Consul for service registration, so the service_registration stanza is not needed.

So there must be some other issue going on...

brikis98 commented 3 years ago

If someone has time to dig into this and figure out what is going on, a PR is very welcome!

queglay commented 3 years ago

I'm a bit hesitant to make a PR of this approach because I also read somewhere in the HashiCorp docs (I cannot remember where) that when Consul is used as a storage backend, the same cluster should not be used for service discovery, since that load could then affect Vault throughput. So what I've done above to get it working might not be an optimal workflow, but perhaps it is fine for small-scale operations? I don't know, but if it were indeed an acceptable solution, I would comment it with that warning.

brikis98 commented 3 years ago

Yea, I mean a PR that fixes the issue that made service registration stop working... As I wrote above, adding a service_registration is probably not the right solution for that PR.

queglay commented 3 years ago

I should add that the Vault version I am using to build the AMIs is v1.5.5:

    "vault_version": "1.5.5",
    "consul_module_version": "v0.8.0",
    "consul_version": "1.8.4",
queglay commented 3 years ago

This may be related to Ubuntu 18 only (my Vault cluster is using Ubuntu 18). I encountered something with a client I tried to get going ( https://github.com/hashicorp/terraform-aws-consul/issues/198 ) that made me wonder whether the cause is that dig vault.service.consul defaults to 127.0.0.53 and does not produce a result.

ubuntu@ip-10-4-2-183:~$ dig vault.service.consul

; <<>> DiG 9.11.3-1ubuntu1.13-Ubuntu <<>> vault.service.consul
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 20160
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;vault.service.consul.          IN      A

;; Query time: 3 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Sun Dec 20 01:24:46 UTC 2020
;; MSG SIZE  rcvd: 49

ubuntu@ip-10-4-2-183:~$ dig @localhost vault.service.consul

; <<>> DiG 9.11.3-1ubuntu1.13-Ubuntu <<>> @localhost vault.service.consul
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 39014
;; flags: qr aa rd; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;vault.service.consul.          IN      A

;; ANSWER SECTION:
vault.service.consul.   0       IN      A       10.4.2.183
vault.service.consul.   0       IN      A       10.4.2.46
vault.service.consul.   0       IN      A       10.4.1.247

;; Query time: 456 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Sun Dec 20 02:03:10 UTC 2020
;; MSG SIZE  rcvd: 97
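
The two queries above differ only in which resolver answered: 127.0.0.53 is the systemd-resolved stub listener that Ubuntu 18.04 uses by default, while @localhost goes to whatever is listening on 127.0.0.1 port 53. A minimal sketch for comparing the two at a glance follows; summarise_dig is a hypothetical helper name, not something from this repo, and it only parses the standard dig output format shown above.

```shell
#!/bin/sh
# summarise_dig (hypothetical helper): read dig output on stdin and print
# the response status, the ANSWER count and the server that replied.
summarise_dig() {
  awk '
    /->>HEADER<<-/ {
      # e.g. ";; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 20160"
      for (i = 1; i <= NF; i++)
        if ($i == "status:") { status = $(i + 1); sub(/,$/, "", status) }
    }
    /^;; flags:/ {
      # e.g. ";; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1"
      for (i = 1; i <= NF; i++)
        if ($i == "ANSWER:") { answers = $(i + 1); sub(/,$/, "", answers) }
    }
    /^;; SERVER:/ {
      # e.g. ";; SERVER: 127.0.0.53#53(127.0.0.53)"
      server = $3; sub(/#.*/, "", server)
    }
    END { printf "status=%s answers=%s server=%s\n", status, answers, server }
  '
}

# Example wiring (run on the affected node; not executed here):
#   dig vault.service.consul            | summarise_dig
#   dig @localhost vault.service.consul | summarise_dig
```

If the first line reports NXDOMAIN from 127.0.0.53 and the second reports NOERROR from 127.0.0.1, the service is registered and the remaining problem is that the default resolver is not forwarding the consul domain.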
brikis98 commented 3 years ago

Yes, we're seeing Ubuntu 18 specific issues here with the DNS code. See https://github.com/hashicorp/terraform-aws-vault/issues/223.

brikis98 commented 3 years ago

This should've been fixed in #232 and released in https://github.com/hashicorp/terraform-aws-vault/releases/tag/v0.14.2.

queglay commented 3 years ago

I updated and tested today, but found that on a brand new Vault cluster (before being initialised), Consul showed the services and nodes, yet dig returned nothing.

Before, I was using: vault-module-version: v0.13.11, vault-version: 1.5.5, consul-module-version: v0.8.0, consul-version: 1.8.4

And I tested today with: vault-module-version: v0.15.1, vault-version: 1.6.1, consul-module-version: v0.8.0, consul-version: 1.9.2

What else could I check to further diagnose the problem?

admin:~/environment/firehawk/vault-init (bump-versions) $ ssh ubuntu@10.1.0.19
Welcome to Ubuntu 18.04.5 LTS (GNU/Linux 5.4.0-1048-aws x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

  System information as of Sun May 16 14:44:01 UTC 2021

  System load:  0.0               Processes:           96
  Usage of /:   4.8% of 48.41GB   Users logged in:     0
  Memory usage: 32%               IP address for eth0: 10.1.0.19
  Swap usage:   0%

New release '20.04.2 LTS' available.
Run 'do-release-upgrade' to upgrade to it.

Last login: Sun May 16 14:37:23 2021 from 172.31.0.64
ubuntu@ip-10-1-0-19:~$ consul catalog services
consul
vault
ubuntu@ip-10-1-0-19:~$ dig vault.service.consul

; <<>> DiG 9.11.3-1ubuntu1.15-Ubuntu <<>> vault.service.consul
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 3005
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;vault.service.consul.          IN      A

;; Query time: 2 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Sun May 16 14:44:19 UTC 2021
;; MSG SIZE  rcvd: 49

ubuntu@ip-10-1-0-19:~$ dig @localhost vault.service.consul

; <<>> DiG 9.11.3-1ubuntu1.15-Ubuntu <<>> @localhost vault.service.consul
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 45282
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;vault.service.consul.          IN      A

;; AUTHORITY SECTION:
consul.                 0       IN      SOA     ns.consul. hostmaster.consul. 1621176586 3600 600 86400 0

;; Query time: 24 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Sun May 16 14:49:46 UTC 2021
;; MSG SIZE  rcvd: 99

ubuntu@ip-10-1-0-19:~$ vault status
Key                      Value
---                      -----
Recovery Seal Type       awskms
Initialized              false
Sealed                   true
Total Recovery Shares    0
Threshold                0
Unseal Progress          0/0
Unseal Nonce             n/a
Version                  1.6.1
Storage Type             s3
HA Enabled               true
ubuntu@ip-10-1-0-19:~$ 

I should note that updating did allow my infra to function normally with an existing Vault configuration (S3 backend). But this problem became evident when testing from scratch.
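
One thing that may be worth ruling out in a situation like this: Consul's DNS interface leaves out service instances whose health checks are critical, so a service can be listed by consul catalog services and still return NXDOMAIN from Consul itself. Whether that applies to the uninitialised cluster above is an assumption, not something confirmed in this thread. A rough sketch that separates the three cases follows; classify_service is a hypothetical helper name, and it takes the two command outputs as plain strings.

```shell
#!/bin/sh
# classify_service (hypothetical helper)
#   $1 = service name, e.g. "vault"
#   $2 = output of `consul catalog services` (one service name per line)
#   $3 = output of `dig +short @127.0.0.1 <name>.service.consul` (may be empty)
classify_service() {
  name="$1"
  catalog="$2"
  dns_answers="$3"
  if ! printf '%s\n' "$catalog" | grep -qxF -- "$name"; then
    # Consul has no record of the service at all.
    echo "not-registered"
  elif [ -z "$dns_answers" ]; then
    # Known to the catalog, but DNS returned no addresses.
    echo "registered-but-not-in-dns"
  else
    echo "resolvable"
  fi
}

# Example wiring (run on a Vault node; not executed here):
#   classify_service vault "$(consul catalog services)" \
#     "$(dig +short @127.0.0.1 vault.service.consul)"
```

A result of registered-but-not-in-dns points at health checks or the local DNS setup rather than at service registration.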

queglay commented 3 years ago

It was my mistake: it appears that a reboot is required after installing dnsmasq, which I didn't know about. I added a request to improve the installer to avoid that (hopefully only some services need to be restarted) here:

https://github.com/hashicorp/terraform-aws-consul/issues/224