ChrisLundquist commented 7 years ago

Vault version:

root@vault-1:~# vault -version
Vault v0.6.2

config:

{
  "backend": {
    "consul": {
      "address": "127.0.0.1:8500", 
      "path": "vault"
    }
  }, 
  "disable_mlock": true, 
  "listener": {
    "tcp": {
      "address": "0.0.0.0:8200", 
      "tls_disable": 1
    }
  }
}

My cluster looks like:

$ lxc list
+-----------------+---------+-----------------------+-------------------------------------------+------------+-----------+
|      NAME       |  STATE  |         IPV4          |                   IPV6                    |    TYPE    | SNAPSHOTS |
+-----------------+---------+-----------------------+-------------------------------------------+------------+-----------+
| consul-server-1 | RUNNING | 10.170.131.180 (eth0) | <redacted> (eth0) | PERSISTENT | 0         |
+-----------------+---------+-----------------------+-------------------------------------------+------------+-----------+
| consul-server-2 | RUNNING | 10.170.131.200 (eth0) | <redacted> (eth0) | PERSISTENT | 0         |
+-----------------+---------+-----------------------+-------------------------------------------+------------+-----------+
| consul-server-3 | RUNNING | 10.170.131.244 (eth0) | <redacted> (eth0) | PERSISTENT | 0         |
+-----------------+---------+-----------------------+-------------------------------------------+------------+-----------+
| vault-1         | RUNNING | 10.170.131.69 (eth0)  | <redacted> (eth0) | PERSISTENT | 0         |
+-----------------+---------+-----------------------+-------------------------------------------+------------+-----------+
| vault-2         | RUNNING | 10.170.131.94 (eth0)  | <redacted> (eth0) | PERSISTENT | 0         |

Consul config:

root@consul-server-1:~# cat /etc/consul.d/config.json 
{"bind_addr": "0.0.0.0", "bootstrap_expect": 3, "data_dir": "/var/consul", "datacenter": "main", "enable_debug": true, "encrypt": "", "log_level": "info", "retry_interval": "15s", "retry_join": [], "server": true, "ui": true}
root@consul-server-1:~# consul members
Node             Address              Status  Type    Build  Protocol  DC
consul-server-1  10.170.131.180:8301  alive   server  0.7.0  2         main
consul-server-2  10.170.131.200:8301  alive   server  0.7.0  2         main
consul-server-3  10.170.131.244:8301  alive   server  0.7.0  2         main
vault-1          10.170.131.69:8301   alive   client  0.7.0  2         main
vault-2          10.170.131.94:8301   alive   client  0.7.0  2         main

Repro case:

Both unsealed --1 is the master.

step down on -1


root@vault-1:~# vault status
Sealed: false
Key Shares: 5
Key Threshold: 3
Unseal Progress: 0
Version: 0.6.2
Cluster Name: vault-cluster-7398db7b
Cluster ID: a8cd3f90-1a4c-5e5d-8a64-06af14cead32

High-Availability Enabled: true Mode: active Leader: http://10.170.131.69:8200

root@vault-1:~# vault step-down

root@vault-1:~# vault status Sealed: false Key Shares: 5 Key Threshold: 3 Unseal Progress: 0 Version: 0.6.2 Cluster Name: vault-cluster-7398db7b Cluster ID: a8cd3f90-1a4c-5e5d-8a64-06af14cead32

High-Availability Enabled: true Mode: standby Leader: http://10.170.131.94:8200 root@vault-1:~# vault read secret/crash-worker/foo ^C

Hang


On the other node:

root@vault-2:~# vault status Sealed: false Key Shares: 5 Key Threshold: 3 Unseal Progress: 0 Version: 0.6.2 Cluster Name: vault-cluster-7398db7b Cluster ID: a8cd3f90-1a4c-5e5d-8a64-06af14cead32

High-Availability Enabled: true Mode: standby Leader: http://10.170.131.69:8200

< the other node steps down here >

root@vault-2:~# vault status Error checking seal status: Get http://127.0.0.1:8200/v1/sys/seal-status: net/http: request canceled (Client.Timeout exceeded while awaiting headers)


The hung vault leader is doing seemingly nothing:

root@vault-2:~# ps aux | grep vault vault 8535 0.0 0.1 37772 22332 ? Ssl 02:21 0:01 /usr/local/bin/vault server -config=/etc/vault.d


strace continues like this:

[pid 8536] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...> [pid 8604] futex(0x1df3998, FUTEX_WAIT, 0, {0, 999900405} <unfinished ...> [pid 8536] <... select resumed> ) = 0 (Timeout) [pid 8536] futex(0x1df4058, FUTEX_WAIT, 0, {60, 0} <unfinished ...> [pid 8538] <... epoll_wait resumed> [{EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=853363328, u64=140253010414208}}], 128, -1) = 1 [pid 8538] epoll_wait(4, <unfinished ...> [pid 8604] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) [pid 8604] futex(0x1df4058, FUTEX_WAKE, 1) = 1 [pid 8536] <... futex resumed> ) = 0 [pid 8604] futex(0xc4203d6910, FUTEX_WAKE, 1) = 1 [pid 8540] <... futex resumed> ) = 0 [pid 8536] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...> [pid 8604] futex(0x1df4048, FUTEX_WAIT, 0, {0, 100000}) = -1 EAGAIN (Resource temporarily unavailable) [pid 8540] futex(0x1df4048, FUTEX_WAKE, 1 <unfinished ...> [pid 8604] epoll_wait(4, <unfinished ...> [pid 8540] <... futex resumed> ) = 0 [pid 8604] <... epoll_wait resumed> [], 128, 0) = 0 [pid 8540] futex(0xc4203d6910, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 8536] <... select resumed> ) = 0 (Timeout) [pid 8604] futex(0xc4203d6910, FUTEX_WAKE, 1 <unfinished ...> [pid 8540] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable) [pid 8604] <... futex resumed> ) = 0 [pid 8536] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...> [pid 8540] futex(0xc4200be510, FUTEX_WAKE, 1) = 1 [pid 8539] <... futex resumed> ) = 0 [pid 8540] futex(0x1df3998, FUTEX_WAIT, 0, {0, 889635971} <unfinished ...> [pid 8539] futex(0xc4200be510, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 8536] <... select resumed> ) = 0 (Timeout) [pid 8604] futex(0xc4203d6d10, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 8536] futex(0x1df4058, FUTEX_WAIT, 0, {60, 0}^Cstrace: Process 8535 detached


I tried stopping the old primary `-1` in this case, to see if it held on to the consul lock, but that wasn't the case.

Originally, both were configured to listen on `127.0.0.1`, and I thought it may be an issue related to both masters thinking they were `127.0.0.1`. That doesn't seem to be the case though, as now they are both listening on `0.0.0.0` and see the other's IPv4 addresses.

*Note* configuring `0.0.0.0` actually binds to `::0` or in6_addr_any. [This is a golang thing](https://github.com/golang/go/issues/7411)

jefferai commented 7 years ago

Hi Chris,

Can you SIGQUIT the hung process? That will be much more useful than the strace.

Thanks!

ChrisLundquist commented 7 years ago

Hey Jeff! That was fast!

Here is the output I found in syslog: https://gist.github.com/ChrisLundquist/8c9d9093ada98cb139b07de29fcdbbc8

ChrisLundquist commented 7 years ago

I inspected the stack trace to no avail. I tried getting just -2 to run, and had trouble.

In syslog, it complained about:

Nov 11 03:58:49 vault-2 vault[8958]: 2016/11/11 03:58:49.431173 [ERROR] core: failed to create audit entry: path=file/ error=sanity check failed; unable to open /var/log/vault/audit.log for writing: mkdir /var/log/vault: permission denied

created it with:

mkdir /var/log/vault
chown vault: /var/log/vault

stopped -1 restarted -2 -2 took active in a few seconds vault step-down now works as advertised, and can swap back and forth.

I believe I got into this state because I enabled the file audit on -1, and the directory didn't exist on -2

Is such a thing possible?

I'm a little surprised that everything appeared fine, until it tried to take active.

Edit: To be clear. I enabled file audit and configured it on -1 previously, everything was fine. Then because -2 was brought up later, it didn't have the log directory.

I guess what I'm saying, is that seems that the file audit backend can disrupt the failover procedure. This is a little surprising, it kinda makes sense, but it seems a little risky and difficult to figure out.

jefferai commented 7 years ago

Is such a thing possible?

It absolutely is. If auditing is enabled it must succeed to at least one sink, otherwise requests hang until it becomes available (IOW it will keep retrying.) This is because auditing is considered super duper important so Vault will do everything possible to make sure every entry is logged.

Can it block normal failover, sure, but that's what testing is for :-)

ChrisLundquist commented 7 years ago

It absolutely is. If auditing is enabled it must succeed to at least one sink, otherwise requests hang until it becomes available (IOW it will keep retrying.)

It looks like it must succeed to all.

root@vault-2:~# vault audit-list
Path     Type    Description  Options
file/    file                 file_path=/var/log/vault/audit.log
syslog/  syslog               facility=AUTH tag=vault

/var/log/auth.log has stuff like you'd expect:

Nov 11 04:39:20 vault-2 vault[9044]: {"time":"2016-11-11T04:39:20Z","type":"response","error":"","auth":{"client_token":"","accessor":"","display_name":"token","policies":["root"],"metadata":null},"request":{"id":"a48400cb-c49e-4ade-9519-b1e6cd75cca3","operation":"read","client_token":"hmac-sha256:1b404cc5ec6b3d1b3119697e801d6c8cb2df6f19b9606c63e46af07048205d6d","path":"sys/audit","data":null,"remote_address":"127.0.0.1","wrap_ttl":0},"response":{"data":{"file/":{"description":"hmac-sha256:e96030ff3bd6ee0ef785e16cb8fa3e6b815a99cb1e52fbc1f25a1aca6e74714c","options":{"file_path":"hmac-sha256:6215789311184935c39cc7a3bea8da4f75bcfcedd6450f7ba62bf66d3df12a82"},"path":"hmac-sha256:af12067e82d87bb3cb48943819c6ec5dea957bb641ff33c4da849a3d35dba3bb","type":"hmac-sha256:d2b674addfd2efbe14a8eea3dfbbe1614247e95ec3e24e71449fc17ec1e0405f"},"syslog/":{"description":"hmac-sha256:e96030ff3bd6ee0ef785e16cb8fa3e6b815a99cb1e52fbc1f25a1aca6e74714c","options":{"facility":"hmac-sha256:2461ee16cafe7e027760eacd375f252347c4365c158a9f2614d7d103d3dc390b","tag":"hmac-sha256:0c4c6fe5a068b865418b9dfa4572f7d9b58dcc2b15c94c9cb39e84014b7931b9"},"path":"hmac-sha256:4a4c8502c07df57b9f1426145cf8be41b3244406845bc32db1f868d60e5f9c0a","type":"hmac-sha256:59c1a9320727db62d4cd3608ab2a15954bcc1fb58cc0db44d65e84db617453e8"}}}}

ChrisLundquist commented 7 years ago

Do you think it would be practical or possible to perform preflight checks like this in vault status?

jefferai commented 7 years ago

@ChrisLundquist I can't repro behavior where all logs must succeed.

I created two audit logs, verified that both were logging, changed permissions on one to make it non-writable, HUP'd Vault, and I could see errors in the log from not being able to write to the inaccessible log file, but operations succeeded.

ChrisLundquist commented 7 years ago

Hmm, I'm still a little confused here as I think I have a repro case, when I remove /var/log/vault from the stand-by node, then step down the active node, vault gets stuck.

Maybe I misunderstood. Is the requirement:

at least one destination for each enabled audit backend? (e.g file /var/log/vault, file /mnt/logs/vault/, one must work)

or is it

at least one audit backend (e.g. syslog and file are both enabled, file doesn't work, but syslog works)

The repro case seems to be: The above setup, consul backend, all that.

root@vault-2:~# vault audit-list
Path     Type    Description  Options
file/    file                 file_path=/var/log/vault/audit.log
syslog/  syslog               facility=AUTH tag=vault

mv /var/log/vault /tmp/ on the stand by (notice in the previous post, the was was stuck on mk_dir, rather than open)
on the primary vault step-down This node will recover, and forward all requests to the master, requests will hang though other than status.
On the other node, even status hangs.
logs aren't reaching /var/log/auth.log
try mv /tmp/vault /var/log/ no effect.
restart vault on the hung node
see logs in both /var/log/auth.log and /var/log/vault/audit.log

Again, if the requirement is one destination for each backend must work, this would be consistent.

I can't recall hearing or reading about a gotchya like this though in the docs. https://www.vaultproject.io/docs/internals/high-availability.html https://www.vaultproject.io/docs/concepts/ha.html don't seem to mention audit other than the navigation bar.

I guess, I'm happy to chalk this up to some weirdness in my config, it seems like it is reproducible to me, or there is a constraint I'm missing. I'm unblocked and can move on (but still a little surprised and confused at the behavior,[that there are no forward breadcrumbs, not that audits are important]), so if you're satisfied on your side, feel free to resolve/close this issue.

Fan Rant: I really like Vault so far, and enjoyed your talk on secure introduction (I had just hit the SI problem when a team wanted to use kubernetes), Kelsey's talk on 12 factor hashiapps, and Seth's talk at Goto London. (The Q&As of Seth's clarified "backend" vs "secret backend") Thank you so much for working on Vault and your help figuring this out.

jefferai commented 7 years ago

Hi @ChrisLundquist ,

The behavior is supposed to be option 2 -- if any work that's enough. Looking at the code, that ought to also be what happens.

I am wondering at this point if the audit logging is actually somewhat of a red herring. status doesn't audit log as it's unauthenticated, so if status is failing on the hung node it suggests that perhaps the real problem lies somewhere else. I will try to dig into this before 0.6.3.

jefferai commented 7 years ago

@ChrisLundquist I took another look at that stack trace. I'm guessing that was from the standby node? Because there are multiple acquisitions of the state lock in there but they are all read locks.

If it's easy for you to reproduce any chance you can call SIGQUIT on the hung active node that isn't accepting status from either the standby or directly?

(Also if it's easy to get the logs directly instead of in syslog the formatting makes it much nicer to read/follow, but no big deal if not.)

Thanks!

ChrisLundquist commented 7 years ago

@jefferai The stack trace was from the hung node. It hangs in the transition from standby to active.

This is a test setup at home, if it would help, I am willing to give you SSH access to the setup.

I've added the cleaner output to: https://gist.github.com/ChrisLundquist/8c9d9093ada98cb139b07de29fcdbbc8

One thing I noted was that, initially, I ran vault as root, so it had permissions to make /var/log/vault/audit.log, the folder got recreated when step-down happened. (Edit: and the fail-over succeeded)

when I ran it with sudo -u vault, the hang happened.

ChrisLundquist commented 7 years ago

2016/11/11 19:52:27.288735 [WARN ] physical/consul: appending trailing forward slash to path
2016/11/11 19:52:38.270294 [INFO ] core: vault is unsealed
2016/11/11 19:52:38.270349 [WARN ] physical/consul: Concurrent sealed state change notify dropped
2016/11/11 19:52:38.270456 [INFO ] core: entering standby mode

2016/11/11 19:53:46.515299 [INFO ] core: acquired lock, enabling active operation
2016/11/11 19:53:46.633266 [WARN ] physical/consul: Concurrent state change notify dropped
2016/11/11 19:53:46.633296 [INFO ] core: post-unseal setup starting
2016/11/11 19:53:46.639404 [INFO ] core: successfully mounted backend: type=system path=sys/
2016/11/11 19:53:46.639466 [INFO ] core: successfully mounted backend: type=cubbyhole path=cubbyhole/
2016/11/11 19:53:46.639532 [INFO ] core: successfully mounted backend: type=consul path=consul/
2016/11/11 19:53:46.639588 [INFO ] core: successfully mounted backend: type=generic path=secret/
2016/11/11 19:53:46.639651 [INFO ] core: successfully mounted backend: type=mysql path=mysql/
2016/11/11 19:53:46.639712 [INFO ] core: successfully mounted backend: type=mysql path=db/crash-worker/
2016/11/11 19:53:46.639763 [INFO ] core: successfully mounted backend: type=mysql path=db/collab-service/
2016/11/11 19:53:46.639884 [INFO ] rollback: starting rollback manager
2016/11/11 19:53:46.661040 [INFO ] expire: leases restored: restored_lease_count=4
2016/11/11 19:53:46.662942 [ERROR] core: failed to create audit entry: path=file/ error=sanity check failed; unable to open /var/log/vault/audit.log for writing: mkdir /var/log/vault: permission denied
2016/11/11 19:53:46.662973 [INFO ] core: pre-seal teardown starting
2016/11/11 19:53:46.662983 [INFO ] core/stopClusterListener: stopping listeners

Appears to be the key.

My interpretation is:

We're in standby, and unsealed
the other node steps down, we call postUnseal
postUnseal calls setupAudits
setupAudits returns an error on the first audit backend, before trying others

Note:

root@vault-2:~# vault audit-list
Path     Type    Description  Options
file/    file                 file_path=/var/log/vault/audit.log
syslog/  syslog               facility=AUTH tag=vault

file is first. presumably it is tried in the same order.

postUnseal then forward the error message:

/* postUnseal */
1146     if err := c.setupAudits(); err != nil {
1147         return err
1148     }

I'm not clear on who is the caller of postUnseal though.

It seems to be runStandby because we see core: acquired lock, enabling active operation in the log, but I don't see the below error message:

1276         // Attempt the post-unseal process
1277         err = c.postUnseal()
1278         if err == nil {
1279             c.standby = false
1280         }
1281         c.stateLock.Unlock()
1282 
1283         // Handle a failure to unseal
1284         if err != nil {
1285             c.logger.Error("core: post-unseal setup failed", "error", err)
1286             lock.Unlock()
1287             metrics.MeasureSince([]string{"core", "leadership_setup_failed"}, activeTime)
1288             continue
1289         }

presumably we're stuck in a cleanup state here, as other go routines are locked on: c.stateLock.RLock()

jefferai commented 7 years ago

github.com/hashicorp/vault/vault.(*Core).stopClusterListener(0xc420072700)
    /gopath/src/github.com/hashicorp/vault/vault/cluster.go:323 +0xf4
github.com/hashicorp/vault/vault.(*Core).preSeal(0xc420072700, 0x0, 0x0)
    /gopath/src/github.com/hashicorp/vault/vault/core.go:1178 +0x6c9

This appears to be the problem; it seems that the channel has not consumed the existing value, so sending a shutdown blocks because the channel is unbuffered.

jefferai commented 7 years ago

Any chance you can build on your own and test that branch out? Otherwise I could send you a binary. I think that should fix it.

ChrisLundquist commented 7 years ago

The behavior is much improved. It looks like the node tries to become master, fails a few times, then the working node grabs the lock again.

linked here: https://gist.github.com/ChrisLundquist/8c9d9093ada98cb139b07de29fcdbbc8#file-vault-patched-output

It still seems that all audit sinks are required for a successful promotion to active. (Edit: or maybe at least the first sink is required) The current patch is a big win though, as the cluster is able to recover automatically.

jefferai commented 7 years ago

Yep, the bit about all audit sinks setting up successfully is something I'll address separately -- I want to discuss internally about if we actually want to change that (arguably if we know that the node taking on active duty is in a bad state, we should fail so fhat another node can acquire the lock; but I realize there are failure scenarios either way).

ChrisLundquist commented 7 years ago

Thank you for taking a look at this and hearing me out. It is clear how much pride and joy you have in vault!

ChrisLundquist commented 7 years ago

I migrated my containers between hosts and IPs in a poor fashion, and I discovered some things. Since it was a test cluster I didn't take care, and consul lost quorum. After recovering consul and restarting vault I encountered a strange behavior.

root@vault-1:~# vault list secret/
Error reading secret/: EOF
root@vault-1:~# vault list secret/
Error reading secret/: EOF

Looking at vault status you could see leader flapping since I only had setup the one broken node. Tracking down the logs, we could see the error we debugged above:

Nov 28 22:06:37 vault-1 vault[16152]: 2016/11/28 22:06:37.984102 [ERROR] core: failed to create audit entry: path=file/ error=sanity check failed; unable to open /var/log/vault/audit.log for writing: mkdir /var/log/vault: permission denied
Nov 28 22:06:37 vault-1 vault[16152]: 2016/11/28 22:06:37.984121 [INFO ] core: pre-seal teardown starting
Nov 28 22:06:37 vault-1 vault[16152]: 2016/11/28 22:06:37.984127 [INFO ] core/stopClusterListener: listeners not running
Nov 28 22:06:37 vault-1 vault[16152]: 2016/11/28 22:06:37.984141 [INFO ] rollback: stopping rollback manager
Nov 28 22:06:37 vault-1 vault[16152]: 2016/11/28 22:06:37.984160 [INFO ] core: pre-seal teardown complete
Nov 28 22:06:37 vault-1 vault[16152]: 2016/11/28 22:06:37.984169 [ERROR] core: post-unseal setup failed: error=failed to setup audit table

Mostly I wanted to document getting EOF as an edge case for anyone else that hits this in the future.

I will say that, having consul lose quorum and vault recovering virtually seamlessly is pretty awesome. If I didn't leave the node in a broken state; I wouldn't have hit any issues. These are the sorts of things that help me sleep at night! :)

hashicorp / vault

Setting up audit sinks can block unseal #2083

Hang

< the other node steps down here >