hashicorp / vault

A tool for secrets management, encryption as a service, and privileged access management
https://www.vaultproject.io/

SIGHUP race condition on start #27100

Open colinleroy opened 3 months ago

colinleroy commented 3 months ago

If the vault process receives a SIGHUP early in its startup sequence, the signal is not handled and the process dies.

To Reproduce

root@vault-server-03:~# systemctl restart vault.service && systemctl reload vault.service 

root@vault-server-03:~# systemctl status vault.service
○ vault.service - "HashiCorp Vault - A tool for managing secrets"
     Loaded: loaded (/etc/systemd/system/vault.service; enabled; vendor preset: enabled)
     Active: inactive (dead) since Fri 2024-05-17 09:49:17 CEST; 15s ago
       Docs: https://www.vaultproject.io/docs/
    Process: 2324 ExecStart=/usr/local/bin/vault server -config=/etc/vault.d/conf/vault.hcl (code=killed, signal=HUP)
    Process: 2327 ExecReload=/bin/kill --signal HUP $MAINPID (code=exited, status=0/SUCCESS)
   Main PID: 2324 (code=killed, signal=HUP)
        CPU: 109ms

May 17 09:49:17 vault-server-03 systemd[1]: Started "HashiCorp Vault - A tool for managing secrets".
May 17 09:49:17 vault-server-03 systemd[1]: Reloading "HashiCorp Vault - A tool for managing secrets"...
May 17 09:49:17 vault-server-03 systemd[1]: Reloaded "HashiCorp Vault - A tool for managing secrets".
May 17 09:49:17 vault-server-03 systemd[1]: vault.service: Deactivated successfully.

root@vault-server-03:~# ps aux|grep vault
root@vault-server-03:~#

Expected behavior

Nothing should happen if vault receives a SIGHUP while it is starting.

Environment:

Vault server configuration file(s):

ui = true

listener "tcp" {
  address = "vault-server-03.***:8200"
  cluster_address = "vault-server-03.***:8201"
  tls_disable = 0
  tls_cert_file = "/usr/local/share/ca-certificates/***.fullchain.crt"
  tls_key_file = "/etc/ssl/private/***.key"
  telemetry {
    unauthenticated_metrics_access = true
  }
}

storage "consul" {
  address = "localhost:8500"
  scheme = "http"
  path    = "vault/"
  token   = "********-****-****-****-************"
  service = "vault"
  service_address = ""
  service_tags = "vault"
}

api_addr = "https://vault-server-03.***:8200"
cluster_addr = "https://vault-server-03.***:8201"

log_format = "json"
log_level = "info"

telemetry {
  prometheus_retention_time = "30s",
  disable_hostname = true
}

seal "gcpckms" {
    project = "***"
    region = "***"
    key_ring = "***"
    crypto_key = "***"
}

colinleroy commented 3 months ago

(In the meantime we've changed our systemd unit to use Restart=always instead of Restart=on-failure.)

heatherezell commented 3 months ago

Can you provide more context around this use case? In my own personal experience, I'm not sure I would expect a process that isn't fully started to handle a signal. Thanks! :)

colinleroy commented 3 months ago

> Can you provide more context around this use case? In my own personal experience, I'm not sure I would expect a process that isn't fully started to handle a signal. Thanks! :)

The use case is simply "boot the vault server machine after it has been shut down". systemd starts services, including vault and logrotate. logrotate rotates vault's logs and sends a SIGHUP. This had worked fine until now, but a dist-upgrade changed the timing enough that the SIGHUP now arrives too early and the vault process gets killed.

May 15 09:24:35 vault-server-03-preproduction systemd[1]: vault.service: Deactivated successfully.
May 15 09:24:36 vault-server-03-preproduction systemd[1]: logrotate.service: Succeeded.

I'm sure there is no way to completely close the race window; as you say, a service that is not fully started may not have installed its signal handler yet. But maybe there's a way to narrow the window by installing the signal handler first thing?

heatherezell commented 3 months ago

Hmm! That does sound frustrating. While researching this, I ended up down a rabbit hole of "The History of SIGHUP", many pages of which had photos of an acoustic coupler modem. It seems that using it to mean "reload configs" is a fairly recent convention. :) But I'll ask our engineers whether the signal handler can be installed earlier in the startup procedure.