hashicorp / vault

A tool for secrets management, encryption as a service, and privileged access management
https://www.vaultproject.io/

Problem Migrating to Kubernetes Due to Audit Configuration #10597

Open aekrohn opened 3 years ago

aekrohn commented 3 years ago

Describe the bug

After performing a migration of prod data and unsealing Vault, the pods in the HA configuration are unable to elect a leader.

(NOTE: a workaround for this problem is at the end of this post)

The following log lines point at the specific issue, though it was difficult to determine from them what the core problem was:

    2020-12-17T17:55:21.786Z [INFO]  rollback: starting rollback manager
    2020-12-17T17:55:21.790Z [ERROR] core: failed to create audit entry: path=syslog/ error="Unix syslog delivery error"
    2020-12-17T17:55:21.790Z [INFO]  core: pre-seal teardown starting
    2020-12-17T17:55:21.790Z [WARN]  expiration: context canceled while restoring leases, stopping lease loading
    2020-12-17T17:55:21.800Z [INFO]  rollback: stopping rollback manager
    2020-12-17T17:55:21.800Z [INFO]  core: pre-seal teardown complete
    2020-12-17T17:55:21.800Z [ERROR] core: post-unseal setup failed: error="failed to setup audit table"
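
What these lines show is the core problem: during post-unseal setup the node replays the audit table recorded in the migrated storage, the pod has no syslog endpoint to deliver to, the "failed to setup audit table" error aborts the setup, and the node tears back down to its pre-seal state. Every pod repeats the same cycle, which is why no leader is ever elected. A pre-flight check on the source cluster makes the carried-over device visible (assuming VAULT_ADDR points at the source and the token can read sys/audit):

    vault audit list -detailed   # the syslog/ device from the error above shows up here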

To Reproduce

Steps to reproduce the behavior:

  1. Create a HA Kubernetes deployment using the Vault Helm chart
  2. Configure and run a migration whose source data comes from a Vault that is configured to use syslog for auditing (see the sketch after this list)
  3. Unseal Vault using keys from migration source installation
  4. The HA pair never elects a leader, preventing any further interaction with the new Vault
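
For step 2, the migration is driven by vault operator migrate with a config that names the source and destination storage backends. The issue doesn't include the reporter's actual migration config, so the following is only a minimal Consul-to-Consul sketch with placeholder addresses (the destination mirrors the Helm values below); the command runs offline, directly against storage, with Vault stopped at the source:

    # hypothetical migrate.hcl -- SOURCE_HOST is a placeholder
    cat > migrate.hcl <<'EOF'
    storage_source "consul" {
      address = "SOURCE_HOST:8500"
      path    = "vault"
    }

    storage_destination "consul" {
      address = "HOST_IP:8501"
      scheme  = "https"
      path    = "vault"
    }
    EOF

    vault operator migrate -config=migrate.hcl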

Expected behavior

After unsealing with the keys from the source installation, the pods elect a leader and the migrated cluster becomes available.

Environment:

Vault Helm chart values:

        global:
          enabled: true
          tlsDisable: false
        server:
          enabled: true
          ha:
            enabled: true
            config: |
              ui = true
              api_addr = "https://POD_IP:8200"
              listener "tcp" {
                  address     = "0.0.0.0:8200"
                  tls_disable = "false"
                  tls_cert_file = "/vault/userconfig/vault-dev-ui-tls/tls.crt"
                  tls_key_file = "/vault/userconfig/vault-dev-ui-tls/tls.key"
                  tls_min_version = "tls12"
                  tls_client_ca_file = "/vault/userconfig/vault-client-ca-tls-data/tls-client-ca-cert-file"
                  tls_disable_client_certs = "false"
              }
              storage "consul" {
                  path = "vault"
                  address = "HOST_IP:8501"
                  scheme = "https"
                  tls_cert_file = "/vault/userconfig/consul-tls-data/tls-cert-file"
                  tls_key_file = "/vault/userconfig/consul-tls-data/tls-key-file"
                  tls_ca_file = "/vault/userconfig/consul-tls-data/tls-ca-cert-file"
              }
          ingress:
            enabled: false
          extraEnvironmentVars:
            VAULT_CACERT: /vault/userconfig/vault-client-ca-tls-data/tls-client-ca-cert-file
            VAULT_SEAL_TYPE: shamir
          extraSecretEnvironmentVars:
            - envName: CONSUL_HTTP_TOKEN
              secretName: consul-access-token
              secretKey: consul.token
          service:
            port: 8200
          auditStorage:
            enabled: true
          standalone:
            enabled: false
          readinessProbe:
            enabled: true
            path: /v1/sys/health?standbyok=true
            failureThreshold: 2
            initialDelaySeconds: 5
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 3
          extraVolumes:
            - type: secret
              name: consul-tls-data
            - type: secret
              name: vault-client-ca-tls-data
            - type: secret
              name: vault-dev-ui-tls
            - type: secret
              name: migration-config
        ui:
          enabled: true
          publishNotReadyAddresses: true
          activeVaultPodOnly: false
          serviceType: "LoadBalancer"
          serviceNodePort: null
          externalPort: 8200
          loadBalancerSourceRanges:
            - 10.0.0.0/8
          annotations:
            service.beta.kubernetes.io/aws-load-balancer-internal: "true"
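
With this chart deployed, the symptom is visible from outside the pods; a quick check, assuming the chart's default StatefulSet pod names (vault-0, vault-1, ...):

    kubectl logs vault-0 | grep -i audit          # surfaces the "failed to setup audit table" error
    kubectl exec -ti vault-0 -- vault status      # unsealed, but no pod ever reports an active node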

Workaround

This problem can be solved by disabling syslog auditing on the source Vault deployment, running the migration, and then enabling syslog auditing again after the migration completes (spelled out as commands below). This is fine for some environments, but probably not all, and the audit gap would almost certainly raise questions during a formal audit.
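
As commands, the workaround looks roughly like this (a sketch assuming the source device is mounted at syslog/, as the error log indicates, and reusing the hypothetical migrate.hcl from above):

    # on the source cluster: remove the device the Kubernetes pods cannot open
    vault audit disable syslog/

    # run the offline migration
    vault operator migrate -config=migrate.hcl

    # restore auditing on the source once the copy is done
    vault audit enable syslog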

aphorise commented 2 years ago

@aekrohn do you agree that the proper order would actually be to enable an additional audit device writing to stdout before disabling the syslog device, and then proceed with the migration? That way there would be no audit loss at the source nor any downtime on the K8s destination, since stdout is a common output strategy there:

    vault audit enable -path=file_stdout file file_path=stdout
    # then verify the new device and remove the old one:
    # vault audit list && vault audit disable ...

There's also the recovery approach as a worst-case scenario, if you have no means of going back, where you could disable things as they've shown in this Support KB article.

@aekrohn hey, any ideas where in the docs you'd put this callout?

The last portion of the ask, I feel, may reasonably be met by providing a -audit-ignore-on-boot parameter or configuration setting for boot time, especially in cases where audit devices / paths are broken beyond repair.

Ignore broken audit config settings in the source data, even if that means auditing is no longer functional on the target deployment, and emit a warning to notify the user that this has happened.
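
If such an escape hatch were added, usage might look like this (purely hypothetical: -audit-ignore-on-boot is the proposal above, not an existing flag, and the config path is a placeholder):

    # hypothetical flag from this feature request; today vault server has no such option
    vault server -config=/vault/config/server.hcl -audit-ignore-on-boot
    # intended behavior: log a WARN for each audit device that cannot be opened,
    # instead of failing post-unseal with "failed to setup audit table"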