hashicorp / vault

A tool for secrets management, encryption as a service, and privileged access management
https://www.vaultproject.io/

TLS Handshake Error from ALB causing Vault to shutdown #7107

Closed: nocode99 closed this issue 4 years ago

nocode99 commented 5 years ago

Environment:

Vault Config File:

ui=true

storage "s3" {
  bucket = "my-s3-bucket"
  region = "us-east-1"
}

listener "tcp" {
  address     = "0.0.0.0:8200"
  tls_cert_file   = "/vault/config/ssl.com.crt"
  tls_key_file    = "/vault/config/ssl.com.key"

}

seal "awskms" {
  region = "us-east-1"
  kms_key_id = "123456-abcd-4293-bcd4-fdedbb6ec2cb"
}

Log Output:

Seal Type: awskms
Cgo: disabled
Listener 1: tcp (addr: "0.0.0.0:8200", cluster address: "0.0.0.0:8201", max_request_duration: "1m30s", max_request_size: "33554432", tls: "enabled")
Log Level: (not set)
Mlock: supported: true, enabled: true
Storage: s3
Version: Vault v1.0.2
Version Sha: 37a1dc9c477c1c68c022d2084550f25bf20cac33
==> Vault server started! Log data will stream in below:
2019-07-08T19:25:08.272Z [WARN] no `api_addr` value specified in config or in VAULT_API_ADDR; falling back to detection if possible, but this value should be manually set
2019-07-08T19:25:08.493Z [INFO] core: stored unseal keys supported, attempting fetch
2019-07-08T19:25:08.699Z [INFO] core: vault is unsealed
2019-07-08T19:25:08.763Z [INFO] core: post-unseal setup starting
2019-07-08T19:25:08.919Z [INFO] core: loaded wrapping token key
2019-07-08T19:25:08.919Z [INFO] core: successfully setup plugin catalog: plugin-directory=
2019-07-08T19:25:09.056Z [INFO] core: successfully mounted backend: type=generic path=secret/
2019-07-08T19:25:09.056Z [INFO] core: successfully mounted backend: type=system path=sys/
2019-07-08T19:25:09.057Z [INFO] core: successfully mounted backend: type=identity path=identity/
2019-07-08T19:25:09.057Z [INFO] core: successfully mounted backend: type=aws path=aws/
2019-07-08T19:25:09.057Z [INFO] core: successfully mounted backend: type=pki path=pki/
2019-07-08T19:25:09.057Z [INFO] core: successfully mounted backend: type=cubbyhole path=cubbyhole/
2019-07-08T19:25:09.627Z [INFO] core: successfully enabled credential backend: type=token path=token/
2019-07-08T19:25:09.627Z [INFO] core: successfully enabled credential backend: type=github path=github/
2019-07-08T19:25:09.627Z [INFO] core: successfully enabled credential backend: type=aws-ec2 path=aws-ec2/
2019-07-08T19:25:09.627Z [INFO] core: successfully enabled credential backend: type=aws path=aws/
2019-07-08T19:25:09.627Z [INFO] core: successfully enabled credential backend: type=userpass path=userpass/
2019-07-08T19:25:09.627Z [INFO] core: restoring leases
2019-07-08T19:25:09.627Z [INFO] rollback: starting rollback manager
2019-07-08T19:25:10.077Z [INFO] identity: entities restored
2019-07-08T19:25:10.089Z [INFO] identity: groups restored
2019-07-08T19:25:10.104Z [INFO] core: post-unseal setup complete
2019-07-08T19:25:10.104Z [INFO] core: successfully unsealed with stored key(s): stored_keys_used=1
2019-07-08T19:25:10.104Z [INFO] core: starting listener: listener_address=0.0.0.0:8201
2019-07-08T19:25:10.104Z [INFO] core: serving cluster requests: cluster_listen_address=[::]:8201
2019-07-08T19:25:11.592Z [INFO] expiration: lease restore complete
2019-07-08T19:25:27.700Z [INFO] expiration: revoked lease: lease_id=auth/aws/login/h7f877ff329cb7a592ede5202d7474559be8b1b0748203cdf2d3b0099f55573d8
2019-07-08T19:30:05.588Z [INFO] expiration: revoked lease: lease_id=auth/aws/login/hcadc26423ce08a208d59e45d20884b9ec52f11d0e5da03780c22c4936c518089
2019-07-08T19:30:05.680Z [INFO] expiration: revoked lease: lease_id=auth/aws/login/h042ec92de35d871374959bf4e6b9da5a88470093bf46d18aca1959b6cf532ea9
2019-07-08T19:30:17.113Z [INFO] expiration: revoked lease: lease_id=auth/aws/login/h6e5fc46365750b0bc527d994a0473ab9f1bb144891f67fbb0ba5c91033074880
2019-07-08T19:30:30.018Z [INFO] expiration: revoked lease: lease_id=auth/aws/login/haeb2af7e0cce4aca5ee99a06caf814a931f5872b50e3a5e291f3f58cf3218bcd
2019-07-08T19:30:32.798Z [INFO] expiration: revoked lease: lease_id=auth/aws/login/he96b9389cb9cc680fdb7c020ae40c12cdb8a81b0241aeb22016b6f187d00b968
2019-07-08T19:30:34.283Z [INFO] expiration: revoked lease: lease_id=auth/aws/login/hf305de331975f0c1d6341da94e11af184608216c7a48f1247d2b0730e79d8587
2019-07-08T19:30:35.470Z [INFO] expiration: revoked lease: lease_id=auth/aws/login/h062990c9fd2eaf130baa63277a4a9e21222f704ac99c389e140411464f348807
2019-07-08T19:30:35.728Z [INFO] expiration: revoked lease: lease_id=auth/aws/login/he21c076f4cecf6277773c8f11b15fceaf83f03b9480492609b4ced38dbf31424
2019-07-08T19:33:35.438Z [INFO] http: TLS handshake error from 10.99.68.184:9735: EOF
2019-07-08T19:33:39.874Z [INFO] http: TLS handshake error from 10.99.18.133:5130: EOF
2019-07-08T19:33:39.876Z [INFO] http: TLS handshake error from 10.99.68.184:9739: EOF
2019-07-08T19:33:40.081Z [INFO] http: TLS handshake error from 10.99.18.133:5138: EOF
2019-07-08T19:34:02.159Z [INFO] http: TLS handshake error from 10.99.68.184:9773: EOF
2019-07-08T19:35:05.429Z [INFO] http: TLS handshake error from 10.99.18.133:5284: EOF
2019-07-08T19:35:19.387Z [INFO] http: TLS handshake error from 10.99.18.133:5320: EOF
2019-07-08T19:35:19.866Z [INFO] http: TLS handshake error from 10.99.68.184:9907: EOF
2019-07-08T19:35:30.255Z [INFO] http: TLS handshake error from 10.99.68.184:9931: EOF
2019-07-08T19:35:35.235Z [INFO] http: TLS handshake error from 10.99.18.133:5344: EOF
2019-07-08T19:35:39.843Z [INFO] http: TLS handshake error from 10.99.68.184:9947: EOF
2019-07-08T19:35:42.015Z [INFO] http: TLS handshake error from 10.99.18.133:5352: EOF
2019-07-08T19:35:45.292Z [INFO] expiration: revoked lease: lease_id=auth/aws/login/h3101a21f290b9dc8f37676584487635909df2644089cd7551153b5163843b63f
==> Vault shutdown triggered
2019-07-08T19:35:45.937Z [INFO] core: marked as sealed
2019-07-08T19:35:45.937Z [INFO] core: pre-seal teardown starting
2019-07-08T19:35:45.937Z [INFO] core: stopping cluster listeners
2019-07-08T19:35:45.937Z [INFO] core: shutting down forwarding rpc listeners
2019-07-08T19:35:45.937Z [INFO] core: forwarding rpc listeners stopped
2019-07-08T19:35:46.140Z [INFO] core: rpc listeners successfully shut down
2019-07-08T19:35:46.140Z [INFO] core: cluster listeners successfully shut down
2019-07-08T19:35:46.187Z [INFO] rollback: stopping rollback manager
2019-07-08T19:35:46.197Z [INFO] core: pre-seal teardown complete
2019-07-08T19:35:46.197Z [INFO] core: vault is sealed

Steps to Reproduce: We run this on ECS behind an ALB. We build on top of the official HashiCorp Docker image to pull in our own SSL certificate.

[
  {
    "essential": true,
    "name": "vault",
    "portMappings": [
      {
        "hostPort": 0,
        "containerPort": 8200,
        "protocol": "tcp"
      }
    ],
    "linuxParameters": {
      "capabilities": {
        "add": ["IPC_LOCK"]
      }
    },
    "environment": [
      {
        "name": "AWS_DEFAULT_REGION",
        "value": "us-east-1"
      }
    ],
    "command": ["server"],
    "image": "${REGISTRY}/${NAME}:${VERSION}",
    "cpu": 0,
    "memoryReservation": 512
  }
]

Important Factoids: We used to run Vault on EC2 instances with Consul as the storage backend, but have since used the vault operator migrate command to move our storage backend to S3. We then moved off of EC2 instances and switched to a single container behind an ALB. The ALB runs an HTTPS health check against /v1/sys/health.
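
For context, a rough sketch of what that health check amounts to is below (illustrative only, not our production check; the hostname is a placeholder, and the status-code handling follows the documented /v1/sys/health defaults):

// healthprobe.go: sketch of what the ALB health check does against
// /v1/sys/health. The address below is a placeholder.
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{
		Timeout: 5 * time.Second,
		// The ALB does not validate the backend certificate, so skip
		// verification here to mirror that behavior (sketch only).
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}

	resp, err := client.Get("https://vault.example.internal:8200/v1/sys/health")
	if err != nil {
		fmt.Println("health check failed:", err)
		return
	}
	defer resp.Body.Close()

	switch resp.StatusCode {
	case 200:
		fmt.Println("initialized, unsealed, active")
	case 429:
		fmt.Println("unsealed, standby")
	case 501:
		fmt.Println("not initialized")
	case 503:
		fmt.Println("sealed")
	default:
		fmt.Println("unexpected status:", resp.StatusCode)
	}
}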

95% of the time Vault runs with no issues, but occasionally it shuts down. Sometimes during these incidents it starts and stops a few times before running normally again. This happens infrequently (i.e. we see no issues for a few weeks, then it happens sporadically for 10-15 minutes, and then everything is fine again).

Some context on our setup: we use AWS auth and have 100+ apps connecting to Vault. Most of these apps are batch jobs that pull secrets from Vault dynamically at run time, and we keep TTLs low. Typically a lot of jobs start on the hour, but there is no common theme I am able to determine here. The IPs in the log above belong to the ALB, and the SSL certificate we are using in Vault has not expired. As I'm sure you know, ALBs/ELBs do not validate backend SSL certificates anyway.
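
To illustrate the access pattern (not our actual code), a batch job does roughly the following with the official Go client; the address, token source, and secret path are placeholders, and the job is assumed to have already obtained a short-TTL token via the AWS auth method:

// fetch_secret.go: sketch of how one of the batch jobs pulls a secret at
// run time. Token source and secret path are placeholders.
package main

import (
	"fmt"
	"log"
	"os"

	vault "github.com/hashicorp/vault/api"
)

func main() {
	cfg := vault.DefaultConfig() // honors VAULT_ADDR, VAULT_CACERT, etc.
	client, err := vault.NewClient(cfg)
	if err != nil {
		log.Fatalf("creating client: %v", err)
	}

	// Token from the AWS auth login step; its short-lived lease is what
	// later shows up in the server log as "expiration: revoked lease: ...".
	client.SetToken(os.Getenv("VAULT_TOKEN"))

	secret, err := client.Logical().Read("secret/myapp/config") // placeholder path
	if err != nil || secret == nil {
		log.Fatalf("reading secret: %v", err)
	}
	fmt.Println("keys returned:", len(secret.Data))
}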

In monitoring our container, I never see CPU or memory balloon: CPU usage is minimal and memory never goes beyond 100MB even though we allot at least 512MB. Here is the Dockerfile we use to build our container:

ARG VAULT_VERSION

# Stage 0: use an AWS CLI image to pull our TLS certificate and key out of S3.
FROM mesosphere/aws-cli

ARG AWS_ACCESS_KEY_ID
ARG AWS_SECRET_ACCESS_KEY

RUN aws s3 cp --recursive s3://company-devops/ssl/company.com/ /vault/config/

# Final stage: the official Vault image, with the certs and config copied in.
FROM vault:$VAULT_VERSION

COPY --from=0 --chown=vault /vault/config/ /vault/config/
COPY --chown=vault config.hcl /vault/config/

RUN chmod 400 /vault/config/company.com.crt /vault/config/company.com.key

RUN mkdir -p /var/log/vault && chown -R vault:vault /var/log/vault

CMD ["server"]

andydawkins commented 4 years ago

Was there ever a solution to this? I have the same problem.

jefferai commented 4 years ago

The TLS error isn't causing the shutdown. That message is printed when the server command's shutdown channel is called (https://github.com/hashicorp/vault/blob/master/command/server.go#L659), and the shutdown channel is tied to SIGINT and SIGTERM (https://github.com/hashicorp/vault/blob/master/command/commands.go#L658).

Something is sending your container or the process in the container one of those signals.
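
Illustrative only (not Vault's actual code), but the behavior is roughly equivalent to a handler like this:

// Sketch: a server that waits for SIGINT or SIGTERM and then shuts down,
// which is roughly how the shutdown channel described above behaves.
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	shutdownCh := make(chan os.Signal, 1)
	signal.Notify(shutdownCh, syscall.SIGINT, syscall.SIGTERM)

	fmt.Println("==> Vault server started! (sketch)")
	sig := <-shutdownCh // blocks until the container runtime or OS sends a signal

	fmt.Printf("==> Vault shutdown triggered (received %s)\n", sig)
	// ... pre-seal teardown, stop listeners, mark as sealed ...
}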

andydawkins commented 4 years ago

Thanks @jefferai

For anybody else suffering the same problem, we believe the issue was that we were bumping up against the hard memory limit for the Docker container. We had set it to only 900MB, and when Vault hit that limit it would stop responding, the load balancer would fail its health checks, and the container would be killed.

The simple solution was to raise the hard memory limit to something closer to the hardware requirements listed on the HashiCorp website.

nocode99 commented 4 years ago

We found out it was because we had noisy neighbors on our ECS clusters, no matter how many resources we gave Vault. We ended up moving back to dedicated instances and no longer run into this issue.

jefferai commented 4 years ago

Sounds good, thanks both of you for following up! Will close.