nocode99 closed this issue 4 years ago
Was there ever a solution to this? I have the same problem.
The TLS error isn't causing the shutdown. That message is printed when the server command's shutdown channel is triggered (https://github.com/hashicorp/vault/blob/master/command/server.go#L659), and the shutdown channel is tied to SIGINT and SIGTERM (https://github.com/hashicorp/vault/blob/master/command/commands.go#L658).
Something is sending your container or the process in the container one of those signals.
Thanks @jefferai
For anybody else suffering from the same problem: we believe the issue was that we were bumping up against the hard memory limit for the Docker container. We had set it to only 900 MB, and when Vault hit this limit it would stop responding, the load balancer health checks would fail, and the container would be killed.
The simple solution was to raise the hard memory limit closer to the hardware requirements listed on the HashiCorp website.
We found out that in our case it was because we had busy neighbors on our ECS clusters, and it happened no matter how many resources we gave Vault. We ended up moving back to dedicated instances and don't run into this issue anymore.
Sounds good, thanks both of you for following up! Will close.
Environment:
Vault Config File:
Log Output:
Steps to Reproduce: We run this in ECS with an ALB. We build on top of the official HashiCorp Docker image to pull in our own SSL certificate.
Important Factoids: We used to run Vault on EC2 instances using Consul, but have since used the Vault operator to migrate our storage backend to S3. We then moved off of EC2 instances and switched to a single container behind an ALB. We run an HTTPS health check to /v1/sys/health.
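As a rough illustration of what that health check sees, the sketch below maps the status codes that `/v1/sys/health` returns by default (as I understand Vault's documented defaults) to node states. The function names are hypothetical, and the assumption that the ALB only accepts 200 reflects the ALB default success code, not necessarily the configuration used in this setup.

```go
package main

import "fmt"

// healthState (hypothetical helper) translates the default HTTP status
// codes of Vault's /v1/sys/health endpoint into a readable state.
func healthState(code int) string {
	switch code {
	case 200:
		return "initialized, unsealed, active"
	case 429:
		return "unsealed, standby"
	case 501:
		return "not initialized"
	case 503:
		return "sealed"
	default:
		return "unknown"
	}
}

// passesDefaultALBCheck assumes the load balancer treats only 200 as healthy.
func passesDefaultALBCheck(code int) bool {
	return code == 200
}

func main() {
	for _, code := range []int{200, 429, 501, 503} {
		fmt.Printf("%d: %s (passes check: %t)\n",
			code, healthState(code), passesDefaultALBCheck(code))
	}
}
```

A timeout or connection failure never produces a status code at all, so an unresponsive Vault fails the check just as a sealed one would.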
About 95% of the time, Vault runs with no issues, but occasionally it will shut down. Sometimes during these incidents it starts and stops a few times before running normally. This happens infrequently (i.e., we don't see issues for a few weeks, then it happens sporadically for 10-15 minutes, and then it is fine again).
Some context on our setup: we use AWS Auth and have 100+ apps connecting to Vault. Most of these apps are batch jobs, and we pull secrets from Vault dynamically at run time. We keep TTLs low. Typically, we'll have a lot of jobs starting on the hour, but there's no common theme I'm able to determine here. The IPs in the log above are from the ALB, and the SSL certificate we are using in Vault has not expired. As I'm sure you know, ALBs/ELBs do not validate SSL certificates anyway.
In monitoring our container, I never see CPU or memory balloon. CPU is minimal and memory never goes beyond 100 MB, while we allot at least 512 MB. I've added the Dockerfile we use to build our container: