vickysy84 opened this issue 2 years ago
Hey @vickysy84
Sorry to hear you're having this issue. I have a couple of questions that may help us figure this out. If we dig into this and it happens to be a Vault issue, I'll go ahead and transfer the issue so you don't have to make a new one.
So I have a couple of questions:
- If you run `grep 'Out of memory' /var/log/messages`, do you see any out of memory errors from consul?
- Could you run `consul kv export vault/logical/ > vault_secrets.json` to export your kv data, then share the output of `ls -lh vault_secrets.json`?
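Pulling both checks together, here's a minimal sketch (the log path and the `vault/logical/` prefix are assumptions based on a default Vault-on-Consul setup; adjust for your environment):

```shell
# 1. Check whether the kernel OOM killer has been hitting Consul
grep 'Out of memory' /var/log/messages

# 2. Export Vault's KV data from Consul and see how large it is
consul kv export vault/logical/ > vault_secrets.json
ls -lh vault_secrets.json
```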
I think that'd be a good start to understand what's going on, but I'd also recommend checking out our guides on Inspecting data in Consul storage and Performance Tuning Vault (which contains sections on Linux-specific steps and steps for the Consul backend).
Hi @vickysy84 ,
It would be helpful if you could show the contents of your Consul data directory on a problem node, with file names and sizes - e.g. something like this:
```
maxb@q:~$ tree -h /var/lib/consul/data/
/var/lib/consul/data/
├── [ 48] acl-tokens.json
├── [ 394] checkpoint-signature
├── [ 36] node-id
├── [4.0K] raft
│ ├── [2.3K] peers.info
│ ├── [ 24M] raft.db
│ └── [4.0K] snapshots
│ ├── [4.0K] 2-16536-1650095072556
│ │ ├── [ 283] meta.json
│ │ └── [1.1M] state.bin
│ └── [4.0K] 3-33531-1650096955978
│ ├── [ 283] meta.json
│ └── [2.5M] state.bin
└── [4.0K] serf
├── [ 100] local.snapshot
└── [ 78] remote.snapshot
5 directories, 11 files
```
(`ls -lhR /var/lib/consul/data` would also work, though less readably.)
In the absence of specific data about your environment, I'll say a few things about how Consul manages data in general:
Consul is primarily an in-memory data store. The working version of your entire data set is kept in RAM.

The data on disk is composed of full snapshots of previous versions of the entire data set (`raft/snapshots/*`) and a record of changes since the last snapshot (`raft/raft.db`).

Consul is hardcoded to retain 2 full snapshots on disk. In addition, it needs space to write out another full snapshot before deleting an old one. So, your data directory needs to be capable of storing 3 times the size of one complete directory under `raft/snapshots/`, plus a bit more for the rest of the working files.
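As a rough sizing check, something like this could help (a sketch only; the paths assume a data directory laid out like the `tree` output above, so adjust them to your own `data_dir`):

```shell
SNAPSHOT_DIR=/var/lib/consul/data/raft/snapshots

# Size of each complete snapshot directory, smallest to largest
du -sh "$SNAPSHOT_DIR"/*/ | sort -h

# Free space on the filesystem holding the data directory; you want room for
# roughly 3x the largest snapshot, plus headroom for raft/raft.db and the rest
df -h "$SNAPSHOT_DIR"
```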
Whilst a new snapshot directory is being written, it will have a `.tmp` suffix - e.g. `3-33531-1650096955978.tmp`.
Consul has a cascading failure mode where, if it is repeatedly interrupted whilst trying to write a snapshot, it fills up its `raft/snapshots/` directory with lots of `.tmp`-suffixed directories, which it never automatically cleans up. In normal operation there should only be zero or one `.tmp`-suffixed directory in `raft/snapshots/`, depending on whether a snapshot is currently being written.
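A quick way to see whether you've hit that failure mode (the path is illustrative, matching the layout above):

```shell
# Count leftover .tmp snapshot directories; in normal operation this is 0 or 1
ls -d /var/lib/consul/data/raft/snapshots/*.tmp 2>/dev/null | wc -l
```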
Hi, will this issue be fixed? I have a 7-node Consul cluster on VMs, and 3 of the nodes are out of space (I created VMs that only have 20 GB of storage each -- each one runs Vault and Consul together). I have lots of `*.tmp` directories that are never cleaned up by Consul.

Or at least, if we (as the users) need to clean them up manually, what's the safest command that we can execute?
Hi @aldy505 ,
This particular GitHub issue appears to be an unconfirmed user report, in which the original reporter has never responded to requests for additional information. Therefore, I don't think it's even possible to know what the issue is, let alone consider fixing it.
On the other hand, if you're looking for comments about what I said:
> Consul has a cascading failure mode where, if it is repeatedly interrupted whilst trying to write a snapshot, it fills up its `raft/snapshots/` directory with lots of `.tmp`-suffixed directories, which it never automatically cleans up.
then you are probably better off creating a new issue which is solely and clearly about that.
Personally, I (a community member only) have no idea whether HashiCorp have that on their roadmap.
I would recommend anyone running Vault with Consul storage these days to seriously consider migrating to Vault's built-in Raft storage, and eliminating Consul from the infrastructure. The migration is not simple, but eliminating Consul as a dependency of your Vault infrastructure is quite a payoff, and it's clearly the direction HashiCorp seem to be throwing most support behind long term.
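For reference, that migration is driven by `vault operator migrate` with a small config file. A rough sketch only, with placeholder addresses and paths - read HashiCorp's storage migration documentation and take backups before attempting it:

```shell
# migrate.hcl - placeholder values, adjust for your environment
cat > migrate.hcl <<'EOF'
storage_source "consul" {
  address = "127.0.0.1:8500"
  path    = "vault/"
}

storage_destination "raft" {
  path    = "/opt/vault/data"
  node_id = "vault-node-1"
}

cluster_addr = "https://127.0.0.1:8201"
EOF

# Vault must be stopped while the migration runs
vault operator migrate -config=migrate.hcl
```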
Consul only ever writes to one `.tmp`-suffixed snapshot directory at a time. Therefore you know it is safe to delete a `.tmp`-suffixed directory if any newer snapshot directory - either `.tmp`-suffixed or complete - exists. I would implement a cron job scanning for such directories and deleting them on any production Consul cluster.
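For example, a cleanup sketch along those lines (my own suggestion, not an official HashiCorp procedure; it keeps the most recently modified `.tmp` directory in case Consul is writing it right now, and deletes the older ones, which are safe to remove per the rule above):

```shell
SNAPSHOT_DIR=/var/lib/consul/data/raft/snapshots

# Newest-first listing of .tmp snapshot directories; skip the newest one and
# delete the rest. Suitable for running periodically from cron.
ls -dt "$SNAPSHOT_DIR"/*.tmp 2>/dev/null | tail -n +2 | xargs -r rm -rf
```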
Oh, and the Discourse discussion board at https://discuss.hashicorp.com is a good place for asking for advice about HashiCorp product operations, when the questions don't fit as direct bug reports.
Hi @maxb, thanks for the reply.
> This particular GitHub issue appears to be an unconfirmed user report, in which the original reporter has never responded to requests for additional information. Therefore, I don't think it's even possible to know what the issue is, let alone consider fixing it.
I'll consider making a separate issue to tackle `*.tmp` directory cleanup by Consul. But considering what you said here...
> I would recommend anyone running Vault with Consul storage these days to seriously consider migrating to Vault's built-in Raft storage, and eliminating Consul from the infrastructure. The migration is not simple, but eliminating Consul as a dependency of your Vault infrastructure is quite a payoff, and it's clearly the direction HashiCorp seem to be throwing most support behind long term.
I'll do some research on migrating away from Consul as the storage backend. Thanks for the tip.
Overview of the Issue
We use Consul as the storage backend for our Vault HA setup. The data directory usage suddenly spiked from ~70% to 100% on 2 of the 3 Consul nodes, even after we tried to clean up space and restart the Consul services.
Reproduction Steps
Steps to reproduce this issue, e.g.:
Operating system and Environment details
RHEL 7, 5 GB of space for the data-dir