Any idea what might be causing Consul to load/allocate such a huge amount of memory at start up? (more than 60 GB)
state.bin is the snapshot data from the state store. On startup each server has to read all of that into memory and replay any remaining logs from raft.db (which actually only stores the raft logs, not all data) before it can serve.
The fact that memory grew suddenly and then grew again when you increased available RAM seems to indicate that you have a badly behaved client workload somewhere that is creating lots of data. That could be new KVs or it could be new service instances - anything stored in the state store that is being created much faster than it's being removed could cause this.
Since you can't get your servers to start, figuring out what is in that snapshot is a bit cumbersome as it's a binary file format - though the format itself is simple. I have written a little tool for previous issues that is super unpolished but might help you if you are somewhat comfortable with Go. (On my very long list of things that would be nice to do is a version of this built into the consul binary that is easier to use...)
https://gist.github.com/banks/7ab62dc82a77bd20f819c9e6dd638ee5
You'll need Go installed and the two dependencies in your GOPATH:
go get github.com/hashicorp/consul
go get github.com/hashicorp/go-msgpack
consul is largish so I can understand not wanting to pull those onto a server even in a test env. So I'd recommend, if possible, copying one of the huge state files to a local dev machine and doing it there. You could also compile the binary on a local machine (cross-compile if it's a different OS) and upload the binary to the server if you'd rather not move the data. If you're not familiar with Go, I appreciate this is not super trivial to do!
The script in the gist reads from stdin and dumps all the records in the snapshot to stdout. Since your snapshot is many gigabytes, the output will be huge if you leave it unmodified, so you can either modify it to not print the actual values (line 58) or pipe it to grep and filter it down to just the message type lines.
I'd try something like this (off the top of my head):
cat state.bin | go run dump-snap.go | grep 'Message Type' | sort | uniq -c
Which should give you the count of each type of record in the snapshot. They are numbers that correspond to the constants here: https://github.com/hashicorp/consul/blob/0fc1c203cc24a1522146de470afaff85a6d82d2f/agent/structs/structs.go#L35-L55
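For reference, here's a rough mapping of those numeric types to names - a sketch based on the constants linked above, so double-check it against the structs.go of your Consul version (types 1, 8 and 10 in particular are my best guess):

package main

import "fmt"

// Rough mapping of the numeric message types printed by dump-snap.go to
// readable names, mirroring the MessageType constants linked above.
var messageTypeNames = map[int]string{
	0:  "Register",
	1:  "Deregister",
	2:  "KVS",
	3:  "Session",
	4:  "ACL (deprecated)",
	5:  "Tombstone",
	6:  "CoordinateBatchUpdate",
	7:  "PreparedQuery",
	8:  "Txn",
	9:  "Autopilot",
	10: "Area",
	11: "ACLBootstrap",
}

func main() {
	for t, name := range messageTypeNames {
		fmt.Printf("Message Type %d = %s\n", t, name)
	}
}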
That should at least narrow it down to what data is ballooning your state. You can poke around further with the actual values to see what has been writing so much.
That at least should help identify the root cause. Once the actual issue is fixed (client misbehaviour stopped) you'll need to somehow clean up the state. The easiest way is to restore a backup snapshot from before the growth started, if you have one.
If not, then it is possible to use a variant of the script above to write out a new state snapshot with the offending messages filtered out. You should then use the snapshot restore command with the filtered snapshot to ensure the raft logs also get ignored, otherwise they will re-populate your state store with some of the bad entries again. If you choose this path, I strongly recommend that you try it out in a test environment first and make very sure that the state is not corrupted in the process! (i.e. check that all the data is present as expected when Consul comes up after the restore.)
Hope this is helpful.
The other part of your question was about why Consul would need more than 60GB of RAM to load a <4GB snapshot.
I don't know exactly, but I'd guess it's because the snapshot is a gzipped, serialised copy of the data, and when we expand that into memory we have to allocate the deserialisation buffers as well as the final full-sized structs (including full copies of any strings that are repeated, etc.) and insert them into an in-memory radix tree (which has some overhead for the nodes, etc.).
The fact you see memory free up after a while could just be due to the Go garbage collector catching up with the huge number of temporary allocations made while deserialising the data.
@banks Thank you for such a detailed reply. I read your answer carefully. I have encountered the same problem: https://github.com/hashicorp/consul/issues/5356
The difference is that the memory is not reduced and remains at a relatively fixed value.
In the scenario where I have the problem, Consul stores a lot of KV data (about 4.5 million keys).
$ ll -h
total 31M
-rwxr-x--- 1 test test 2.3K Jan 18 17:06 peers.info
-rw------- 1 test test 47M Feb 20 09:57 raft.db
drwxr-x--- 4 test test 72 Feb 19 16:51 snapshots
$ du -sh *
4.0K peers.info
31M raft.db
2.0G snapshots
$ ll -h snapshots/*
snapshots/130-11058879-1550565021658:
total 1008M
-rw-r----- 1 test test 541 Feb 19 16:30 meta.json
-rw-r----- 1 test test 1008M Feb 19 16:30 state.bin
snapshots/130-11058883-1550565042436:
total 1008M
-rw-r----- 1 test test 541 Feb 19 16:31 meta.json
-rw-r----- 1 test test 1008M Feb 19 16:31 state.bin
I want to confirm whether the following phenomenon is normal: as can be seen above, the data in Consul's data path is not large (raft.db ~47M, state.bin ~1G), but memory usage exceeds 5.5G of RAM.
In addition, I really want to know where the memory goes. Is there any relationship between KV data, raft data, snapshot data and memory?
Looking forward to your reply!
Is there any relationship between KV data, raft data, snapshot data and memory?
Yes! Consul is an in-memory database, so everything you store in it (KVs, catalog info) consumes memory.
The raft log in raft.db only holds the last ~10k entries, so it shouldn't grow much or be directly correlated with memory usage. The state.bin file is a complete copy of the DB, which is loaded directly into memory on startup. So if that is large, it means you are storing lots of info.
The reason your state.bin is 1GB but Consul takes up 5GB is likely due to a few factors:
* I incorrectly stated before that state.bin is gzipped - we do gzip snapshots you make using the manual snapshot command, but the ones raft makes in the background are not currently.
* As described above, the compact serialised records in the snapshot become full-sized Go structs in memory and get inserted into radix-tree indexes, which adds per-entry overhead.
* The Go garbage collector doesn't immediately hand back memory used for temporary allocations during deserialisation, so the process can hold more than the live data strictly needs.
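To make the "in-memory database" point concrete, here is a minimal sketch of the kind of indexed, radix-tree-backed store involved. Consul's state store is built on go-memdb; the schema and KV struct below are illustrative toys, not Consul's actual schema.

package main

import (
	"fmt"

	memdb "github.com/hashicorp/go-memdb"
)

// Toy KV entry; Consul's real state-store structs carry far more fields.
type KV struct {
	Key   string
	Value []byte
}

func main() {
	// Each table is indexed with radix trees, so every entry costs more
	// RAM in memory than its serialised size on disk in state.bin.
	schema := &memdb.DBSchema{
		Tables: map[string]*memdb.TableSchema{
			"kvs": {
				Name: "kvs",
				Indexes: map[string]*memdb.IndexSchema{
					"id": {
						Name:    "id",
						Unique:  true,
						Indexer: &memdb.StringFieldIndex{Field: "Key"},
					},
				},
			},
		},
	}

	db, err := memdb.NewMemDB(schema)
	if err != nil {
		panic(err)
	}

	// Writes go into the in-memory index; nothing is paged out to disk.
	txn := db.Txn(true)
	if err := txn.Insert("kvs", &KV{Key: "foo/bar", Value: []byte("baz")}); err != nil {
		panic(err)
	}
	txn.Commit()

	read := db.Txn(false)
	raw, err := read.First("kvs", "id", "foo/bar")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", raw.(*KV))
}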
To debug why your state is big you'll need to use a tool like the one posted above. As mentioned I'd love to make this easier to use some day.
I actually just went and made a slightly cleaner version you can try: https://github.com/banks/consul-snapshot-tool
It's go-gettable and just outputs summary info in a reasonable format with no other options so should be simpler to use. Hope this helps.
Thank you very much for your quick reply. I ran consul-snapshot-tool:
# cat state.bin | consul-snapshot-tool
Record Type Count Total Size
---------------------- -------- ------------
KVS 5342478 1.2GB
Register 729 314.7KB
CoordinateBatchUpdate 127 22.7KB
PreparedQuery 34 9.9KB
ACL (Deprecated) 12 9.1KB
Autopilot 1 188B
ACLBootstrap 1 44B
---------------------- -------- ------------
TOTAL: 1.2GB
According to the information on the official performance docs, RAM usage should be 2-4 times the size of the complete set of data.
consul.runtime.alloc_bytes: 3622368000
For more detail: https://gist.github.com/HisenWu/892031e8d38d75e88c834afd22ac983f
@banks At this point, the doubt about the raft/snapshot data not matching the memory usage has been answered. But there are 2 remaining questions:
Q1: Is the 2~4x figure the relationship between the overall decoded data and the metric consul.runtime.alloc_bytes, not RAM?
Q2: Why is memory usage (RAM) larger than the metric consul.runtime.alloc_bytes? Should they be equal or similar?
Looking forward to your reply!
I don't really understand Q1 - can you rephrase it, please?
Q2 is a super in-depth question - the short answer is that they are measuring different things from different perspectives. You'd need to go and read a book on Linux virtual memory and Go's runtime allocation behaviour to get a full view of the reasons.
At a very high level though:
* consul.runtime.alloc_bytes is the amount of memory Go's allocator currently has allocated and not yet freed (this is the "real" amount of memory the process is actively using).
* The RAM usage the OS reports is measured from the OS's perspective: assuming you still have free RAM on the machine, the OS doesn't aggressively "take back" memory pages the Go runtime has freed until it needs to, so it will count the memory wired to that process higher still.
This is from my fuzzy memory - I'd advise external research if you want to understand it in more depth!
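If you want to see that distinction for yourself, the Go runtime exposes these counters directly. A small standalone sketch (not Consul code; as far as I recall consul.runtime.alloc_bytes is fed from the same runtime stats):

package main

import (
	"fmt"
	"runtime"
)

func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)

	// Alloc is roughly what consul.runtime.alloc_bytes reports: bytes of
	// live heap objects the Go allocator is currently holding.
	fmt.Println("Alloc        :", m.Alloc)
	// HeapIdle is memory the runtime has freed but still keeps mapped;
	// the OS continues to count most of it against the process RSS.
	fmt.Println("HeapIdle     :", m.HeapIdle)
	// HeapReleased has actually been returned to the OS.
	fmt.Println("HeapReleased :", m.HeapReleased)
	// Sys is the total the runtime has obtained from the OS - closer to
	// what top/free show for the process than Alloc is.
	fmt.Println("Sys          :", m.Sys)
}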
Looks like all that data is in KV - maybe try looking at the keys and working out which application is storing increasing amounts of data without cleaning up?
Thank you very much for your detailed answer.
Your explanation has made me understand Q1. All my questions have been satisfactorily answered.
https://github.com/hashicorp/consul/issues/5356 has been closed.
@haggag were you able to use the tool I made to make progress on this issue?
Hi @banks @haggag I used both of your tools to analyse the state file which is 1.3G. Result is below.
$ cat state.bin | go run dump-snap.go | grep 'Message Type' | sort | uniq -c > debug.txt
$ cat debug.txt
182 Message Type 0
1 Message Type 11
178326 Message Type 2
6740343 Message Type 3
7 Message Type 4
3 Message Type 5
38 Message Type 6
1 Message Type 9
and using consul-snapshot-tool
$ cat state.bin | src/github.com/banks/consul-snapshot-tool/consul-snapshot-tool > debug-2.txt
$ cat debug-2.txt
Record Type Count Total Size
---------------------- -------- ------------
Session 6740343 1.2GB
KVS 178326 55.1MB
Register 182 60.6KB
CoordinateBatchUpdate 38 6.8KB
ACL (Deprecated) 7 2.5KB
Tombstone 3 327B
Autopilot 1 196B
ACLBootstrap 1 52B
---------------------- -------- ------------
TOTAL: 1.3GB
Also I managed to get the actual messages from the state file using the tool. There are lots of messages of the following type:
Message Type 3
map[Behavior:delete Checks:[serfHealth] CreateIndex:1338955164 ID:1352a379-56ee-bc00-89ba-49006d621efb LockDelay:15000000000 ModifyIndex:1338955164 Name: Node:worker21.us-east-1b.domain.net TTL:86400s]
Message Type 3
map[Behavior:delete Checks:[serfHealth] CreateIndex:1344421632 ID:1352a380-c156-0c63-0f51-68934d5c67a3 LockDelay:15000000000 ModifyIndex:1344421632 Name: Node:worker21.us-east-1b.domain.net TTL:86400s]
Message Type 3
map[Behavior:delete Checks:[serfHealth] CreateIndex:1358601842 ID:1352b055-ee01-3aca-d9d6-be276203e5e7 LockDelay:15000000000 ModifyIndex:1358601842 Name: Node:worker22.us-east-1b.domain.net TTL:86400s]
Session 6740343 1.2GB
That's the culprit! I'd guess you have a client that is creating new sessions without any attached health checks and a very long TTL. That means they will stay around forever. Normally sessions are created linked to a health check of a specific service instance, so if that instance goes away or becomes unhealthy the session is automatically removed.
There are legitimate reasons to want to create sessions that are not attached to a health check, but if you do then it's your responsibility to clean them up rather than just creating a random new one on next restart of your app or similar.
I'd suggest you audit your clients and if you aren't aware of them directly creating sessions it might be something deep in the client library you are using that is either buggy or is being misconfigured in your clients. See https://www.consul.io/api/session.html. Hope you catch those badly behaved clients!
Note you can also list sessions and then delete them (maybe just the ones that seem old) and then see if you still have bad clients creating new ones quickly, or if it was just a one-off thing in the past!
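For example, something along these lines with the official Go API client would let you inventory sessions and sweep stale ones. This is just a sketch - the "stale" cutoff below is a made-up placeholder, so pick your own criterion and try it in a test environment first:

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Connects to the local agent; CONSUL_HTTP_ADDR etc. are honoured.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	sessions, _, err := client.Session().List(nil)
	if err != nil {
		log.Fatal(err)
	}

	// Hypothetical cutoff: treat sessions created before this raft index
	// as stale. Sessions have no creation timestamp, so CreateIndex (or
	// the Name/Node fields) is what you have to filter on.
	const staleBefore = uint64(1300000000)

	for _, s := range sessions {
		fmt.Printf("%s node=%s ttl=%s checks=%v createIndex=%d\n",
			s.ID, s.Node, s.TTL, s.Checks, s.CreateIndex)

		if s.CreateIndex < staleBefore {
			if _, err := client.Session().Destroy(s.ID, nil); err != nil {
				log.Printf("failed to destroy session %s: %v", s.ID, err)
			}
		}
	}
}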
thanks @banks
Is this value correct in the message? In our code we are using the default 15s value: LockDelay:15000000000
@banks There are literally millions of Type 3 messages. Is there a way to delete them from the state.bin file easily, as a short-term measure while we audit our clients?
Hello @banks. Today, for a routine MW (maintenance window) activity, I stopped the consul agent on one of the worker servers for a few minutes. After 10-15 mins I saw a sudden drop in the size of the state.bin file, from 2.9 GB to 1.6 GB. This led me to stop and start the consul agent on all worker servers, and as a result the overall size of state.bin dropped to 393 MB within a few minutes. But it has since been increasing steadily at a rate of ~5 MB/h.
Hey @pinkal-vansia the growth seems to be in sessions that are not being deleted or cleaned up. In other words your clients (worker servers) are using sessions in a way that leaves a lot of garbage around. Have you been able to debug anything about what clients are doing?
Hey there, This issue has been automatically closed because there hasn't been any activity for at least 90 days. If you are still experiencing problems, or still have questions, feel free to open a new one :+1:
Overview of the Issue
I have a cluster with 5 servers on 0.9.4. It was working fine, then suddenly it started to crash with out-of-memory errors as listed below. It used to be on 0.8.1, but upgrading to 0.9.4 didn't help. Upscaling the servers from 30GB to 60GB helped for some time, then it started to occur again, so memory was doubled again!
Raft.db is 86M, but there are a couple of folders under the snapshots folder, with the state.bin in each being 3.8G!
It's also worth mentioning that since this incident started to occur, starting up takes several minutes (more than 7 mins) even when a huge amount of memory is given, so probably lots of data is being loaded/manipulated.
After the long start time (within a few mins), the process looks like this:
PID  USER     PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
4268 consul   20   0 66.1g  12g  22m S 15.3  5.4 14:27.22  consul
And free -m:
                     total       used       free     shared    buffers     cached
Mem:                245857      72399     173458          0         20       4110
-/+ buffers/cache:              68267     177590
Swap:                    0          0          0
After a few more mins, the memory is freed by itself and looks like:
free -m
                     total       used       free     shared    buffers     cached
Mem:                245857       9718     236138          0         23       4127
-/+ buffers/cache:               5568     240289
Swap:                    0          0          0
and
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4268 consul 20 0 66.1g 4.5g 24m S 27.9 1.9 15:12.56 consul
What could be demanding all of this memory, and how can I reduce it? I'm sure this isn't normal behaviour.
Consul info for both Client and Server
Operating system and Environment details
i3.xlarge (30 GB RAM) EC2 Server
Log Fragments