hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Fatal error runtime: out of memory when Starting Consul agent #5327

Closed · haggag closed this 5 years ago

haggag commented 5 years ago

Overview of the Issue

I have a cluster with 5 servers on 0.9.4. It was working fine, then suddenly it started to crash with out-of-memory errors as listed below. It used to be on 0.8.1, but upgrading to 0.9.4 didn't help. Upscaling the servers from 30GB to 60GB helped for some time, then it started to occur again, so memory was doubled again!

raft.db is 86M, but there are a couple of folders under the snapshots folder, each with a state.bin of 3.8G!

It is also worth mentioning that since this incident started to occur, the start time takes several minutes (more than 7 minutes) even when a huge amount of memory is given, so probably lots of data is being loaded/manipulated.

After the long start time (a few minutes), the process looks like this:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4268 consul 20 0 66.1g 12g 22m S 15.3 5.4 14:27.22 consul

And free -m:

total used free shared buffers cached
Mem: 245857 72399 173458 0 20 4110
-/+ buffers/cache: 68267 177590
Swap: 0 0 0

After a few more minutes, the memory is freed by itself, and free -m looks like this:

total used free shared buffers cached
Mem: 245857 9718 236138 0 23 4127
-/+ buffers/cache: 5568 240289
Swap: 0 0 0

and

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4268 consul 20 0 66.1g 4.5g 24m S 27.9 1.9 15:12.56 consul

What could be demanding all of this memory, and how can I reduce it? I'm sure this isn't normal behaviour.

[root  raft]# ls -alh
total 81M
drwx------ 3 consul consul 4.0K Apr 19  2017 .
drwxrwxr-x 4 consul consul 4.0K Apr 26  2017 ..
-rwx------ 1 consul consul  967 Apr 19  2017 peers.info
-rw------- 1 consul consul  87M Feb  9 17:56 raft.db
drwx------ 4 consul consul 4.0K Feb  9 17:56 snapshots

[root snapshots]# ls
7732-1316111253-1549721044933  7732-1316119637-1549721355693
[root@consul06 snapshots]# ls -alh 7732-1316111253-1549721044933/
total 3.8G
drwx------ 2 consul consul 4.0K Feb  9 14:04 .
drwx------ 4 consul consul 4.0K Feb  9 17:56 ..
-rw------- 1 consul consul  640 Feb  9 14:06 meta.json
-rw------- 1 consul consul 3.8G Feb  9 14:06 state.bin

[root snapshots]# ls -alh  7732-1316119637-1549721355693
total 3.8G
drwx------ 2 consul consul 4.0K Feb  9 14:09 .
drwx------ 4 consul consul 4.0K Feb  9 17:56 ..
-rw------- 1 consul consul  640 Feb  9 14:11 meta.json
-rw------- 1 consul consul 3.8G Feb  9 14:11 state.bin

Consul info for both Client and Server

# consul info
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
build:
    prerelease = 
    revision = 40f243a+
    version = 0.9.4
consul:
    bootstrap = false
    known_datacenters = 2
    leader = false
    leader_addr = 10.0.13.84:8300
    server = true
raft:
    applied_index = 1316254592
    commit_index = 1316254592
    fsm_pending = 0
    last_contact = 73.943367ms
    last_log_index = 1316254592
    last_log_term = 7738
    last_snapshot_index = 1316252429
    last_snapshot_term = 7738
    latest_configuration = [{Suffrage:Voter ID:10.0.13.84:8300 Address:10.0.13.84:8300} {Suffrage:Voter ID:10.0.13.231:8300 Address:10.0.13.231:8300} {Suffrage:Voter ID:10.0.13.24:8300 Address:10.0.13.24:8300} {Suffrage:Voter ID:10.0.13.52:8300 Address:10.0.13.52:8300} {Suffrage:Voter ID:10.0.13.161:8300 Address:10.0.13.161:8300}]
    latest_configuration_index = 1316252853
    num_peers = 4
    protocol_version = 2
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Follower
    term = 7738
runtime:
    arch = amd64
    cpu_count = 32
    goroutines = 189
    max_procs = 2
    os = linux
    version = go1.9
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 809
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 7089
    members = 45
    query_queue = 0
    query_time = 42
serf_wan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 1170
    members = 8
    query_queue = 0
    query_time = 1

Operating system and Environment details

i3.xlarge (30 GB RAM) EC2 Server

Log Fragments

==> Starting Consul agent...
fatal error: runtime: out of memory

runtime stack:
runtime.throw(0x12c8b26, 0x16)
        /goroot/src/runtime/panic.go:605 +0x95
runtime.sysMap(0xcb0b300000, 0x100000, 0x45a200, 0x1c21998)
        /goroot/src/runtime/mem_linux.go:216 +0x1d0
runtime.(*mheap).sysAlloc(0x1c08180, 0x100000, 0x7fabe50a98e8)
        /goroot/src/runtime/malloc.go:470 +0xd7
runtime.(*mheap).grow(0x1c08180, 0x1, 0x0)
        /goroot/src/runtime/mheap.go:887 +0x60
runtime.(*mheap).allocSpanLocked(0x1c08180, 0x1, 0x1c219a8, 0x7fabe50a98e8)
        /goroot/src/runtime/mheap.go:800 +0x334
runtime.(*mheap).alloc_m(0x1c08180, 0x1, 0x9b000a, 0x7fabe50a98e8)
        /goroot/src/runtime/mheap.go:666 +0x118
runtime.(*mheap).alloc.func1()
        /goroot/src/runtime/mheap.go:733 +0x4d
runtime.systemstack(0xc420121e98)
        /goroot/src/runtime/asm_amd64.s:360 +0xab
runtime.(*mheap).alloc(0x1c08180, 0x1, 0x7fabe501000a, 0x7fabe50a98e8)
        /goroot/src/runtime/mheap.go:732 +0xa1
runtime.(*mcentral).grow(0x1c09750, 0x0)
        /goroot/src/runtime/mcentral.go:232 +0x94
runtime.(*mcentral).cacheSpan(0x1c09750, 0x7fabe04bfb60)
        /goroot/src/runtime/mcentral.go:106 +0x33a
runtime.(*mcache).refill(0x7fac112fc000, 0xc962e04d0a, 0x7fabe04bfb60)
        /goroot/src/runtime/mcache.go:123 +0xa4
runtime.(*mcache).nextFree.func1()
        /goroot/src/runtime/malloc.go:557 +0x32
runtime.systemstack(0xc420018000)
        /goroot/src/runtime/asm_amd64.s:344 +0x79
runtime.mstart()
        /goroot/src/runtime/proc.go:1125

goroutine 1 [running]:
runtime.systemstack_switch()
        /goroot/src/runtime/asm_amd64.s:298 fp=0xc420129e30 sp=0xc420129e28 pc=0x45a1d0
runtime.(*mcache).nextFree(0x7fac112fc000, 0xa, 0x1000f0101233e00, 0x0, 0x60)
        /goroot/src/runtime/malloc.go:556 +0xa9 fp=0xc420129e88 sp=0xc420129e30 pc=0x410649
runtime.mallocgc(0x40, 0x1199f20, 0x1071801, 0xdf)
        /goroot/src/runtime/malloc.go:711 +0x6fa fp=0xc420129f30 sp=0xc420129e88 pc=0x410f1a
runtime.newobject(0x1199f20, 0x0)
        /goroot/src/runtime/malloc.go:840 +0x38 fp=0xc420129f60 sp=0xc420129f30 pc=0x411258
github.com/hashicorp/consul/vendor/github.com/hashicorp/go-immutable-radix.(*Txn).insert(0xc420391740, 0xcb0b224180, 0xcb0b2ff7a0, 0x35, 0x60, 0xcb0b2ff7c7, 0xe, 0x39, 0xf85bc0, 0xcb0b2dd500, ...)
        /gopath/src/github.com/hashicorp/consul/vendor/github.com/hashicorp/go-immutable-radix/iradix.go:268 +0xad0 fp=0xc42012a0c8 sp=0xc420129f60 pc=0x6e73c0
...
haggag commented 5 years ago

Any idea what might be causing Consul to load/allocate such a huge amount of memory at start-up (more than 60 GB)?

banks commented 5 years ago

state.bin is the snapshot data from the state store. On startup each server has to read all of that into memory and replay any remaining logs from raft.db (which actually only stores the raft logs, not all the data) before it can serve.

The fact that memory grew suddenly and then grew again when you increased available RAM seems to indicate that you have a badly behaved client workload somewhere that is creating lots of data. That could be new KVs or it could be new service instances - anything stored in the state store that is being created much faster than it's being removed could cause this.

Since you can't get your servers to start, figuring out what is in that snapshot is a bit cumbersome as it's a binary file format. It's simple though. I have written a little tool for previous issues that is super unpolished but might help you if you are somewhat comfortable with Go. (On my very long list of things that would be nice to do is a version of this built into the consul binary that is easier to use...)

https://gist.github.com/banks/7ab62dc82a77bd20f819c9e6dd638ee5

You'll need go installed and the two dependencies in your gopath:

go get github.com/hashicorp/consul
go get github.com/hashicorp/go-msgpack

consul is largish, so I can understand not wanting to pull those onto a server even in a test env. So I'd recommend, if possible, that you copy one of the huge state files to a local dev machine and do it there. You could also compile the binary on a local machine (cross-compile if it's a different OS) and upload the binary to the server if you'd rather not move the data. If you're not familiar with Go, I appreciate this is not super trivial to do!

The script in the gist reads from stdin and dumps all the records in the snapshot to stdout. Since your snapshot is many gigabytes, the output will be huge if you leave it unmodified, so you can either modify it to not print the actual values (line 58) or pipe it to grep and filter it down to just the message type lines.

I'd try something like this (off the top of my head):

cat state.bin | go run dump-snap.go | grep 'Message Type' | sort | uniq -c

Which should give you the count of each type of record in the snapshot. They are numbers that correspond to the constants here: https://github.com/hashicorp/consul/blob/0fc1c203cc24a1522146de470afaff85a6d82d2f/agent/structs/structs.go#L35-L55
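For quick reference, here's a rough copy of that mapping as it stood around these versions - hand-copied, so double-check against the linked structs.go for your exact build:

package main

import "fmt"

// Message type values as they appear in agent/structs/structs.go around the
// 0.9.x/1.x era - a convenience reference, not an authoritative list.
var messageTypes = map[int]string{
	0:  "Register",
	1:  "Deregister",
	2:  "KVS",
	3:  "Session",
	4:  "ACL (deprecated)",
	5:  "Tombstone",
	6:  "CoordinateBatchUpdate",
	7:  "PreparedQuery",
	8:  "Txn",
	9:  "Autopilot",
	10: "Area",
	11: "ACLBootstrap",
}

func main() {
	// e.g. translate "Message Type 3" from the grep output above.
	fmt.Println(messageTypes[3]) // Session
}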

That should at least narrow it down to what data is ballooning your state. You can poke around further with the actual values to see what has been writing so much.

That at least should help identify the root cause. Once the actual issue is fixed (client misbehaviour stopped) you'll need to somehow clean the state. The easiest way is to restore a backup snapshot from before the growth started, if you have one.

If not, then it is possible to use a variant of the script above to write out a new state snapshot with the offending messages filtered out. You should then use the snapshot restore command with the filtered snapshot to ensure the raft logs also get ignored; otherwise they will re-populate your state store with some of the bad entries again. If you choose this path, I strongly recommend that you try it out in a test environment first and make very sure that the state is not corrupted in the process! (i.e. check that all the data is present as expected when Consul comes up after the restore.)

Hope this is helpful.

banks commented 5 years ago

The other part of your question was about why Consul would need more than 60GB of RAM to load a <4GB snapshot.

I don't know exactly, but I'd guess it's because the snapshot is a gzipped, serialised copy of the data, and when we expand that into memory we have to allocate the deserialisation buffers as well as the final full-sized structs (including full copies of any strings that are repeated, etc.) and insert them into a radix tree in memory (which has some overhead for the nodes, etc.).
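To make the last point concrete, here's a tiny, unscientific sketch using the same radix tree library Consul uses (go-immutable-radix). It's not Consul's actual load path, just an illustration of how a bulk load pays for decoded values, tree nodes, and short-lived copies on top of the raw data size:

package main

import (
	"fmt"

	iradix "github.com/hashicorp/go-immutable-radix"
)

func main() {
	r := iradix.New()
	// Each Insert returns a new tree and copies the nodes along the path
	// (that's what makes it immutable), so a bulk load allocates a lot of
	// short-lived memory on top of the final data size.
	for i := 0; i < 100000; i++ {
		key := []byte(fmt.Sprintf("kvs/app/config/%06d", i))
		val := fmt.Sprintf("value-%06d", i)
		r, _, _ = r.Insert(key, val)
	}
	fmt.Println("entries in tree:", r.Len())
}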

The fact you see memory freed after a while could just be due to the Go garbage collector catching up with the huge number of temporary allocations made while deserialising the data etc.

HisenWu commented 5 years ago

@banks Thank you for such a detailed reply. I read your answer carefully. I have encountered the same problem: https://github.com/hashicorp/consul/issues/5356

The difference is that the memory does not shrink and remains at a relatively fixed value.

In the environment where I have the problem, Consul stores a lot of KV data (about 4.5 million keys):

$ ll -h
total 31M
-rwxr-x--- 1 test test 2.3K Jan 18 17:06 peers.info
-rw------- 1 test test 47M Feb 20 09:57 raft.db
drwxr-x--- 4 test test 72 Feb 19 16:51 snapshots
$ du -sh *
4.0K    peers.info
31M raft.db
2.0G    snapshots
$ ll -h snapshots/*
snapshots/130-11058879-1550565021658:
total 1008M
-rw-r----- 1 test test 541 Feb 19 16:30 meta.json
-rw-r----- 1 test  test 1008M Feb 19 16:30 state.bin

snapshots/130-11058883-1550565042436:
total 1008M
-rw-r----- 1 test test 541 Feb 19 16:31 meta.json
-rw-r----- 1 test test 1008M Feb 19 16:31 state.bin

I want to confirm whether the following phenomenon is normal: as can be seen above, the data in the Consul data path is not large (raft.db ~47M, state.bin ~1G), but the memory usage exceeds 5.5G of RAM.

In addition, I really want to know where the memory is occupied. Is there any relationship between KV data, raft data, snapshot data and memory?

Looking forward to your reply!

banks commented 5 years ago

Is there any relationship between KV data, raft data, snapshot data and memory?

Yes! Consul is an in-memory database, so everything you store in it (KVs, catalog info) consumes memory.

The raft log in raft.db is only the last ~10k entries, so it shouldn't grow much or be directly correlated to memory usage. The state.bin file is a complete copy of the DB which is loaded directly into memory on start-up. So if that is large it means you are storing lots of info.

The reason your state.bin is 1GB but Consul takes up 5GB is likely due to a few factors, mostly the decoding overhead and in-memory indexing described in my earlier comment.

One correction: I incorrectly stated before that state.bin is gzipped - we do gzip snapshots you make using the manual snapshot command, but the ones raft makes in the background currently are not.

To debug why your state is big you'll need to use a tool like the one posted above. As mentioned I'd love to make this easier to use some day.

I actually just went and made a slightly cleaner version you can try: https://github.com/banks/consul-snapshot-tool

It's go-gettable and just outputs summary info in a reasonable format with no other options, so it should be simpler to use. Hope this helps.

HisenWu commented 5 years ago

Thank you very much for your quick reply.

I analyzed the snapshot with consul-snapshot-tool to get the size of the overall decoded data:

# cat state.bin | consul-snapshot-tool 
           Record Type    Count   Total Size
---------------------- -------- ------------
                   KVS  5342478        1.2GB
              Register      729      314.7KB
 CoordinateBatchUpdate      127       22.7KB
         PreparedQuery       34        9.9KB
      ACL (Deprecated)       12        9.1KB
             Autopilot        1         188B
          ACLBootstrap        1          44B
---------------------- -------- ------------
                         TOTAL:        1.2GB

2~4x

According to the information on the official performance page, RAM usage is 2-4 times the size of the complete set of data; 1.2 GB decoded × 2-4 ≈ 2.4-4.8 GB, which is in the same ballpark as the ~5.5 GB observed.

@banks At this point, my doubt about the raft/snapshot data not matching the memory usage has been answered. But I still have 2 questions:

Q1: Is the 2~4x relationship between the overall decoded data and the metric consul.runtime.alloc_bytes, rather than RAM?
Q2: Why is memory usage (RAM) larger than the metric consul.runtime.alloc_bytes? Shouldn't they be equal or similar?

Looking forward to your reply!

banks commented 5 years ago

I don't really understand Q1 - can you rephrase it please?

Q2 is a super in-depth question - the short answer is that they are measuring different things from different perspectives. You'd need to go and read a book on Linux virtual memory and Go's runtime allocation behaviour to get a full view of the reasons.

At a very high level though: consul.runtime.alloc_bytes reports what the Go runtime currently considers allocated on its heap, while the RSS you see in top/free is the set of pages the OS has mapped to the process, and the Go runtime only hands freed pages back to the OS lazily.

In other words, assuming you still have free RAM on the machine, the OS doesn't aggressively "take back" memory pages the Go runtime has freed until it needs to, so it will count the memory wired to that process higher still. This is from my fuzzy memory - I'd advise external research if you want to understand it in more depth!
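If you want to see the two perspectives side by side, a minimal sketch with Go's runtime.ReadMemStats shows the gap between live heap (roughly what consul.runtime.alloc_bytes reports) and memory obtained from the OS (which is closer to, but still not the same as, the RSS you see in top/free):

package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Allocate and drop a big chunk so the heap ends up with pages that are
	// freed from Go's point of view but possibly still resident.
	garbage := make([][]byte, 0, 1024)
	for i := 0; i < 1024; i++ {
		garbage = append(garbage, make([]byte, 1<<20)) // 1 MiB each
	}
	garbage = nil
	runtime.GC()

	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("HeapAlloc (live heap, ~alloc_bytes): %d MiB\n", m.HeapAlloc>>20)
	fmt.Printf("HeapIdle (freed by Go, may still be resident): %d MiB\n", m.HeapIdle>>20)
	fmt.Printf("Sys (obtained from the OS, closer to RSS): %d MiB\n", m.Sys>>20)
}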

Looks like all that data is in KV - maybe try looking at the keys and working out which application is storing increasing amounts of data without cleaning up?

HisenWu commented 5 years ago

Thank you very much for your detailed answer.

Your explanation has made me understand Q1. All my questions have been satisfactorily answered.

https://github.com/hashicorp/consul/issues/5356 has been closed.

banks commented 5 years ago

@haggag were you able to use the tool I made to make progress on this issue?

pinkal-vansia commented 5 years ago

Hi @banks @haggag, I used both of your tools to analyse the state file, which is 1.3G. The results are below.

$ cat state.bin | go run dump-snap.go | grep 'Message Type' | sort | uniq -c > debug.txt
$ cat debug.txt
    182 Message Type 0
      1 Message Type 11
 178326 Message Type 2
6740343 Message Type 3
      7 Message Type 4
      3 Message Type 5
     38 Message Type 6
      1 Message Type 9

and using consul-snapshot-tool

$ cat state.bin | src/github.com/banks/consul-snapshot-tool/consul-snapshot-tool > debug-2.txt
$ cat debug-2.txt
           Record Type    Count   Total Size
---------------------- -------- ------------
               Session  6740343        1.2GB
                   KVS   178326       55.1MB
              Register      182       60.6KB
 CoordinateBatchUpdate       38        6.8KB
      ACL (Deprecated)        7        2.5KB
             Tombstone        3         327B
             Autopilot        1         196B
          ACLBootstrap        1          52B
---------------------- -------- ------------
                         TOTAL:        1.3GB

Also, I managed to get the actual messages from the state file using the tool. There are lots of messages of the following type:

Message Type 3
map[Behavior:delete Checks:[serfHealth] CreateIndex:1338955164 ID:1352a379-56ee-bc00-89ba-49006d621efb LockDelay:15000000000 ModifyIndex:1338955164 Name: Node:worker21.us-east-1b.domain.net TTL:86400s]
Message Type 3
map[Behavior:delete Checks:[serfHealth] CreateIndex:1344421632 ID:1352a380-c156-0c63-0f51-68934d5c67a3 LockDelay:15000000000 ModifyIndex:1344421632 Name: Node:worker21.us-east-1b.domain.net TTL:86400s]
Message Type 3
map[Behavior:delete Checks:[serfHealth] CreateIndex:1358601842 ID:1352b055-ee01-3aca-d9d6-be276203e5e7 LockDelay:15000000000 ModifyIndex:1358601842 Name: Node:worker22.us-east-1b.domain.net TTL:86400s]
banks commented 5 years ago

Session 6740343 1.2GB

That's the culprit! I'd guess you have a client that is creating new sessions without any attached health checks and a very long TTL. That means they will stay around forever. Normally sessions are created linked to a health check of a specific service instance, so if that instance goes away or becomes unhealthy the session is automatically removed.

There are legitimate reasons to want to create sessions that are not attached to a health check, but if you do then it's your responsibility to clean them up rather than just creating a random new one on next restart of your app or similar.

I'd suggest you audit your clients and if you aren't aware of them directly creating sessions it might be something deep in the client library you are using that is either buggy or is being misconfigured in your clients. See https://www.consul.io/api/session.html. Hope you catch those badly behaved clients!
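For what it's worth, a minimal sketch of a well-behaved session lifecycle with the official Go API client (github.com/hashicorp/consul/api) looks something like this - the name and TTL below are placeholders, not anything from your setup:

package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	sess := client.Session()

	// Short TTL + delete behavior: if the app dies without cleaning up, the
	// session (and anything it holds) goes away when the TTL expires.
	id, _, err := sess.Create(&api.SessionEntry{
		Name:     "my-app-lock", // placeholder
		TTL:      "30s",
		Behavior: api.SessionBehaviorDelete,
	}, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer sess.Destroy(id, nil) // explicit cleanup on normal shutdown

	// Keep the session alive only for as long as the app actually needs it.
	stopCh := make(chan struct{})
	go sess.RenewPeriodic("30s", id, nil, stopCh)

	doWork() // placeholder for the app's real work
	close(stopCh)
}

func doWork() {}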

Note you can also list sessions and then delete them, maybe just the ones that seem old, and then see if you still have bad clients creating new ones quickly or if it was just a one-off thing in the past!
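And a rough sketch of that cleanup with the same Go API client - the CreateIndex cutoff is a made-up placeholder, so run it print-only first and only enable the destroy once you're sure it matches nothing but stale sessions:

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	sess := client.Session()

	sessions, _, err := sess.List(nil)
	if err != nil {
		log.Fatal(err)
	}

	// Placeholder cutoff: sessions created before a raft index you know
	// predates the runaway growth.
	const staleBeforeIndex = 1300000000
	for _, s := range sessions {
		if s.CreateIndex < staleBeforeIndex {
			fmt.Printf("stale session %s on node %s (created at index %d)\n",
				s.ID, s.Node, s.CreateIndex)
			if _, err := sess.Destroy(s.ID, nil); err != nil {
				log.Printf("failed to destroy %s: %v", s.ID, err)
			}
		}
	}
}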

pinkal-vansia commented 5 years ago

Thanks @banks - is this value correct in the message? In code we are using the default 15s value: LockDelay:15000000000

pinkal-vansia commented 5 years ago

@banks There are literally millions of Type 3 messages. Is there a way to delete them from the state.bin file easily, as a short-term measure while we are auditing our clients?

pinkal-vansia commented 5 years ago

Hello @banks. Today, for a routine MW activity, I stopped the Consul agent on one of the worker servers for a few minutes. After 10-15 minutes I saw a sudden drop in the size of the state.bin file, from 2.9 GB to 1.6 GB. This led me to stop-start the Consul agent on all worker servers, and as a result, within a few minutes the overall size of state.bin dropped to 393 MB. But since then it has been steadily increasing at a rate of ~5 MB/hour.

banks commented 5 years ago

Hey @pinkal-vansia the growth seems to be in sessions that are not being deleted or cleaned up. In other words your clients (worker servers) are using sessions in a way that leaves a lot of garbage around. Have you been able to debug anything about what clients are doing?

stale[bot] commented 5 years ago

Hey there, This issue has been automatically closed because there hasn't been any activity for at least 90 days. If you are still experiencing problems, or still have questions, feel free to open a new one :+1: