ipfs daemon memory usage grows over time: killed by OOM after 10-12 days of running

hsanjuan commented 7 years ago

Version information:

go-ipfs version: 0.4.5-dev-
Repo version: 4
System version: arm/linux
Golang version: go1.7

Type: Problem

Priority: P4

Description:

I have some Raspberry Pi 3s running the go-ipfs daemon. Right now they don't do anything: the Pis don't handle any IPFS requests, they just sit there running the daemons. After about 10 days ipfs gets killed on all of them because it is taking too much memory.

The daemons are killed around RSS=783192.
My longest running daemon (11 days) has RSS=605868.
A newly started daemon has RSS=92020.
A one-day running daemon has RSS=542408.
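As a rough back-of-the-envelope check on the figures above, the growth per day between two samples can be computed as below. This is only a sketch: the samples come from different daemons, and the assumption that the RSS values are in KiB (as ps reports them) is mine, not stated in the report.

```python
# RSS samples quoted above, as (days of uptime, RSS). Units are assumed to be KiB.
samples = [(0, 92020), (1, 542408), (11, 605868)]
OOM_KILL_RSS = 783192  # approximate RSS at which the daemons get killed


def growth_per_day(earlier, later):
    """Average RSS growth per day between two (days, rss) samples."""
    (d0, r0), (d1, r1) = earlier, later
    return (r1 - r0) / (d1 - d0)


first_day = growth_per_day(samples[0], samples[1])
later_days = growth_per_day(samples[1], samples[2])

print(f"first day:    {first_day:.0f} per day")
print(f"days 1 to 11: {later_days:.0f} per day")
print(f"headroom left at day 11: {OOM_KILL_RSS - samples[2][1]}")
```

On these numbers most of the growth happens within the first day, followed by a much slower climb.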

Questions:

What causes memory usage to steadily grow even if the daemons get no usage other than being running?
Is there a way to limit it?
Do we need to gather more information on this? If so, what's the best way and how can I help?

Related: #3318 and the question about running IPFS on platforms with limited resources.

jonnycrunch commented 7 years ago

Same here:

ipfs version 0.4.3
Ubuntu 16.04 (4.4.0-47-generic)
go-lang 1.7

After about 10 days, memory grows to about 15G despite only a few hundred files being pinned. The issue is replicated across 10 servers. Restarting the daemon fixes it temporarily, but memory continues to grow and the daemon needs to be restarted again.

UPDATE: Ah, ha! I found the enable garbage collection flag in the documentation, so trying:

ipfs daemon --enable-gc

whyrusleeping commented 7 years ago

@jonnycrunch the --enable-gc flag refers to disk gc, not memory gc.

The memory leakage is coming from somewhere else... Next time the memory gets out of hand can you get me the debug info described here: https://github.com/ipfs/go-ipfs/blob/master/docs/debug-guide.md#beginning

Particularly the heap profile, the goroutine dump and the ipfs binary.
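For anyone who wants to script the collection, here is a minimal sketch. It assumes the daemon's API listens on the default local address (127.0.0.1:5001) and that the heap and goroutine dumps are served at Go's usual pprof paths, as I understand the debug guide to describe; the output file names are only illustrative.

```python
import urllib.request

# Assumed: the daemon's API listens on the default local address.
API = "http://127.0.0.1:5001"

# Endpoint paths as I understand them from the debug guide (Go's net/http/pprof).
ENDPOINTS = {
    "ipfs.heap": "/debug/pprof/heap",
    "ipfs.stacks": "/debug/pprof/goroutine?debug=2",
}


def dump_urls(api=API):
    """Map each output file name to the URL it should be downloaded from."""
    return {name: api + path for name, path in ENDPOINTS.items()}


def collect(api=API):
    """Download every dump into the current directory."""
    for name, url in dump_urls(api).items():
        with urllib.request.urlopen(url, timeout=120) as resp, open(name, "wb") as out:
            out.write(resp.read())
```

Calling `collect()` while the daemon is running should leave the two dump files in the current directory; the ipfs binary itself still has to be copied separately.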

koriaf commented 7 years ago

Hi! We are using ipfs 0.4.4 on Linux 4.4.35-33.55.amzn1.x86_64 #1 SMP Tue Dec 6 20:30:04 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Currently it uses 65-76% of memory on a 2GB instance. The OOM killer sometimes kills it, it starts again, and usage grows over several hours back to that level. That seems to be enough for the daemon not to be killed; maybe it uses some smart way to determine how not to be killed :-) While experimenting with memory limits I saw that usage grows to fill all available memory but no more (no swap is used for IPFS, but other applications may have problems with available memory).

ipfs/QmaB2FJr1Z6yGRy9G37aXsBirR43Lc9ya3Q29R4gMYDVDv - dumps. Should I recreate my node after sharing these files? Could they contain private keys or other data? The node is disposable and contains no private files yet, but it may in the future.

I also noticed that after running disk gc (ipfs repo gc) memory usage decreased from 70% to 65%, but after adding this debug directory it is back at 75% of total host memory.

I have no idea how Go works, so if you need more debug info or this is unhelpful, please feel free to ask for more details.

I also have an ipfs node running on a 512MB DigitalOcean instance, managed by supervisord. The OOM killer kills it there pretty fast (within several hours), supervisord restarts it, and it dies again and again, but it generally works okay.

come-maiz commented 7 years ago

Carla Sella, from the Ubuntu community, reports that using the ipfs v0.4.4, her virtualbox vm starts to get slow after it connects to over 70 peers. Here are her debugging files. ipfs.tar.gz

jonnycrunch commented 7 years ago

Maybe it is time for Garbage Collection to be enabled by default? @whyrusleeping @RichardLitt @diasdavid

Kubuxu commented 7 years ago

@jonnycrunch as @whyrusleeping said, the --enable-gc flag is datastore garbage collection, not the program garbage collection.

The core problem is what we call "connection closing": IPFS currently connects to almost everyone, which, combined with the muxer implementation we are currently using, takes a lot of memory. We are working on reducing it, but it might take a while. Connection closing is a much harder problem than we initially expected.

The --enable-gc flag shouldn't matter, it might reduce memory usage a bit, but it isn't the core problem as far as I know.

hsanjuan commented 7 years ago

This is the debugging information I have collected from 1 node that was still running (2 have died):

https://ipfs.io/ipfs/QmXnYzZT1EAq9pzi6snd6KHD8kNrBSDuyJqLPe7QHzUE23

It was also using 150% CPU when I checked it and >80% MEM. They are still on 0.4.5-pre1 though.

bdimych commented 7 years ago

Stack dump from #61. This is a VPS running 64-bit CentOS 7 with 1GB of memory; the ipfs daemon crashed 5 days after start: ipfs-crash-May-07-grep-ipfs-var-log-messages.zip

ipfs package go-ipfs_v0.4.8_linux-amd64.tar.gz

whyrusleeping commented 6 years ago

Hey everyone, ipfs 0.4.11 should have some significant improvements here. The issue is not entirely resolved, but the leak should be mitigated.

maznu commented 6 years ago

Still leaking memory in 0.4.13 — killed after ~12 hours.

Stebalien commented 6 years ago

At the moment, the largest issue is the peerstore. We had a rather nasty bug that will be fixed in the next release (we, uh, kind of didn't forget any address of any peer to which we had ever connected and, worse, advertised these (sometimes ephemeral) addresses to the network..).

victorb commented 6 years ago

@Stebalien

that will be fixed in the next release

Does that mean that the fix is already in master or is work in progress?

Stebalien commented 6 years ago

Fixed in a dep. PR pending: #4610

paralin commented 6 years ago

I profiled it and it seems like a lot of the CPU waste is surprisingly in AddAddrs in the AddrManager. Reading that code, it seems very hasty and not performance minded. I'll PR something to go-libp2p-peerstore to optimize that with concurrent maps, which should help.

Stebalien commented 6 years ago

I'll PR something to go-libp2p-peerstore to optimize that with concurrent maps, which should help.

Unfortunately, the issue is https://github.com/libp2p/go-libp2p-peerstore/issues/26 and the fact that the number of multiaddrs assigned to a peer can grow unchecked*. The peerstore actually works fine with a sane number of addresses.

*The previous version of go-ipfs failed to forget observed multiaddrs for peers and, worse, would gossip these observed multiaddrs. That, combined with NATs and ephemeral ports, led to a buildup of addresses for some peers.

The solution to this is really to sign peer address records (should be doing this anyways), enforce a maximum number of addresses, and require that there only be one valid peer address record per peer.
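To make the "maximum number of addresses" part concrete, here is a small sketch of a capped, expiring address store. `BoundedAddrBook` and its parameters are hypothetical names for illustration, not the go-libp2p-peerstore API, and the sketch does not cover signed records.

```python
import time


class BoundedAddrBook:
    """Hypothetical per-peer address store with a hard cap and expiry."""

    def __init__(self, max_addrs=8, ttl=600.0, clock=time.monotonic):
        self.max_addrs = max_addrs
        self.ttl = ttl
        self.clock = clock
        self.addrs = {}  # peer id -> {address: expiry time}

    def add(self, peer, addr):
        now = self.clock()
        entries = self.addrs.setdefault(peer, {})
        # Forget addresses whose TTL has passed.
        for old in [a for a, exp in entries.items() if exp <= now]:
            del entries[old]
        entries[addr] = now + self.ttl
        # Enforce the cap by dropping the entries that expire soonest.
        while len(entries) > self.max_addrs:
            del entries[min(entries, key=entries.get)]

    def get(self, peer):
        now = self.clock()
        return [a for a, exp in self.addrs.get(peer, {}).items() if exp > now]


book = BoundedAddrBook(max_addrs=3)
for port in range(4001, 4011):  # ten ephemeral NAT ports observed for one peer
    book.add("QmPeer", f"/ip4/203.0.113.7/tcp/{port}")
print(len(book.get("QmPeer")))
```

However many ephemeral ports are observed for a peer, the store never holds more than `max_addrs` entries for it, so per-peer memory stays bounded.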

paralin commented 6 years ago

Yeah, but that code is still unoptimized and in general really rough, even for a small number of addresses. Agreed that there is a bigger reason though as you describe.

maznu commented 5 years ago

Still leaking memory in 0.4.18, between 0-100kB/sec (averaging at a rate of somewhere around 10kB/sec).

whyrusleeping commented 5 years ago

@maznu are you sure it's leaking memory? Go is a garbage-collected language, which means memory usage will appear to increase until a GC event. After a GC event, memory doesn't necessarily get released back to the OS, but internally the previously allocated memory will be reused.

How are you measuring this?
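One way to tell the two cases apart is to look at the Go runtime's own counters. The sketch below assumes you already have a `memstats` object with the standard `runtime.MemStats` fields (`HeapInuse`, `HeapIdle`, `HeapReleased`), for example from Go's expvar output; whether and where the daemon exposes that output (I believe at /debug/vars on the API port) is an assumption here. `HeapIdle` minus `HeapReleased` estimates memory the runtime is holding on to without using it.

```python
def summarize(memstats):
    """Split Go heap numbers into live, retained-but-idle, and released bytes."""
    in_use = memstats["HeapInuse"]
    released = memstats["HeapReleased"]
    retained_idle = memstats["HeapIdle"] - released
    return {"in_use": in_use, "retained_idle": retained_idle, "released": released}


# Made-up numbers, only to show the shape of the calculation.
example = {
    "HeapInuse": 300 * 2**20,
    "HeapIdle": 500 * 2**20,
    "HeapReleased": 50 * 2**20,
}
for key, value in summarize(example).items():
    print(f"{key}: {value / 2**20:.0f} MiB")
```

If `in_use` keeps climbing over time, that points to a real leak; if it stays flat while `retained_idle` is large, the process is mostly holding memory from an earlier spike.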

EugeneChung commented 5 years ago

Still leaking memory in 0.4.18, between 0-100kB/sec (averaging at a rate of somewhere around 10kB/sec).

https://golangcode.com/print-the-current-memory-usage/

Using this periodically, you can gather memory usage over several days. With a graphing tool like Microsoft Excel, you can then check the trend in memory usage.
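The linked snippet reads the counters from inside a Go program. For a daemon that is already running, a rough alternative is to sample its resident set size from outside. The sketch below is Linux-only, reads `VmRSS` from `/proc/<pid>/status`, and appends rows to a CSV file that a spreadsheet can plot; the function names and CSV layout are my own choices.

```python
import csv
import re
import time


def parse_vm_rss_kib(status_text):
    """Return the VmRSS value (in kB) from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def sample_rss(pid, out_path, interval=60.0, count=None):
    """Append (unix time, RSS in kB) rows to a CSV file every `interval` seconds."""
    taken = 0
    with open(out_path, "a", newline="") as f:
        writer = csv.writer(f)
        # With count=None this loops until interrupted or the process exits.
        while count is None or taken < count:
            with open(f"/proc/{pid}/status") as status:
                rss = parse_vm_rss_kib(status.read())
            writer.writerow([int(time.time()), rss])
            f.flush()
            taken += 1
            if count is None or taken < count:
                time.sleep(interval)
```

For example, `sample_rss(12345, "ipfs-rss.csv")` would log one row per minute for the (hypothetical) daemon PID 12345 until interrupted.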

maznu commented 5 years ago

Several days? It's eating up all the RAM on a 1GB VPS (and then being killed by the kernel OOM killer) within eight hours.

You can see there that there is garbage collection and freeing back to the OS — plenty of green spikes within that orange lump of usage — but fundamentally it just continues to grow.

paralin commented 5 years ago

Can someone with bad memory usage please grab a memory trace?

alexkursell commented 5 years ago

Can someone with bad memory usage please grab a memory trace?

I am experiencing this issue using go-ipfs 0.4.19: https://ipfs.io/ipfs/QmSkYDJV1BJeLm2uEBqnshcmBRb1LMPPPxdBsUrGDNGv8J

For me it takes ~2 days for the daemon to exhaust 1GB of memory and get OOM killed.

Stebalien commented 5 years ago

@alexkursell I'm only seeing ~30MiB of memory usage on the heap. Unfortunately, I can't seem to download the goroutine stack traces.

When you grabbed that memory dump, how much memory was go-ipfs using at that point in time?

whyrusleeping commented 5 years ago

The biggest problem I'm seeing with memory usage lately isn't that ipfs always uses a lot of memory; it's that usage randomly spikes, and Go will pretty much never release that memory.

To debug this further, I would put a memory limit on the ipfs process (say, 1GB) so that it panics when the memory spikes, and we can then figure out what the problem is.
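One possible way to apply such a limit is sketched below, using an address-space rlimit set in the child process before it starts. Note the assumptions: `RLIMIT_AS` caps virtual address space rather than resident memory, so it may trigger earlier than an RSS-based limit would, and `run_with_memory_limit` is a hypothetical helper name.

```python
import resource
import subprocess


def run_with_memory_limit(cmd, limit_bytes, **popen_kwargs):
    """Start `cmd` with its address-space soft limit capped at `limit_bytes`."""

    def apply_limit():
        # Runs in the child just before exec; keeps the existing hard limit.
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        soft = limit_bytes
        if hard != resource.RLIM_INFINITY:
            soft = min(soft, hard)
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))

    return subprocess.Popen(cmd, preexec_fn=apply_limit, **popen_kwargs)
```

Something like `run_with_memory_limit(["ipfs", "daemon"], 2**30).wait()` would then start the daemon with roughly a 1GB cap; the expectation, per the comment above, is that the Go runtime aborts when an allocation fails, leaving output that can be inspected.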

alexkursell commented 5 years ago

@Stebalien I've grabbed a new set of diagnostics, along with the output of top: https://ipfs.io/ipfs/QmVB4s9Eu1XYxbikuzQix6SGUoDtqS46oyPJFanWLRMwV5 At the time this was taken, it looks like the daemon was using around 750MB.

marrub-- commented 5 years ago

I was able to run an ipfs node just fine for a while but it's started taxing my server so much it's impossible to continue using. It would be fine even if it used a gigabyte, but it continues eating more and more memory until the server simply crashes.

Stebalien commented 5 years ago

@alexkursell

Go is "only" using about 300MiB of heap memory, so it looks like memory usage spiked at some point and Go never returned the memory.

The largest actual memory users appear to be:

Provider records. This issue should be fixed in the latest go-ipfs master.
The peerstore (information about peers we've seen). We have a PR (#6080) for putting this on-disk but I'd like to do a bit of testing before we land that. I'm also seeing https://github.com/libp2p/go-libp2p-peerstore/issues/68.

kaysond commented 5 years ago

+1. I just set up a node on an Ubuntu 19.04 vps, and it died after about a day. I'll try the latest master and see if that fixes it.

whyrusleeping commented 5 years ago

@kaysond (and others) when your nodes die due to running out of memory, can you please send us the stack traces? It will help us track down what's causing the memory spikes.

kaysond commented 5 years ago

I built from the latest source, and it seems to have grown steadily then leveled off at around 600MB overnight.

kaysond commented 5 years ago

@whyrusleeping after a few days it looks like it settled at a solid 1GB of RAM. I've attached all the dumps per the debug guide: memdebug.tar.gz

Stebalien commented 5 years ago

@kaysond

It looks like that memory is:

The peerstore (fix in https://github.com/ipfs/go-ipfs/pull/6080).
Bandwidth metric tracking. Unfortunately, we never forget old peers. You can disable bandwidth tracking by running ipfs config --json Swarm.DisableBandwidthMetrics true.

kaysond commented 5 years ago

@Stebalien thanks. I'll add that to my config and see how much it helps. Is there a plan to implement said "forgetting"?

Stebalien commented 5 years ago

@kaysond not yet but it looks like we'll have to do that at some point. I've never seen that show up in a heap trace. You must have connected to ~0.5M (estimated) unique peers over the course of a few days.

I've filed an issue (https://github.com/libp2p/go-libp2p-metrics/issues/17) but it's unlikely to be a priority given that most systems connecting to that many peers have quite a bit of memory (unless that was entirely DHT traffic...).

That brings up a good point. If you're memory constrained, try running the daemon with --routing=dhtclient.

kaysond commented 5 years ago

I set up a node mainly to serve a single website from ipfs, so the less memory it uses the cheaper my VPS can be.

I'm skeptical that the site draws that much traffic... so I guess it's just the nature of being connected to the swarm? The node isn't exactly a public gateway, so I'm not sure what caused all of the connections.

I'll try it with that option and see what happens.

Stebalien commented 5 years ago

The node isn't exactly a public gateway, so I'm not sure what caused all of the connections.

Probably the DHT.

mkg20001 commented 4 years ago

Any updates on this?

mkg20001 commented 4 years ago

Btw, the command to disable bandwidth metrics didn't work anymore, the new one is ipfs config --bool Swarm.DisableBandwidthMetrics true

Is it even needed anymore?

kaysond commented 4 years ago

With the command /usr/local/bin/ipfs daemon --enable-gc --routing=dhtclient, after several weeks my node has settled at around 500MB of RAM.

mkg20001 commented 4 years ago

@kaysond Used that command. This + Swarm.DisableBandwidthMetrics works, thx

Stebalien commented 3 years ago

The remaining issue is https://github.com/ipfs/go-ipfs/issues/2848. Closing this one as it's quite old.
