ipfs / kubo

An IPFS implementation in Go
https://docs.ipfs.tech/how-to/command-line-quick-start/

requesting node metrics from the API at short intervals causes high CPU load #7528

Open RubenKelevra opened 4 years ago

RubenKelevra commented 4 years ago

Version information:

go-ipfs version: 0.7.0-dev
Repo version: 10
System version: amd64/linux
Golang version: go1.14.4

master@7ce1d751f

Description:

I'm running ipfs on a new server with SSD storage. I'm writing a lot of individual files with ipfs add --chunker 'buzhash' --cid-version 1 --hash 'blake2b-256' to the node, copying them to the right location in the MFS, and unpinning them again (since ipfs files write doesn't support setting a non-standard chunker).
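A minimal sketch of that workflow for a single file (-Q prints just the resulting CID; the file name and MFS path are illustrative):

CID=$(ipfs add -Q --chunker 'buzhash' --cid-version 1 --hash 'blake2b-256' ./example.dat)
ipfs files mkdir -p /imports
ipfs files cp /ipfs/"$CID" /imports/example.dat
ipfs pin rm "$CID"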

Afterwards, the CID of the MFS folder is pinned on ipfs-cluster, which runs on the same node.

ipfs-cluster shows that all cluster pins that are part of the pinset are pinned locally.

Another remote server also has all pins of the cluster pinset pinned; two other servers are still catching up, so they are receiving blocks from the local node.

The low bandwidth use while the node should be sending a somewhat large folder to two other nodes brought a possible issue to my attention: the outgoing network speed was shown as around 4 MBit/s, which is extremely slow for a server that is basically doing nothing else.

The CPU usage (around 200%) is extremely high relative to the network usage, so I thought it might still be publishing CIDs, and went to sleep.

System specs: 4 dedicated cores for the VM from an AMD EPYC 7702P 64-Core Processor; 16 GB of memory.

There are no background tasks running, just ipfs and ipfs-cluster; ipfs-cluster uses practically no CPU resources at all.

I tried changing the DHT type to dhtclient, but this resulted in no change. Restarting the service also resulted in no change; the CPU usage just jumps back up to around 200%.

Attached are the debug data (I forgot to collect the last ones) and the binary, since it's built from master. If I read the CPU profile right, a lot of CPU time is spent in go-ds-badger and go-ipfs-blockstore and the functions called by them (flame graph). The debug data was collected a few minutes after a restart of the IPFS daemon, while the ipfs-cluster-service was turned off.

debug.tar.gz
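For reference, a sketch of how such a profile can be inspected with go tool pprof, assuming the archive contains the daemon binary and a CPU capture named cpu.pprof (the file names inside the archive are assumptions):

tar xzf debug.tar.gz
go tool pprof -top ipfs cpu.pprof         # text summary, hottest functions first
go tool pprof -http=:8080 ipfs cpu.pprof  # interactive view, including the flame graph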

Here are some performance numbers collected on the system, which basically show no difference in load even though there's only very low network traffic.

[16 screenshots of system monitoring graphs, captured 2020-07-08]

Config

DisableBandwidthMetrics and DisableNatPortMap are true; EnableAutoRelay and EnableRelayHop are false. I use the server profile, and Routing.Type is dhtclient. I use badgerds; StorageGCWatermark is 90 and StorageMax is 280GB.
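For reference, a sketch of how those settings map onto the configuration via the CLI (key names per the go-ipfs config of that era; values as described above):

ipfs config --json Swarm.DisableBandwidthMetrics true
ipfs config --json Swarm.DisableNatPortMap true
ipfs config --json Swarm.EnableAutoRelay false
ipfs config --json Swarm.EnableRelayHop false
ipfs config Routing.Type dhtclient
ipfs config Datastore.StorageMax 280GB
ipfs config --json Datastore.StorageGCWatermark 90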

$ ipfs repo stat
NumObjects: 610148
RepoSize:   98410186788
StorageMax: 280000000000
RepoPath:   /var/lib/ipfs
Version:    fs-repo@10

I use the systemd-hardening.service file from the repo, but changed the ExecStart to

/usr/bin/ipfs daemon --enable-gc --enable-pubsub-experiment --enable-namesys-pubsub
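One way to apply such a change without editing the unit file itself is a systemd drop-in (a sketch, assuming the unit is named ipfs.service; not necessarily what was done here):

sudo mkdir -p /etc/systemd/system/ipfs.service.d
sudo tee /etc/systemd/system/ipfs.service.d/override.conf <<'EOF'
[Service]
ExecStart=
ExecStart=/usr/bin/ipfs daemon --enable-gc --enable-pubsub-experiment --enable-namesys-pubsub
EOF
sudo systemctl daemon-reload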

RubenKelevra commented 4 years ago

Okay, I found the reason:

Netdata is polling the object count, the repo size, and the peer list from the IPFS node via the API. IPFS doesn't seem to cache these values and update them when they change (a write-through cache strategy).

Since Netdata polls metrics quite often, this causes the issue. As a temporary workaround, the Netdata plugin for IPFS can be configured to use a longer data collection interval...
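For reference, the polling boils down to API calls like these (a sketch against the default API address; since go-ipfs 0.5 the API accepts only POST requests):

curl -s -X POST http://127.0.0.1:5001/api/v0/repo/stat    # NumObjects + RepoSize
curl -s -X POST http://127.0.0.1:5001/api/v0/swarm/peers  # full peer list

Counting the objects for repo/stat means enumerating the datastore on every request, which is presumably where the CPU time goes.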

So this turns into an improvement request: polling the API for those metrics shouldn't cause large CPU loads.

Stebalien commented 4 years ago

The repo size is memoized, the number of objects is not. Try polling ipfs repo stat --size-only.
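A sketch of the cheaper poll (--size-only reports only RepoSize and StorageMax and skips the object count; the 30-second interval is illustrative):

ipfs repo stat --size-only
watch -n 30 ipfs repo stat --size-only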

thienpow commented 4 years ago

Polling for the peer count by requesting the full list of peers doesn't make sense either:

async function peerCount(ipfs) {
  const peerInfos = await ipfs.swarm.peers({ timeout: 2500 })
  return peerInfos.length
}
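A CLI equivalent for a quick count (this still makes the node assemble the full peer list; it just counts the one-peer-per-line output client-side):

ipfs swarm peers | wc -l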