radixdlt / olympia-node

Radix monorepo

Checkpointer writing an insane amount of data to disk #513

Open Mattiabe98 opened 3 years ago

Mattiabe98 commented 3 years ago

image

As you can see from the picture, this is the amount of data that the Radixdlt node software wrote to disk in just 5 minutes. 8 GB in 5 minutes is 96 GB/hour, or roughly 2.3 TB/day. The writes appear to come in bursts; you can see this behavior in this gif.
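The extrapolation above is plain arithmetic. A minimal sketch, assuming decimal units (1 TB = 1000 GB) and that the 5-minute sample is representative of the whole day despite the bursts:

```python
# Extrapolate a short write sample to hourly and daily totals.
# Assumption: the bursty writes average out to a steady rate over time.
SAMPLE_GB = 8        # data written during the sample window (from the screenshot)
SAMPLE_MINUTES = 5   # length of the sample window


def extrapolate(sample_gb, sample_minutes):
    """Return (GB per hour, GB per day) for a sample of the given size and length."""
    gb_per_hour = sample_gb * 60 / sample_minutes
    gb_per_day = gb_per_hour * 24
    return gb_per_hour, gb_per_day


gb_per_hour, gb_per_day = extrapolate(SAMPLE_GB, SAMPLE_MINUTES)
print(f"{gb_per_hour:.0f} GB/hour, {gb_per_day / 1000:.1f} TB/day")
```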

The writes also appear to correlate strongly with CPU usage (the graphs show writes, even though the legend says reads):

image image

This is the SMART report of a new drive that has been in use for just a week on a backup node for my Radix validator: 3% wearout in a week.

image
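To put the wearout figure in perspective, here is a rough sketch that projects the drive's remaining life. It assumes wear grows linearly at the observed 3% per week and that 100% on the SMART wearout indicator marks the end of the rated endurance; real drives may behave differently.

```python
# Project drive lifetime from the SMART wearout percentage.
# Assumptions (not from the SMART report itself): wear is linear in time,
# and 100% wearout corresponds to the end of rated endurance.
WEAR_PERCENT_PER_WEEK = 3.0  # observed after one week of use


def weeks_until_worn_out(wear_percent_per_week, already_used_percent=0.0):
    """Weeks left until the wearout indicator reaches 100% at a constant rate."""
    return (100.0 - already_used_percent) / wear_percent_per_week


total_weeks = weeks_until_worn_out(WEAR_PERCENT_PER_WEEK)
remaining_weeks = weeks_until_worn_out(WEAR_PERCENT_PER_WEEK, already_used_percent=3.0)
print(f"Total life at this rate: about {total_weeks:.0f} weeks")
print(f"Remaining after the first week: about {remaining_weeks:.0f} weeks")
```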

Running fatrace -c -f W for just a few seconds shows lots of writes to the .jdb files.

image
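To turn a fatrace capture into per-file numbers, the output can be tallied with a short script. This is a sketch that assumes each line has the shape `process(pid): TYPES /path`; the sample lines and file names below are hypothetical, not taken from the screenshot.

```python
import re
from collections import Counter

# Assumed fatrace line shape: "<process>(<pid>): <event types> <path>"
LINE_RE = re.compile(
    r"^(?P<proc>.+)\((?P<pid>\d+)\):\s+(?P<types>[A-Z+<>]+)\s+(?P<path>.+)$"
)


def count_writes_by_file(lines):
    """Count write events per path, skipping lines that do not match."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and "W" in match.group("types"):
            counts[match.group("path")] += 1
    return counts


# Hypothetical sample input; real input would come from the fatrace output.
sample = [
    "java(1234): W /data/00000a1b.jdb",
    "java(1234): W /data/00000a1b.jdb",
    "java(1234): W /data/00000a1c.jdb",
    "java(1234): R /data/00000a1c.jdb",
    "not a fatrace line",
]

for path, hits in count_writes_by_file(sample).most_common():
    print(f"{hits:5d}  {path}")
```

Counting events only shows which files are touched most often; it does not measure how many bytes each write carries.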

This is the status of my data folder with the Radix DB.

image

Following the Radix docs, I created a je.properties file (before the first sync, and never changed it afterwards).

root@radix-backup:/data# cat je.properties 
# Set the log file size to 1Gb each (Default: 100Mb)
je.log.fileMax=1073741824
# Run the checkpointer every ~250Mb of data (Default: 20Mb)
je.checkpointer.bytesInterval=250000000
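As a sanity check on these settings, the sketch below parses the file and relates the checkpoint interval to the observed write volume. It assumes, as a rough estimate only, that every byte written to disk counts toward je.checkpointer.bytesInterval; the parser handles only simple `key=value` lines and `#` comments.

```python
# Parse the je.properties shown above and estimate how often the
# checkpointer interval would be reached at the observed write rate.
JE_PROPERTIES = """\
# Set the log file size to 1Gb each (Default: 100Mb)
je.log.fileMax=1073741824
# Run the checkpointer every ~250Mb of data (Default: 20Mb)
je.checkpointer.bytesInterval=250000000
"""


def parse_properties(text):
    """Minimal parser: key=value pairs, blank lines and # comments ignored."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


props = parse_properties(JE_PROPERTIES)
log_file_max = int(props["je.log.fileMax"])
checkpoint_interval = int(props["je.checkpointer.bytesInterval"])

# Observed in the report: about 8 GB written in 5 minutes (decimal GB assumed).
observed_bytes = 8 * 10**9
observed_seconds = 5 * 60

# Rough estimate, assuming all written bytes count toward the interval.
intervals_reached = observed_bytes / checkpoint_interval
seconds_per_interval = observed_seconds / intervals_reached

print(f"log file max: {log_file_max / 2**20:.0f} MiB")
print(f"checkpoint interval: {checkpoint_interval / 10**6:.0f} MB")
print(f"interval reached about {intervals_reached:.0f} times in 5 minutes")
print(f"that is roughly once every {seconds_per_interval:.1f} seconds")
```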

radixSSD.zip

The attached radixSSD.zip file contains my je.properties, je.stat, je.info, and je.config files for troubleshooting. I've had this issue on two different servers, one self-hosted and one at Hetzner, both dedicated servers running Proxmox with the Radix software inside LXC containers. Disabling the archive endpoint has no effect on the disk I/O. I also tried changing the log level to debug to get more information, but nothing useful showed up. Changing it to error made no difference either.

If any other information is needed, please don't hesitate to ask. Thank you.

Mattiabe98 commented 3 years ago

I can easily reproduce the issue on two setups (main and failover node) by simply switching them to "validator" mode (making them use a validator's keystore). In "fullnode" mode, when the node is not validating, I don't see the huge amount of disk writes. Could this help narrow down the issue?