After fixing a race condition when reading values from the DB (solved here), we realized that the crawler had become unstable in terms of memory usage.
The crawler keeps accumulating memory from the system until it crashes, typically after 6-8 hours of crawling.
The same happens with disk usage: after those 6-8 hours of crawling, the BoltDB file grows to 15GB on disk, even though the actual content of the DB is only around 4MB.
While looking into BoltDB, we realized that Prysm uses a fork of BoltDB, whereas we were using the original repo. Switching to the fork did not solve the issue either.
After reading up on how BoltDB works and tracing the code, we saw that the DB accumulates a large number of free pages in the file, which makes the disk usage grow. As a side effect, the large file size seems to drive the memory heap up as well (BoltDB memory-maps the whole file, which would explain the correlation). We should consider compacting the DB periodically as Prysm does; however, our use case is not exactly the same as Prysm's.
For now, we have switched the default DB to an in-memory one to check the stability of the tool. Any suggestions for alternative DBs or ways to fix the problem are welcome.
You mention a race condition. Can you elaborate? I haven't used boltdb much, but I thought it had a lock or mutex under the hood to prevent this sort of thing.
Regarding the DB taking up 15 GB: I guess we are using it wrong, so it shouldn't be difficult to fix imho.
Yep, let me explain both points a bit more in depth:
Regarding the race condition: apparently, keeping a reference to a value read inside the callback of one of the DB methods can cause a segmentation violation if it is used after the transaction closes. It is an edge case explained in boltdb/bolt#204.
The workaround for this edge case is described in this comment and works in our implementation.
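For context, the general shape of the workaround is to copy the bytes out of the memory-mapped region before the transaction closes, instead of keeping the slice that `Get` returns. A minimal sketch, assuming the `go.etcd.io/bbolt` import (the `peers` bucket and key names are placeholders, not our actual schema):

```go
package main

import (
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

// readValue returns a copy of the value stored under key in bucket.
// Slices returned by Bucket.Get point directly into BoltDB's
// memory-mapped file and are only valid while the transaction is
// open, so the value must be copied before View returns.
func readValue(db *bolt.DB, bucket, key []byte) ([]byte, error) {
	var out []byte
	err := db.View(func(tx *bolt.Tx) error {
		b := tx.Bucket(bucket)
		if b == nil {
			return fmt.Errorf("bucket %q not found", bucket)
		}
		if v := b.Get(key); v != nil {
			out = make([]byte, len(v))
			copy(out, v) // copy while the mmap is still valid
		}
		return nil
	})
	return out, err
}

func main() {
	// Hypothetical file/bucket/key, just to show usage.
	db, err := bolt.Open("crawler.db", 0o600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	v, err := readValue(db, []byte("peers"), []byte("peer-id"))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("value: %s\n", v)
}
```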
Regarding the high disk usage: we might need to compact the DB periodically to reduce the number of free pages allocated in the file. However, looking at Prysm's bbolt usage, we couldn't spot any major difference from our own, except for the use case itself: Prysm generally writes the content once, and subsequent interactions with the DB are mostly reads (which could explain why they don't suffer from DB bloat due to free pages).
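If we do go the compaction route, bbolt ships a `Compact` helper that copies only the live data into a fresh file. A rough sketch of what periodic compaction could look like, assuming the crawler can close the DB around the compaction window (the file name and hourly schedule are made up for illustration):

```go
package main

import (
	"log"
	"os"
	"time"

	bolt "go.etcd.io/bbolt"
)

// compactDB copies all live data from the BoltDB file at path into a
// fresh file and swaps it in, dropping the free pages that inflate
// the original file. The DB must not be open elsewhere while this
// runs. A txMaxSize of 0 tells Compact to use a single transaction.
func compactDB(path string) error {
	src, err := bolt.Open(path, 0o600, nil)
	if err != nil {
		return err
	}
	tmp := path + ".compact"
	dst, err := bolt.Open(tmp, 0o600, nil)
	if err != nil {
		src.Close()
		return err
	}
	// Only live pages are written, so the copy ends up close to the
	// real data size rather than the bloated file size.
	if err := bolt.Compact(dst, src, 0); err != nil {
		dst.Close()
		src.Close()
		return err
	}
	dst.Close()
	src.Close()
	// Swap the compacted copy in place of the bloated original.
	return os.Rename(tmp, path)
}

func main() {
	// Hypothetical schedule; coordination with the crawler (making
	// sure the DB is closed during compaction) is omitted here.
	for range time.Tick(time.Hour) {
		if err := compactDB("crawler.db"); err != nil {
			log.Printf("compaction failed: %v", err)
		}
	}
}
```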