restatedev / restate

Restate is the platform for building resilient applications that tolerate all infrastructure faults w/o the need for a PhD.
https://docs.restate.dev

Restate RocksDB log files sometimes hang around for a long time pushing disk usage up #2216

Open pcholakov opened 1 week ago

pcholakov commented 1 week ago

We've observed an idle environment in Restate Cloud consuming hundreds of MB of disk, even after a restart. The files look like this:

> kubectl exec restate-0 -- find /restate-data -name '*LOG.old*' -exec ls -lh {} \;
-rw-r--r--. 1 1000 2000 197M Oct  3 10:49 /restate-data/restate-0/local-metadata-store/LOG.old.1727952596344409
-rw-r--r--. 1 1000 2000 209M Oct  3 10:49 /restate-data/restate-0/db/LOG.old.1727952596237953
-rw-r--r--. 1 1000 2000 203M Oct  3 10:49 /restate-data/restate-0/local-loglet/LOG.old.1727952596692048

Restarting the process clears the original files but new ones remain:

> kubectl exec restate-0 -- find /restate-data -name '*LOG.old*' -exec ls -lh {} \;
-rw-r--r--. 1 1000 2000 78M Nov  3 07:48 /restate-data/restate-0/local-metadata-store/LOG.old.1730620202653580
-rw-r--r--. 1 1000 2000 84M Nov  3 07:48 /restate-data/restate-0/db/LOG.old.1730620202546276
-rw-r--r--. 1 1000 2000 81M Nov  3 07:48 /restate-data/restate-0/local-loglet/LOG.old.1730620202879822

> kubectl exec restate-0 -- du -csh /restate-data/
242M    /restate-data/
242M    total

Same environment, a couple of days later:

> kubectl exec restate-0 -- find /restate-data -name '*LOG.old*' -exec ls -lh {} \;
-rw-r--r--. 1 1000 2000 78M Nov  3 07:48 /restate-data/restate-0/local-metadata-store/LOG.old.1730620202653580
-rw-r--r--. 1 1000 2000 84M Nov  3 07:48 /restate-data/restate-0/db/LOG.old.1730620202546276
-rw-r--r--. 1 1000 2000 81M Nov  3 07:48 /restate-data/restate-0/local-loglet/LOG.old.1730620202879822

> kubectl exec restate-0 -- du -csh /restate-data/
260M    /restate-data/
260M    total

Note the files remain unchanged.

AhmedSoliman commented 6 days ago

Do you also see any lingering SSTs that get removed after restart?

jackkleeman commented 5 days ago

Before restart:

-rw-r--r--. 1 1000 2000 2.8K Nov  3 07:50 /restate-data/restate-0/local-metadata-store/000044.sst
-rw-r--r--. 1 1000 2000 1.2K Nov  3 07:50 /restate-data/restate-0/db/000091.sst
-rw-r--r--. 1 1000 2000 1.2K Nov  3 07:48 /restate-data/restate-0/db/000080.sst
-rw-r--r--. 1 1000 2000 1.2K Nov  3 07:48 /restate-data/restate-0/db/000082.sst
-rw-r--r--. 1 1000 2000 1.2K Nov  3 07:48 /restate-data/restate-0/db/000084.sst
-rw-r--r--. 1 1000 2000 1.4K Nov  3 07:50 /restate-data/restate-0/local-loglet/000067.sst
-rw-r--r--. 1 1000 2000 9.7K Nov  3 07:50 /restate-data/restate-0/local-loglet/000068.sst

After restart:

-rw-r--r--. 1 1000 2000 2.8K Nov  3 07:50 /restate-data/restate-0/local-metadata-store/000044.sst
-rw-r--r--. 1 1000 2000 2.0K Nov  7 09:24 /restate-data/restate-0/local-metadata-store/000047.sst
-rw-r--r--. 1 1000 2000 1.2K Nov  3 07:50 /restate-data/restate-0/db/000091.sst
-rw-r--r--. 1 1000 2000 1.3K Nov  7 09:24 /restate-data/restate-0/db/000092.sst
-rw-r--r--. 1 1000 2000 1.3K Nov  7 09:24 /restate-data/restate-0/db/000093.sst
-rw-r--r--. 1 1000 2000 1.3K Nov  7 09:24 /restate-data/restate-0/db/000094.sst
-rw-r--r--. 1 1000 2000 1.3K Nov  7 09:24 /restate-data/restate-0/db/000095.sst
-rw-r--r--. 1 1000 2000 1.2K Nov  3 07:48 /restate-data/restate-0/db/000080.sst
-rw-r--r--. 1 1000 2000 1.2K Nov  3 07:48 /restate-data/restate-0/db/000082.sst
-rw-r--r--. 1 1000 2000 1.2K Nov  3 07:48 /restate-data/restate-0/db/000084.sst
-rw-r--r--. 1 1000 2000 1.4K Nov  3 07:50 /restate-data/restate-0/local-loglet/000067.sst
-rw-r--r--. 1 1000 2000 9.7K Nov  3 07:50 /restate-data/restate-0/local-loglet/000068.sst
-rw-r--r--. 1 1000 2000 1.5K Nov  7 09:24 /restate-data/restate-0/local-loglet/000070.sst
-rw-r--r--. 1 1000 2000 3.8K Nov  7 09:24 /restate-data/restate-0/local-loglet/000073.sst

So I don't see any SSTs being removed, only new ones being added.

AhmedSoliman commented 5 days ago

Apologies, this is not the WAL; these are the RocksDB info log files. I will send a fix to reduce logging and keep only a single log file.

AhmedSoliman commented 4 days ago

We need something like this in db_manager.rs

// keep 1 old log file by default to save space.
db_options.set_keep_log_file_num(1);
// 64MB info-log files by default
db_options.set_max_log_file_size(64_000_000);

We also need to make those values and the log level configurable through the configuration file, and the log level should react to live config updates.
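A minimal, std-only sketch of what that configuration surface could look like. Everything named here (`InfoLogSettings`, `InfoLogLevel`, `LiveLogLevel`, `approx_max_bytes`) is hypothetical and is not Restate's actual config schema. The defaults mirror the values proposed above, and the default level of `Info` is an arbitrary placeholder. The size estimate assumes each database retains at most `keep_log_file_num` info log files of at most `max_log_file_size` bytes each, which is a simplification of how RocksDB rolls and purges these files.

```rust
use std::sync::atomic::{AtomicU8, Ordering};
use std::sync::Arc;

/// Hypothetical info-log verbosity levels (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
enum InfoLogLevel {
    Debug = 0,
    Info = 1,
    Warn = 2,
    Error = 3,
}

impl InfoLogLevel {
    /// Decode a stored level; unknown values fall back to the quietest level.
    fn from_u8(v: u8) -> Self {
        match v {
            0 => Self::Debug,
            1 => Self::Info,
            2 => Self::Warn,
            _ => Self::Error,
        }
    }
}

/// Hypothetical settings that a configuration file could populate.
#[derive(Debug, Clone, PartialEq, Eq)]
struct InfoLogSettings {
    keep_log_file_num: usize,
    max_log_file_size: usize,
    level: InfoLogLevel,
}

impl Default for InfoLogSettings {
    fn default() -> Self {
        Self {
            keep_log_file_num: 1,          // value proposed in this thread
            max_log_file_size: 64_000_000, // 64MB, as proposed in this thread
            level: InfoLogLevel::Info,     // assumption: placeholder default
        }
    }
}

impl InfoLogSettings {
    /// Rough upper bound on info-log disk usage across `db_count` databases,
    /// assuming each keeps at most `keep_log_file_num` files of at most
    /// `max_log_file_size` bytes.
    fn approx_max_bytes(&self, db_count: usize) -> usize {
        self.keep_log_file_num
            .saturating_mul(self.max_log_file_size)
            .saturating_mul(db_count)
    }
}

/// Shared handle that a config watcher could update without reopening the
/// database; whoever applies the level to RocksDB would read it from here.
#[derive(Clone)]
struct LiveLogLevel(Arc<AtomicU8>);

impl LiveLogLevel {
    fn new(level: InfoLogLevel) -> Self {
        Self(Arc::new(AtomicU8::new(level as u8)))
    }

    fn set(&self, level: InfoLogLevel) {
        self.0.store(level as u8, Ordering::Relaxed);
    }

    fn get(&self) -> InfoLogLevel {
        InfoLogLevel::from_u8(self.0.load(Ordering::Relaxed))
    }
}
```

With these defaults and the three databases seen in this issue (`local-metadata-store`, `db`, `local-loglet`), `InfoLogSettings::default().approx_max_bytes(3)` evaluates to 192,000,000 bytes under the stated assumption. The atomic handle is one possible way to let a live config update change the level; wiring it into RocksDB itself is left out of this sketch.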