crusty-dave opened this issue 4 years ago
Using batches does result in higher storage overhead until a later disk GC pass happens. From a recent conversation on the Discord:

> Batches in sled are very slightly less efficient because they must communicate additional atomic recovery metadata, but this is just 15 extra bytes in the log. However, because a batch is atomic, the on-disk segments being written to during the batch may not be garbage collected until the batch completes. For huge batches that may explain some extra space usage.

> sled aggressively batches writes internally anyway, so you don't really gain any performance by using them. They exist purely to communicate atomicity in the presence of crashes.
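To make the distinction concrete, here is a minimal sketch of the two write paths, assuming the sled 0.31-era `insert`/`apply_batch` API; the `write_pairs` function and its signature are illustrative, not from the report:

```rust
use sled::Batch;

// Sketch: individual inserts vs. one atomic batch (assumed sled 0.31 API).
fn write_pairs(db: &sled::Db, pairs: &[(Vec<u8>, Vec<u8>)]) -> sled::Result<()> {
    // Individual inserts: sled already coalesces these into batched log
    // writes internally, so throughput is similar, but there is no
    // crash-atomicity across the group.
    for (k, v) in pairs {
        db.insert(k.as_slice(), v.as_slice())?;
    }

    // Atomic batch: costs ~15 extra bytes of recovery metadata in the log,
    // and the segments written during the batch stay pinned (not GC-able)
    // until the whole batch completes.
    let mut batch = Batch::default();
    for (k, v) in pairs {
        batch.insert(k.clone(), v.clone());
    }
    db.apply_batch(batch)
}
```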
- expected result: 2.26 GB
- actual result: 3.46 GB
- sled version: 0.31.0
- rustc version: 1.40.0
- operating system: Windows 10
The database started with the following statistics (the key came from one field in the JSON data):
After re-keying the database, disk usage increased by a considerable amount, despite the key size being reduced:
Note that some data had been duplicated due to mixed keys; the re-keying removed those duplicate entries.
The re-keying was done in a loop using the following algorithm:
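A minimal sketch of such a re-keying loop against the sled 0.31 `Batch` API might look like this; the `derive_new_key` helper and the chunk size are illustrative assumptions, not the actual code:

```rust
use sled::{Batch, IVec};

// Assumed chunk size for illustration; the original batch sizing is unknown.
const BATCH_SIZE: usize = 10_000;

// Hypothetical helper: derives the new (shorter) key from a field of the
// stored JSON value. Shown as a placeholder only.
fn derive_new_key(value: &IVec) -> Vec<u8> {
    value.iter().take(8).copied().collect()
}

fn rekey(db: &sled::Db) -> sled::Result<()> {
    let mut batch = Batch::default();
    let mut pending = 0;
    for entry in db.iter() {
        let (old_key, value) = entry?;
        let new_key = derive_new_key(&value);
        // Re-insert under the new key and remove the old entry atomically
        // within the same batch.
        batch.insert(new_key, value);
        batch.remove(old_key);
        pending += 1;
        if pending == BATCH_SIZE {
            // (Sketch only: applying batches mid-iteration may re-visit
            // newly inserted keys if they sort after the cursor.)
            db.apply_batch(std::mem::take(&mut batch))?;
            pending = 0;
        }
    }
    // Apply any remainder and flush to disk.
    db.apply_batch(batch)?;
    db.flush()?;
    Ok(())
}
```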
Note that no errors were detected.
I realize that performance and storage are areas of ongoing development, but I thought you might be interested in this data point. I don't currently see any tools to reclaim lost space.
Perhaps using a batch transaction was the wrong approach for this?