jaemk / cached

Rust cache structures and easy function memoization
MIT License

DiskCache blobs aren't cleaned on overwrite #206

Open RustyNova016 opened 7 months ago

RustyNova016 commented 7 months ago

I am using a DiskCache with an inner vector, which creates files in the blob folder of the cache. However, those blob files aren't removed or overwritten when I overwrite a cache entry. This leads to inflated storage usage that I'd rather avoid, since those structs are heavy.

Here's an example program to reproduce the bug:

use cached::{DiskCache, IOCached};
use serde::{Deserialize, Serialize};

#[derive(Debug, Serialize, Deserialize)]
struct BigStruct {
    pub id: String,
    pub data: Vec<Vec<u128>>
}

fn main() {
    let cache: DiskCache<String, BigStruct> = DiskCache::new("test").build().unwrap();

    // Fill the cache with lots of data
    for i in 0..100 {
        let big_struct = BigStruct {
            id: i.to_string(),
            data: (0..100).map(|j| {
                (0..100).map(|k| {
                    i + j + k
                }).collect()
            }).collect()
        };

        // cache_remove doesn't do anything either!
        cache.cache_remove(&big_struct.id).unwrap();
        cache.cache_set(big_struct.id.clone(), big_struct).unwrap();
    }
}
RustyNova016 commented 6 months ago

Update: DiskCacheBuilder::set_sync_to_disk_on_cache_change doesn't change anything.
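
For reference, a minimal sketch of what that configuration looks like, reusing the "test" cache and BigStruct from the example above; passing true here is an assumption about the setting that was tried:

use cached::DiskCache;

fn build_cache() -> DiskCache<String, BigStruct> {
    // Sketch: same "test" cache as above, but asking the cache to sync to disk on
    // every change (builder method named in the comment above; bool argument assumed).
    DiskCache::new("test")
        .set_sync_to_disk_on_cache_change(true)
        .build()
        .unwrap()
}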

ronanM commented 4 months ago

Running this test multiple times seems to reproduce the problem of accumulating files in the blobs folder of the cache:

use cached::proc_macro::io_cached;

// Minimal error type assumed for this snippet; the original test presumably defines its own.
// Debug + PartialEq are needed for the assert_eq below.
#[derive(Debug, PartialEq, Eq, Clone)]
enum TestError {
    DiskError(String),
}

#[io_cached(
    disk = true,
    time = 1000000,
    map_error = r##"|e| TestError::DiskError(format!("{:?}", e))"##,
    key = "u32",
    convert = r"{ n }",
)]
async fn cached_disk(n: u32) -> Result<Vec<u32>, TestError> {
    Ok((0..n).collect::<Vec<_>>())
}

#[tokio::test]
async fn test_cached_disk() {
    let n = 1_000_000;
    let expected = Ok((0..n).collect::<Vec<_>>());
    assert_eq!(cached_disk(n).await, expected);
    assert_eq!(cached_disk(n).await, expected);
}
adundovi commented 4 months ago

Unfortunately, this issue makes DiskCache effectively unusable, since the on-disk size just keeps growing.

jaemk commented 4 months ago

Thanks for pointing this out. I'm able to reproduce it as well. It looks like this is due to either a bug in, or the intended behavior of, the underlying sled::Db. It also looks like development on that project has slowed down. It would probably be best to change the backing store to something more reliable, like SQLite.

jaemk commented 4 months ago

There is a potential (ugly) workaround until the storage backend is swapped out. DiskCacheBuilder lets you set a specific disk directory via set_disk_directory, where each function's file cache is stored. It looks like sled keeps creating new copies of the blob data, with the file names being incrementing integers. You could scan this directory on startup: for each sub-directory (the name of the function in all caps plus _v1), scan its blobs/ directory, sort the files, and delete all but the latest one (either keep the largest integer name or check the file metadata for the most recent creation time).
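
In code, that startup cleanup might look roughly like the sketch below. The <FUNCTION_NAME>_v1/blobs/<integer> layout is taken from the description above; the prune_stale_blobs helper name and the choice to keep the file with the largest integer name are illustrative assumptions, not part of the library:

use std::fs;
use std::io;
use std::path::{Path, PathBuf};

// Delete all but the newest blob file in each `<FUNCTION_NAME>_v1/blobs/` sub-directory.
// `cache_root` is the directory passed to `DiskCacheBuilder::set_disk_directory`.
fn prune_stale_blobs(cache_root: &Path) -> io::Result<()> {
    for entry in fs::read_dir(cache_root)? {
        let blobs_dir = entry?.path().join("blobs");
        if !blobs_dir.is_dir() {
            continue;
        }
        // Collect blob files whose names are plain integers (sled's incremental naming).
        let mut blobs: Vec<(u64, PathBuf)> = fs::read_dir(&blobs_dir)?
            .filter_map(|e| e.ok())
            .filter_map(|e| {
                let path = e.path();
                let id = path.file_name()?.to_str()?.parse::<u64>().ok()?;
                Some((id, path))
            })
            .collect();
        // Keep only the file with the largest integer name (assumed to be the latest copy)
        // and delete the rest.
        blobs.sort_by_key(|(id, _)| *id);
        blobs.pop();
        for (_, stale) in blobs {
            fs::remove_file(stale)?;
        }
    }
    Ok(())
}

Calling something like this once on startup, before the DiskCache is opened, should keep the blobs folder from growing without bound.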

adundovi commented 4 months ago

> Thanks for pointing this out. I'm able to reproduce it as well. It looks like this is due to either a bug in, or the intended behavior of, the underlying sled::Db. It also looks like development on that project has slowed down. It would probably be best to change the backing store to something more reliable, like SQLite.

Thanks for the prompt reply.

It seems to me that sled::Db is working fine in the sense that it does speed up the function lookups, yet it duplicates the "blobs" as you mentioned. Regarding SQLite: it wouldn't be my first choice, since it's feature overkill for this purpose and not a particularly cargo-friendly dependency.

> There is a potential (ugly) workaround until the storage backend is swapped out. DiskCacheBuilder lets you set a specific disk directory via set_disk_directory, where each function's file cache is stored. It looks like sled keeps creating new copies of the blob data, with the file names being incrementing integers. You could scan this directory on startup: for each sub-directory (the name of the function in all caps plus _v1), scan its blobs/ directory, sort the files, and delete all but the latest one (either keep the largest integer name or check the file metadata for the most recent creation time).

I thought about something similar myself, but it is really an ugly workaround :-) I'll try.

ronanM commented 4 months ago

> It would probably be best to change the backing store to something more reliable, like SQLite.

RocksDB may be a better choice.

RustyNova016 commented 4 months ago

Personally, I've moved away from the sled-backed DiskCache on my end and now use a combination of cacache for the storage and serde_rmp to convert the data to binary.

Not sure if it's the best approach though, but once this issue is sorted I'm definitely switching back.
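
For anyone curious, here's a rough sketch of that combination, assuming the cacache crate for content-addressed on-disk storage and rmp-serde for the MessagePack serialization; the cache directory argument and the store/load helper names are illustrative:

use serde::{Deserialize, Serialize};

#[derive(Debug, Serialize, Deserialize)]
struct BigStruct {
    pub id: String,
    pub data: Vec<Vec<u128>>,
}

// Serialize to MessagePack with rmp-serde, then let cacache manage the on-disk blobs.
fn store(cache_dir: &str, value: &BigStruct) -> Result<(), Box<dyn std::error::Error>> {
    let bytes = rmp_serde::to_vec(value)?;
    cacache::write_sync(cache_dir, &value.id, bytes)?;
    Ok(())
}

// Look the bytes up by key and deserialize them back into the struct.
fn load(cache_dir: &str, id: &str) -> Result<BigStruct, Box<dyn std::error::Error>> {
    let bytes = cacache::read_sync(cache_dir, id)?;
    Ok(rmp_serde::from_slice(&bytes)?)
}

Because cacache is content-addressed, re-writing identical data under the same key doesn't duplicate it on disk.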