Add ability to store cache to disk

cjbassi commented 5 years ago

I saw #11 and I think it would be good to add this functionality.

Some prior art that I'm aware of includes https://github.com/brmscheiner/memorize.py, which is a Python library for memoization that supports storing the cache on disk. Some details about that library:

- the filename of the cache for a given function is <filename-of-function>_<function-name>.cache
- the folder of the cache file is either the current directory or the directory of the source file of the memoized function, and library users can specify which option
- the file is written to on each function call
- the file is in the pickle format

IMO, I think some better options would be to:

- store the cache files in the platform-specified application cache folder, so on Linux that would be $XDG_CACHE_HOME/<appname>, or ~/.cache/<appname> by default
- write to the cache file on program exit if possible. The issue to consider with this approach is that it probably wouldn't be possible to write to the file if the program crashes/panics
- there is a rust pickle library but idk if that would be the best choice :D. Using json might be difficult since you can't use tuples as keys of objects. The full list of formats serde supports is here: https://serde.rs/#data-formats for some alternatives

Let me know what you think!

edit: So I noticed you can specify a key format, so that would solve the json tuple issue actually. Definitely a nice feature!

edit2: It would probably be good to store the caches in a subdirectory of ~/.cache/<appname>, like in a directory called memoized or something.
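A minimal sketch of the directory lookup described above, assuming Linux-style conventions only. The function names and the `memoized` subdirectory are illustrative, and the environment values are passed in as arguments so the logic can be checked without touching the real environment:

```rust
use std::path::PathBuf;

/// Resolve `<cache root>/<app>/memoized`, following the XDG convention:
/// `$XDG_CACHE_HOME` if set and non-empty, otherwise `$HOME/.cache`.
/// Returns `None` when neither value is available.
fn memoized_dir(app: &str, xdg_cache_home: Option<&str>, home: Option<&str>) -> Option<PathBuf> {
    let root = match xdg_cache_home {
        // An empty XDG variable is treated the same as an unset one.
        Some(dir) if !dir.is_empty() => PathBuf::from(dir),
        _ => PathBuf::from(home?).join(".cache"),
    };
    Some(root.join(app).join("memoized"))
}

/// Convenience wrapper that reads the real environment (hypothetical helper).
fn memoized_dir_from_env(app: &str) -> Option<PathBuf> {
    let xdg = std::env::var("XDG_CACHE_HOME").ok();
    let home = std::env::var("HOME").ok();
    memoized_dir(app, xdg.as_deref(), home.as_deref())
}
```

Other platforms have their own cache locations, so a real implementation would probably delegate this lookup to a dedicated crate rather than hand-roll it.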

jaemk commented 5 years ago

Thanks for the background info!

- cache files: I think we would be better off requiring an argument on a new cache-type that specifies a cache-directory or a specific cache-file per function
- when to write: If we have a new cache-type, you could specify the frequency you want to write to disk and provide a method for triggering a write. For panics, the best we could do is inform the user that they would need to wrap their main in a catch-panic and trigger the cache writes if they want to protect against this.
- data: We could be more restrictive here also and enforce that the function return type impls serde De/Serialize and require a key format like you mentioned
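The "wrap your main in a catch-panic" idea can be sketched with `std::panic::catch_unwind`. The helper name `run_then_flush` is hypothetical, and `flush` stands in for whatever method ends up triggering the cache write:

```rust
use std::panic::{self, AssertUnwindSafe};

/// Run `work`, then always run `flush`, even if `work` panicked.
/// Returns `Err` carrying the panic payload when `work` panicked.
fn run_then_flush<T>(
    work: impl FnOnce() -> T,
    flush: impl FnOnce(),
) -> std::thread::Result<T> {
    // AssertUnwindSafe: the caller accepts that state touched by `work`
    // may be half-updated if a panic unwinds through it.
    let outcome = panic::catch_unwind(AssertUnwindSafe(work));
    flush();
    outcome
}
```

This only helps for unwinding panics. It does nothing when the binary is built with `panic = "abort"` or when the process is killed, so it reduces the window for losing the cache rather than closing it.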

cjbassi commented 5 years ago

I like your ideas. I sometimes forget how much you can do with Rust macros.

What about having users use the Return callback from the cached_control! macro to write to disk after every call, or just write the cache on program exit? That would solve the frequency issue and cache directory/file issue. So we would just be left with figuring out how to implement {de,}serialize.

Luro02 commented 4 years ago

it might be possible to make a custom Drop impl, that writes to the file.
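A rough sketch of the Drop idea, using only the standard library. `FileBackedCache` and its tab-separated file format are made up for illustration and are not types from this crate:

```rust
use std::collections::HashMap;
use std::fs;
use std::path::PathBuf;

/// A string-to-string cache that saves itself when it goes out of scope.
struct FileBackedCache {
    path: PathBuf,
    entries: HashMap<String, String>,
}

impl FileBackedCache {
    fn new(path: PathBuf) -> Self {
        Self { path, entries: HashMap::new() }
    }

    fn insert(&mut self, key: &str, value: &str) {
        self.entries.insert(key.to_string(), value.to_string());
    }
}

impl Drop for FileBackedCache {
    fn drop(&mut self) {
        // One "key<TAB>value" line per entry; assumes keys and values
        // contain no tabs or newlines. Sorted so the file is deterministic.
        let mut lines: Vec<String> = self
            .entries
            .iter()
            .map(|(k, v)| format!("{k}\t{v}"))
            .collect();
        lines.sort();
        // Errors cannot be returned from `drop`, so they are ignored here.
        let _ = fs::write(&self.path, lines.join("\n"));
    }
}
```

One caveat: Rust does not run destructors for `static` items at program exit, so a cache that lives in a static would never hit this `drop`. The approach fits caches owned by a value that is dropped normally.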

jakobn-ai commented 3 years ago

So I'm also currently looking at this feature for a side project and would also be willing to help implement this. Personally, I'd already be happy with having serde compatibility and managing my caches by hand.

> a new cache-type

@jaemk you mean having "permanent" variants of UnboundCache, SizedCache etc. (no duplicate code, but distinct structs)? That would eliminate any questions about needing serializable keys and values which would be a breaking change.

> So we would just be left with figuring out how to implement {de,}serialize.

@cjbassi Implementing serde::{Serialize, Deserialize} for UnboundCache et al. seems pretty straightforward since those only use standard components. However, you cannot directly serialize the cache object like that because it's behind a once_cell::sync::Lazy (which doesn't implement serde, and I don't think is likely to ever do, and for our purposes it doesn't really need to). Therefore, I'm not sure where one would go with a mere implementation of serde for the caches.

jaemk commented 3 years ago

@jakobn-ai

> a new cache-type

My original thinking around this was that introducing serialize-ability to the existing cache types would mean that the type constraints would need to change to include Serialize/Deserialize, and that's not a limitation that everyone cares about.

Now that I'm thinking about it again, a better approach would probably be a new trait that lets you convert between a cache-type (whose types are constrained to Serialize/Deserialize) and a json string (or file containing json). I would think there would be "no promises" around efficiency of storing/loading this data since it may (in some circumstances) require making a bunch of copies to convert between a cache-type's layout and a layout that's serde/json-friendly.

> managing my caches by hand

Some minimal api like (assuming this new trait is in scope) UnboundCache::load_from_file("up/to/you/to/manage") and support in the proc macro...

#[cached(using_file="up/to/you/to/manage")]
fn do_stuff(...) -> u32 { ... }

would probably be sufficient (?).
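To make the shape of such a trait concrete, here is a std-only sketch. `FilePersist` is a hypothetical name, it is implemented for a plain `HashMap` rather than a real cache type, and it writes a simple `key=value` line format instead of json to avoid pulling in serde:

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical trait; not part of the `cached` crate.
trait FilePersist: Sized {
    fn save_to_file(&self, path: &Path) -> io::Result<()>;
    fn load_from_file(path: &Path) -> io::Result<Self>;
}

impl FilePersist for HashMap<String, u32> {
    fn save_to_file(&self, path: &Path) -> io::Result<()> {
        // One "key=value" line per entry; assumes keys contain no newline.
        let body: String = self.iter().map(|(k, v)| format!("{k}={v}\n")).collect();
        fs::write(path, body)
    }

    fn load_from_file(path: &Path) -> io::Result<Self> {
        let text = fs::read_to_string(path)?;
        let mut map = HashMap::new();
        for line in text.lines() {
            // Split on the last '=' so keys may themselves contain '='.
            let parsed = line
                .rsplit_once('=')
                .and_then(|(k, v)| v.parse::<u32>().ok().map(|v| (k.to_string(), v)));
            match parsed {
                Some((k, v)) => {
                    map.insert(k, v);
                }
                None => {
                    return Err(io::Error::new(io::ErrorKind::InvalidData, "malformed cache line"));
                }
            }
        }
        Ok(map)
    }
}
```

Loading copies every entry into a fresh map, which matches the "no promises around efficiency" caveat above.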

jakobn-ai commented 3 years ago

> UnboundCache::load_from_file("up/to/you/to/manage") and support in the proc macro...
> #[cached(using_file="up/to/you/to/manage")]
> fn do_stuff(...) -> u32 { ... }

@jaemk Hmmm. Do you reckon an API like serde_json::to_string(&DO_STUFF)/DO_STUFF = serde_json::from_str(...) would be possible? It wouldn't be entirely clear to me how you would map the load_from_file calls back to the specific functions (such as do_stuff in this example).

jakobn-ai commented 2 years ago

@jaemk in the meantime, would it be acceptable to make UnboundCache.store etc. fully public instead of pub(super)? That would at least provide an easy way to read the entire cache and re-populate it (not perfect as it requires re-hashing, but not terrible either IMO)

jaemk commented 2 years ago

@jakobn-ai Good idea - yeah, I'd be ok with making the UnboundCache.store public for now as a workaround

jqnatividad commented 2 years ago

Hi @jaemk @jakobn-ai , I just came across bincode, and was wondering if it can be leveraged for this feature.

jakobn-ai commented 2 years ago

@jqnatividad Bincode looks interesting. Atm, I'm using self-written formatters with serde and flate2-rs.

tirithen commented 2 years ago

Hi! This sounds like a super useful feature. The API that cached has makes it so easy to use, and having it persist to disk as well would be great. As for using_file, I would still prefer a sane default such as ~/.cache/myapplication so that the API could be as simple as possible, maybe just a file_persist = true or similar. That is however less important than having the feature at all, with using_file or whichever option works out.

I'm also thinking that it might be useful to have it first search a smaller memory cache, and only if that fails look on the disk for a cached result. What about implementing something like that?
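A small sketch of that lookup order, assuming one file per key in a cache directory. `TwoTierCache` is a made-up type, and the memory tier is left unbounded here to keep the example short:

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::PathBuf;

/// Memory-first cache with a directory of files as the slower second tier.
struct TwoTierCache {
    memory: HashMap<String, String>,
    dir: PathBuf,
}

impl TwoTierCache {
    fn new(dir: PathBuf) -> io::Result<Self> {
        fs::create_dir_all(&dir)?;
        Ok(Self { memory: HashMap::new(), dir })
    }

    /// Assumes keys are safe file names (no path separators).
    fn file_for(&self, key: &str) -> PathBuf {
        self.dir.join(key)
    }

    /// Write through: store on disk first, then in memory.
    fn set(&mut self, key: &str, value: &str) -> io::Result<()> {
        fs::write(self.file_for(key), value)?;
        self.memory.insert(key.to_string(), value.to_string());
        Ok(())
    }

    fn get(&mut self, key: &str) -> Option<String> {
        if let Some(hit) = self.memory.get(key) {
            return Some(hit.clone());
        }
        // Memory miss: fall back to disk and promote the value on success.
        let from_disk = fs::read_to_string(self.file_for(key)).ok()?;
        self.memory.insert(key.to_string(), from_disk.clone());
        Some(from_disk)
    }
}
```

After a restart the memory tier starts empty, so the first `get` for each key pays for a file read and later ones are served from memory.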

I tried giving it a shot and coding a draft of this, but when I opened the source I realized that I still know too little about macros to follow that part of the code. Has anyone else made some experiments so far?

jaemk commented 2 years ago

I've thought about this a bit, but haven't gotten to the point of trying to write any code related to it. Seems to boil down to two areas of interest that can be implemented separately:

de/serialization: I think this is the easier side, and (at least as a first pass) I think we should:

- Add two methods to the Cached trait: serialize / deserialize - or encode_to_string / decode_from_str
- Just use json to keep things simple
- Add impls to every cache store to convert back and forth between a serde_json-able structure. It could be as simple as converting every hashmap to a vec of key/value pairs, plus the additional attributes of the store and a format version. The most annoying ones will be the LRU (sized) types of caches, which will need to be listed in a specific order and re-ingested in a specific order to maintain the LRU ordering. This also raises a question about the timed caches: should the cache-set time of the entries be kept absolute (something was cached at 1am, so when it's re-ingested it maintains a cache-set time of 1am), or should the cache-set time be converted to a relative diff (something was cached at 1am, now it's 1:15am, we save a cache_age = 15minutes, when re-ingested the cache-set time is set to now - cache_age)?
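To illustrate the "vec of key/value pairs plus a format version" layout and the relative-age option for timed caches, here is a sketch with invented names (`Snapshot`, `TimedEntry`, `export`, `import`). It stops short of producing json and only shows the layout conversion:

```rust
use std::time::{Duration, Instant};

/// Bumped whenever the saved layout changes; unknown versions are ignored on load.
const FORMAT_VERSION: u32 = 1;

/// Serialization-friendly layout: an ordered list of (key, value, age) triples.
#[derive(Debug, PartialEq)]
struct Snapshot {
    version: u32,
    entries: Vec<(String, String, Duration)>,
}

/// In-memory layout of a timed entry: the value plus the instant it was cached.
struct TimedEntry {
    value: String,
    cached_at: Instant,
}

/// Export: turn each absolute `cached_at` into an age relative to `now`.
fn export(entries: &[(String, TimedEntry)], now: Instant) -> Snapshot {
    Snapshot {
        version: FORMAT_VERSION,
        entries: entries
            .iter()
            .map(|(k, e)| (k.clone(), e.value.clone(), now.saturating_duration_since(e.cached_at)))
            .collect(),
    }
}

/// Import: rebuild `cached_at` as `now - age`; an unknown version loads nothing.
fn import(snapshot: &Snapshot, now: Instant) -> Vec<(String, TimedEntry)> {
    if snapshot.version != FORMAT_VERSION {
        return Vec::new();
    }
    snapshot
        .entries
        .iter()
        .map(|(k, v, age)| {
            // Fall back to `now` if the platform cannot represent `now - age`.
            let cached_at = now.checked_sub(*age).unwrap_or(now);
            (k.clone(), TimedEntry { value: v.clone(), cached_at })
        })
        .collect()
}
```

With relative ages, time spent while the program is not running does not count toward an entry's age. Because the entries are a `Vec`, their order survives a round trip, which is what an LRU store would rely on when re-ingesting. Note also that `Instant` is opaque and only meaningful within one process, so keeping absolute times would need a wall-clock type such as `SystemTime` instead.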

persistence (macro integration): This is the harder part because everyone is going to have different needs when it comes to writing to disk. Initializing from disk seems straightforward - some option to specify a file on disk and the macro-defined cache initialization uses the new deserialize method, and those methods verify the cache format version is what they support or it's simply ignored. This can also be implemented with basic synchronous file system access since it'll happen once in a lazy-static block and library users can force it to happen at program startup by referencing the cache instance before things like a webserver are started.

One thing to get out of the way - I think persistence should be an atomic full-write, and we shouldn't go anywhere near trying to incrementally write to disk.
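One common way to get a full-write that readers never see half-finished is to write a sibling temporary file and rename it over the target. A sketch under that assumption (the `.tmp` suffix and the function name are arbitrary):

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Replace `path` with `contents` in one step: write a sibling temp file,
/// sync it, then rename it over the target.
fn write_atomically(path: &Path, contents: &[u8]) -> io::Result<()> {
    // Keep the temp file next to the target so the rename stays on one filesystem.
    let mut tmp = path.as_os_str().to_owned();
    tmp.push(".tmp");
    let tmp = PathBuf::from(tmp);

    let mut file = File::create(&tmp)?;
    file.write_all(contents)?;
    // Push the data to disk before the new file becomes visible under `path`.
    file.sync_all()?;
    drop(file);

    fs::rename(&tmp, path)
}
```

On POSIX systems the rename replaces the old file atomically, so a crash leaves either the old cache or the new one. Guarantees on other platforms and filesystems can be weaker, and two writers racing on the same temp name would still need coordination.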

Some options for persistence:

1. A set of macro options to specify:
   - "write to disk after every N function invocations" -> persistence_invocation_count
   - "write to disk when invoked after N seconds has passed since the last disk write" -> persistence_invocation_seconds
   - (This seems the most straightforward way to do this. Using this method, disk writes will need to be sync/async based on the function type)
2. A way to start a background thread to which cache invocation events are sent, and which can write to disk more frequently and without blocking the main thread of execution.
   - (This seems too complicated)
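The first option could be driven by a small bookkeeping struct that the generated code consults on every call. This is only a sketch: `PersistPolicy` is a hypothetical name, and its two fields loosely mirror the persistence_invocation_count and persistence_invocation_seconds options named above:

```rust
use std::time::{Duration, Instant};

/// Decides when a cache should be written to disk. Both limits are optional.
struct PersistPolicy {
    invocation_count: Option<u32>,
    invocation_interval: Option<Duration>,
    calls_since_write: u32,
    last_write: Instant,
}

impl PersistPolicy {
    fn new(invocation_count: Option<u32>, invocation_interval: Option<Duration>, now: Instant) -> Self {
        Self { invocation_count, invocation_interval, calls_since_write: 0, last_write: now }
    }

    /// Record one invocation at time `now`. Returns true when a disk write
    /// is due, and resets both counters in that case.
    fn on_invocation(&mut self, now: Instant) -> bool {
        self.calls_since_write += 1;
        let by_count = self
            .invocation_count
            .map_or(false, |n| self.calls_since_write >= n);
        let by_time = self
            .invocation_interval
            .map_or(false, |d| now.saturating_duration_since(self.last_write) >= d);
        if by_count || by_time {
            self.calls_since_write = 0;
            self.last_write = now;
            true
        } else {
            false
        }
    }
}
```

Because the check only runs when the function is invoked, a time-based write happens on the first call after the interval has passed, not at the moment it passes.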

tirithen commented 2 years ago

For my use case I would be more interested in serializing/deserializing by writing the struct binary data to files directly. As this is only for caching, I would not be that interested in having the files on disk human readable in JSON or other text formats. A huge plus would also be if it were possible without the need to manually add derive traits to every data type that I might want to cache. Bincode looked interesting but also seems to need the derive traits Encode/Decode on the structs, same as serde's Serialize/Deserialize.

The cached crate has a big bonus as I think about it in that it does not need a lot of setup with the easy use cases, only having to add something like #[cached(size=100)] without having to add other macros/traits to the return types is super helpful I think (unless that will anyway be the case even if using serde, or bincode, I'm not that good on what is possible with macros yet).

The feature needs might of course vary depending on use case, but for the ones I'm thinking of saving as binary seems more performant than strings. Just some thoughts at least.

jqnatividad commented 2 years ago

Just wanted to add that with the recent addition of the RedisCache store in 0.32 you effectively have the ability to store cache entries to disk, as Redis by default persists unexpired cache entries to disk between sessions.

In my application, I used the io_cached macro with a very long TTL (28 days) and the cache is persisted between sessions:

use once_cell::sync::Lazy;

static DEFAULT_REDIS_CONN_STR: &str = "redis://127.0.0.1:6379";
static DEFAULT_REDIS_TTL_SECONDS: u64 = 60 * 60 * 24 * 28; // 28 days in seconds

struct RedisConfig {
    conn_str: String,
    ttl_secs: u64,
    ttl_refresh: bool,
}
impl RedisConfig {
    fn load() -> Self {
        Self {
            conn_str: std::env::var("QSV_REDIS_CONNECTION_STRING")
                .unwrap_or_else(|_| DEFAULT_REDIS_CONN_STR.to_string()),
            ttl_secs: std::env::var("QSV_REDIS_TTL_SECS")
                .unwrap_or_else(|_| DEFAULT_REDIS_TTL_SECONDS.to_string())
                .parse()
                .unwrap(),
            ttl_refresh: std::env::var("QSV_REDIS_TTL_REFRESH").is_ok(),
        }
    }
}

static REDISCONFIG: Lazy<RedisConfig> = Lazy::new(RedisConfig::load);

#[io_cached(
    type = "cached::RedisCache<String, String>",
    key = "String",
    convert = r#"{ format!("{}", url) }"#,
    create = r##" {
        RedisCache::new("qf", REDISCONFIG.ttl_secs)
            .set_refresh(REDISCONFIG.ttl_refresh)
            .set_connection_string(&REDISCONFIG.conn_str)
            .build()
            .expect("error building redis cache")
    } "##,
    map_error = r##"|e| FetchRedisError::RedisError(format!("{:?}", e))"##,
    with_cached_flag = true
)]
fn get_redis_response(
    url: &str,
    client: &reqwest::blocking::Client,
    limiter: &governor::RateLimiter<NotKeyed, InMemoryState, DefaultClock, NoOpMiddleware>,
    flag_jql: &Option<String>,
    flag_store_error: bool,
) -> Result<cached::Return<String>, FetchRedisError> {
  // code here
}

And with a robust, mature cache store like Redis, you have the added bonus of being able to use tools like redis-cli to monitor and manage/expire the cache (e.g. redis-cli monitor).

cc @cjbassi , @jakobn-ai , @tirithen

tirithen commented 1 year ago

I re-discovered the same limitation yet again when starting on another project. The Redis support is great for anyone using Redis on their project.

But adding an entire Redis service just to use this crate makes little sense for projects not using Redis for persisting other state.

Having an easy-to-use save-to-disk option as well makes a lot of sense for some projects and would therefore make sense as another option.

For a lot of projects part of the goal is to have it run as a single executable without relying on external services to run, in those cases Redis is not an option.

Does anyone have hints on other good crates that offer cache annotation with disk persistence, which those of us who need it can try out until this issue is resolved?

@jqnatividad @jaemk @cjbassi

jqnatividad commented 1 year ago

For my qsv project, leveraging Redis to persist the cache across sessions was "good enough."

But I agree that it'd be nice if its CLI didn't have an external dependency (side note: qsv optionally bundles luau and python as interpreters to help with data-wrangling tasks, and bundling luau was soooo much easier, even though it's relatively obscure, because it's purpose-made to be an embeddable interpreter, compared to Python with its external dependencies).

I recently came across the memmap2 crate, and it seems to be well-suited for persisting the cache to disk.

@tirithen , perhaps, you can give integrating it a try?

tirithen commented 1 year ago

@jqnatividad thank you for the suggestion, I'm interested in giving it a try, and I'm looking at the code now. My current rust skill set makes it hard to read through the macro-related code. I still think it is an interesting challenge, but I might need some help to get going.

Then there is also the challenge of agreeing on an API to try out. For the feature to be worth it, I think that the API must be fairly minimal. With Redis there is host and credentials configuration. For the disk cache case the user is already logged in, so it should be possible to simplify.

What I'm hoping for is something similar to adding a disk=true flag to the macro:

#[cached(disk=true, result=true)]
async fn fetch_wikipage(name: &str) -> Result<String, Error> {
    // `?` unwraps each step, so the final String has to be wrapped in Ok(...)
    let body = reqwest::get(format!("https://en.wikipedia.org/wiki/{}", name)).await?.text().await?;
    Ok(body)
}

It would then use memmap2 or similar to store the cached result to a file named ~/.cache/*binname*/memoized/*hash-of-result*. I suppose that we might want to use something more "Windows like" for machines running that OS?
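A sketch of how such a per-entry file path could be derived, assuming the file name is a hash of the cache key (the function arguments) and not of the result. FNV-1a is used here only because its output is fixed by the algorithm, whereas the standard library's `DefaultHasher` does not promise the same output across Rust releases, which matters for files that outlive one build. Both function names are hypothetical:

```rust
use std::path::{Path, PathBuf};

/// 64-bit FNV-1a hash of a byte string.
fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for byte in bytes {
        hash ^= u64::from(*byte);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

/// Build `<cache_root>/<bin_name>/memoized/<hex hash of key>`.
fn entry_path(cache_root: &Path, bin_name: &str, key: &str) -> PathBuf {
    cache_root
        .join(bin_name)
        .join("memoized")
        .join(format!("{:016x}", fnv1a_64(key.as_bytes())))
}
```

A 64-bit hash can collide, so a real store would also want to keep the full key inside the file and compare it on read.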

Probably also hide this code behind a feature flag as it does not make much sense in a WASM build.

How would something like this look to you @jaemk? Are the cache file paths too hard-coded, do we need configuration options, or could something as simple as a single flag work well enough? You have previously mentioned a separate cache type. The concern with that is ease of use; for example, providing a cache file path per function becomes too verbose in my opinion. But maybe having a #[cached_disk] macro utilizing the existing MacroArgs could be an alternative.

In any case I'm willing to give it a try, at least if I can have some directions on how to get started, and at least a draft of the desired API.

jaemk commented 1 year ago

Thanks for the suggestions and questions @jqnatividad and @tirithen! There's been a good amount of thinking in this thread about how the disk writes should be handled and how the interface should look. I think I have an answer as to how I'd like both to be implemented:

For actual persistence, use https://github.com/spacejam/sled instead of cooking up a home grown disk storage solution. This provides a nice kv interface over a file and lets you configure the important bits: background flush interval (and also allows manual flushing), file location, and in-process cache size (to avoid disk).

For the interface, a new cache type should be created that largely mirrors the Redis and AsyncRedis cache types. It should implement the IoCached traits which will make it compatible with the io_cached macro and almost a drop-in replacement for the current redis types. Configuration is important, so we'd want to allow configuring the file path on a per-function basis, like the Redis type lets you configure key names, while defaulting to a function-name-derived file in a cache directory (cache dir being os specific). Also allow configuring the flush interval and cache size. We may want to add an optional flag, flush_always=true/false, to always call flush or flush_async after writing in case you have a use-case where you want to trade performance for durability - this bit isn't critical though and could be added later.

tirithen commented 1 year ago

@jaemk and @jqnatividad I got some time to make progress on the disk cache. I added it as a draft pull request. It is by no means complete, but as it takes some time for me to write, it would be great to get some feedback on whether it looks like it's going in the right direction before I spend too much more time.

The most interesting one is the disk.rs store (for now no async code, only sync) https://github.com/jaemk/cached/pull/145/files#diff-44969ddcb461480249b34639a98545dcffd53d3383bae5084c0cf5b314b8faa8

I got the tests inside that file to pass, but got a bit stuck with the proc macro parts.

I used sled as suggested. It does not, however, have built-in support for expiring keys, so for now I just added a method that cleans up when cache_get runs and more than 10 seconds have elapsed since the last cleanup. Maybe this part should run on a separate thread later on, so as not to stall the result from the cache_get call.

Have a look and let me know what you think about the approach, and if it makes sense I could continue a bit more.
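The cleanup-on-read idea can be shown with an in-memory stand-in. This is not the code from the pull request: `ExpiringStore` is an invented type that uses a `HashMap` where the draft uses sled, and time is passed in explicitly so the behaviour is easy to check:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Minimum time between two sweeps of expired entries.
const CLEANUP_INTERVAL: Duration = Duration::from_secs(10);

struct ExpiringStore {
    /// Each value is stored together with its expiry deadline.
    entries: HashMap<String, (String, Instant)>,
    last_cleanup: Instant,
}

impl ExpiringStore {
    fn new(now: Instant) -> Self {
        Self { entries: HashMap::new(), last_cleanup: now }
    }

    fn set(&mut self, key: &str, value: &str, ttl: Duration, now: Instant) {
        self.entries.insert(key.to_string(), (value.to_string(), now + ttl));
    }

    fn get(&mut self, key: &str, now: Instant) -> Option<String> {
        // Sweep at most once per interval so that most reads stay cheap.
        if now.saturating_duration_since(self.last_cleanup) >= CLEANUP_INTERVAL {
            self.entries.retain(|_, (_, deadline)| *deadline > now);
            self.last_cleanup = now;
        }
        // Entries can expire between sweeps, so the deadline is checked again here.
        match self.entries.get(key) {
            Some((value, deadline)) if *deadline > now => Some(value.clone()),
            _ => None,
        }
    }
}
```

The sweep runs on the caller's thread, so the read that triggers it pays for scanning every entry.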

lcmgh commented 11 months ago

I'd also love to see a disk cache. Think of kubernetes volumes, which are very easy to set up compared to a redis setup.

AstroOrbis commented 7 months ago

Would love to see this (as well as autoloading from disk) - sqlite and other "caching" alternatives are much slower.

omid commented 7 months ago

@tirithen what's the state of your 1-year-old PR?

tirithen commented 7 months ago

@omid hi! I have not worked on it for a while now. My first problem was that it took a lot of time to work with, as the code was so abstract, and I have not had the same direct need for this for a while.

As I remember it, my personal need was also for a solution with a simple, minimal annotation to cache a function result to disk, rather than the more verbose/complete io_cached examples above. My need was to use something like the directories crate to find the system user cache directory automatically and append the binary name and somehow the function name. That would mean that you only need to annotate functions with something like #[cached(disk=true)] and the actual file path would be auto-generated.

I felt that the preference of the maintainers was something more like the more advanced configuration capabilities of the io_cached API, so maybe my needs were not the same, and maybe these simpler annotations would be more easily solved in a separate crate? I have not worked any more on this as I have been busy on other projects.

jaemk commented 7 months ago

Hey guys, I will pick up the old PR and finish it up. We can have a simpler annotation than what's required for redis since there isn't a need to connect to some external system

AstroOrbis commented 7 months ago

Something like #[disk_cached] would be great shorthand, with a sensible default of course - optionally passing in a folder or file (which one is faster? gonna need to benchmark that) would be good. In a similar vein - would this also be a viable option for permanent/invalidatable-on-command caching?

jaemk commented 7 months ago

Permanent as in it's valid as long as the backing file exists? I think so

AstroOrbis commented 7 months ago

Yep, that's it - also a hook to programmatically delete a certain key could come in handy

tirithen commented 7 months ago

@jaemk thanks for considering implementing this, I'm certain that it will be a great addition for single-user applications, such as cli tools and local gui applications, that "restart" often.

If it helps, my use case needs have mostly been about "throttling" things like HTTP requests with a timeout for cli tools. As they restart each time you execute them, in-memory caching does not make sense, and as the cli tool should be self-contained, external services like Redis do not make any sense either. Both in-memory and Redis mainly make sense for servers/services.

I understand that there might be some configuration needed, it is a great thing being able to tweak where possible, but having as many of these as opt-in as possible I believe would be great.

jaemk commented 7 months ago

merged #145 , released in 0.49.0

jaemk commented 7 months ago

> Yep, that's it - also a hook to programmatically delete a certain key could come in handy

@AstroOrbis any cache-annotated function has a cache that's the function name in all caps (unless name is specified) with the same visibility (pub) as the cache-annotated function. For those generated by #[io_cached], you can call IOCached methods on them directly (for regular #[cached] you need to .lock/.read/.write first)

ex: https://github.com/jaemk/cached/blob/afb670c91e0fd9051b28edde47d856593f23f138/examples/disk.rs#L41

AstroOrbis commented 7 months ago

if it's by function name, wouldn't conflicts arise? Something like cachedir/crate_name/function_name maybe would work better, dunno

jaemk commented 7 months ago

Sure, conflicts are possible. That's why there's the name arg to specify your own name for the cache. In practice though, it's relatively unlikely there's another ident in the same package with an ALL_CAPS name of the function you're caching

tirithen commented 7 months ago

Fantastic! You got it merged, this will be great for cli tools for caching between runs.

omid commented 1 month ago

@jaemk Should we close this?

jaemk / cached

Add ability to store cache to disk #20