marimo-team / marimo

A reactive notebook for Python — run reproducible experiments, execute as a script, deploy as an app, and version with git.
https://marimo.io
Apache License 2.0

Native Cached Results #1306

Closed. dmadisetti closed this issue 2 months ago.

dmadisetti commented 4 months ago

Description

I find myself using this pattern a lot

if not os.path.isfile("sample_selection_test_all.npy"):
    np.save(
        "sample_selection_test_all", data_sample_export(test_data, test_sample)
    )
    np.save(
        "sample_selection_train_all",
        data_sample_export(train_data, train_sample),
    )

exported_train = np.load("sample_selection_train_all.npy")
exported_test = np.load("sample_selection_test_all.npy")

I wonder if saving computations + saving outputs is on the roadmap. (The docs recommend functools memoization, but I like to restart notebooks sometimes.)

Suggested solution

result = mo.saved(x)

Execution unwraps result, saving it to a file or loading it; alternatively, the result could be unwrapped after cell execution.

For the stateless case, the ref DAG could be hashed to produce the file name (a hash of the concatenated cell contents).


mo.save(plt.gca()) # Last line

This allows the cell to still have a default output, with state ensured by the ref DAG hash.
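
A minimal sketch of that stateless hashing, assuming the runtime can supply the referenced cells' source code (`cell_code` and `ancestor_codes` are hypothetical placeholders, not marimo APIs):

import hashlib

def cache_key(cell_code: str, ancestor_codes: list[str]) -> str:
    # Hash the concatenated ancestor + current cell contents to name the file
    digest = hashlib.sha256()
    for code in [*ancestor_codes, cell_code]:
        digest.update(code.encode("utf-8"))
    return digest.hexdigest()

# e.g. cache_key("mo.save(plt.gca())", ["x = load_data()", "y = transform(x)"])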

Alternative

Tons of ways of doing this; I just provided one suggestion.

Additional context

No response

akshayka commented 4 months ago

It's something Myles and I have talked about, yes! We even considered using the same API name. So yes, we'll likely support something like this.

dmadisetti commented 3 months ago

@akshayka @mscolnick Do you all have a road map for this? Here is a very rough outline

def save(callback: Callable[[], T],
         sign_and_verify: Union[None, str, tuple[Callable[[str], str], Callable[[str], bool]]] = None,
         store_and_load: Union[None, str, tuple[Callable[[str], str], Callable[[str], bool]]] = None,
         pin_marimo_lib: bool = False,
         pin_sys_libs: bool = False,
         pin_local_libs: bool = True,
         save_path: str = "outputs",
         *,
         name: str) -> T:
    """
    Save functionality.

    Save a function output under a hash of the cell and its ancestors; as such, relevant changes to the
    compute graph will invalidate a saved value. For an external example of a similar procedure,
    see the NAR spec on how inputs can be used to hash and verify file compute results:
    https://edolstra.github.io/pubs/phd-thesis.pdf (5.2.2. Store paths, p. 93), and the relevant
    section motivating store paths.

    Args:
        callback (Callable[[], T]): The function to be executed and its result saved.
        sign_and_verify (Union[None, str, tuple[Callable[[str], str], Callable[[str], bool]]]): Optional. Function or functions to sign and verify the output. If a string is provided, it is assumed to be the name of a known signing and verifying method.
        store_and_load (Union[None, str, tuple[Callable[[str], str], Callable[[str], bool]]]): Optional. Function or functions to store and load the output. If a string is provided, it is assumed to be the name of a known storing and loading method.
        pin_marimo_lib (bool): If True, the marimo library version is used for cache invalidation. Default is False.
        pin_sys_libs (bool): If True, the system libraries versions are used for cache invalidation. Default is False.
        pin_local_libs (bool): If True, the local libraries, using the __version__ variable, are used for cache invalidation. Default is True.
        save_path (str): The path where the output will be saved. Default is "outputs".
        name (str): The name of the output file.

    Returns:
        T: The result of the callback function.

    Raises:
        Exception: If the output cannot be signed, stored, or loaded.
    """

    ...
    # Run traceback, get content for code up to function call, and compile
    # Hash ancestors
    # Create digest of combined ancestors and current code content
    # if {save_path}/{hash}/{name} doesn't exist
    #  - Run function
    #  - Run sign on outputs
    #  - Save in {save_path}/{hash}/{name}
    # else:
    #  - load data!
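
A rough, hedged sketch of those steps, using pickle as a stand-in store and `cell_code`/`ancestor_codes` as hypothetical inputs for whatever the runtime would actually provide (not marimo's implementation):

import hashlib
import os
import pickle
from typing import Callable, TypeVar

T = TypeVar("T")

def save_sketch(callback: Callable[[], T], *, name: str, cell_code: str,
                ancestor_codes: list[str], save_path: str = "outputs") -> T:
    # Create a digest of the combined ancestor and current code content
    digest = hashlib.sha256()
    for code in [*ancestor_codes, cell_code]:
        digest.update(code.encode("utf-8"))
    path = os.path.join(save_path, digest.hexdigest(), name)

    if os.path.exists(path):
        # Cache hit: load data
        with open(path, "rb") as f:
            return pickle.load(f)

    # Cache miss: run the function and persist the result
    result = callback()
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result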

I have a stub I could flesh out, but I wanted to discuss it in more detail first. Will definitely be smaller than my last PR lol, but all the same

akshayka commented 3 months ago

@dmadisetti — thanks for the design sketch. It's really cool!

We don't yet have a roadmap or design spec — yours is a great place to start.

Usage.

To check my understanding: Is this usage correct?

Cell 1:

x_value = 1
y_value = 2

Cell 2:

def add(x, y):
  return x + y

func = lambda: add(x_value, y_value)
x_plus_y = mo.save(func)

Decorator

Would it be possible to have an API that could be used as a decorator?

@mo.save
def add(x, y):
  return x + y

x_plus_y = mo.save(add)

or even


@mo.save
@functools.cache
def add(x, y):
  return x + y

x_plus_y = mo.save(add)

Signature.

The function has many arguments. Is sign_and_verify really necessary? What use case do you foresee for that?

dmadisetti commented 3 months ago
@mo.save("name")
def add(x, y):
  return x + y

x_plus_y = add()

?

I think that's doable. The implementation I was thinking of uses the call frame to build call context.

If a function is "pure" then the result is dependent directly on the arguments and not necessarily the execution path, so maybe that needs to be decoupled from this spec. functools.cache is set up to capture these cases.

There's also a nix analog for this (idk if it's in the thesis) with content addressable storage, which checks the contents of the inputs and not the input paths (what's in my spec here).
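
For the content-addressable analog, a hedged sketch of a key derived from argument values rather than code (pickle-based and purely illustrative; not a marimo API):

import hashlib
import pickle

def content_addressed_key(*args, **kwargs) -> str:
    # Key depends only on the (pickled) argument values, not on how they were produced
    return hashlib.sha256(pickle.dumps((args, sorted(kwargs.items())))).hexdigest()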

Is sign_and_verify really necessary?

The NAR scheme gives you guarantees of the values that go into your computation, but there's no method of checking the "correctness" of the result.

Nix gets around this by making the "store" strictly read-only + caching results externally. That means that the result is essentially tamper proof. marimo doesn't have these guarantees baked in. I think an external cache is a bit much, and an RO file system is not easily enforceable (nix is pretty invasive in this regard). But we can use something like a pgp signature to verify a computation result. Probably a bit more academic than practical, but was still trying to think through the "correct" way of doing things. It could just be a set it up and forget it like GitHub signatures.
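
As a simplified illustration of the signing idea, an HMAC over the stored bytes with a user-held key would give the same kind of tamper check without pulling in pgp (`SECRET_KEY` here is a hypothetical placeholder):

import hashlib
import hmac

SECRET_KEY = b"user-held secret"  # hypothetical; could come from a keyring

def sign(artifact: bytes) -> str:
    return hmac.new(SECRET_KEY, artifact, hashlib.sha256).hexdigest()

def verify(artifact: bytes, signature: str) -> bool:
    return hmac.compare_digest(sign(artifact), signature)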

As for the other arguments, it's more to allow overriding if needed, with the expectation it'll mostly be default.

akshayka commented 3 months ago

How about

@mo.save("name")
def add(x, y):
  return x + y

x_plus_y = add(1, 2)

as well as

@mo.save("name")
def add(x, y):
  return x + y + some_ref

x_plus_y = add(1, 2)

with mo.save() returning a callable whose signature matches the wrapped function? I understand that this doesn't match your spec (statically determined input paths). But otherwise (IIUC) users would need to rewrite this as

def add(x, y):
  return x + y

func = lambda: add(x_value, y_value)
x_plus_y = mo.save(func, "name")

which feels non-obvious.

If a function is "pure" then the result is dependent directly on the arguments and not necessarily the execution path, so maybe that needs to be decoupled from this spec ... There's also a nix analog for this (idk if it's in the thesis) with content addressable storage, which checks the contents of the inputs and not the input paths (what's in my spec here).

It would be really nice to be able to use the same function — mo.save — regardless of whether the function was "pure" ("refless") or not. Especially since functools.cache doesn't save to disk.

It's okay if this isn't possible ... you've thought about this much more than I have. Thoughts?

Probably a bit more academic than practical, but was still trying to think through the "correct" way of doing things. It could just be a set it up and forget it like GitHub signatures.

Okay, sounds good, makes sense.

As for the other arguments, it's more to allow overriding if needed, with the expectation it'll mostly be default.

Sounds good.

dmadisetti commented 3 months ago

save could hide the details of hashing by content vs execution. A little naive, but the "primitive check" like in strict execution could be a way to determine "pure" functions. The pure case could kick in for your first example, and the execution path for the second, resolved by required_refs.

Keeping saved results in memory like functools seems reasonable, and execution saves could invalidate the cache more meaningfully than functools does. I'm sure there's an edge case where it will eat up a user's memory, but I think that's a premature optimization until someone complains.
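
A hedged sketch of that two-tier behavior, keeping results in memory but backing them with pickle files on disk (names are illustrative only, not proposed APIs):

import os
import pickle

class TwoTierCache:
    """Keep results in memory like functools.cache, with a disk fallback."""

    def __init__(self, save_path: str = ".marimo_outputs"):
        self.memory: dict[str, object] = {}
        self.save_path = save_path

    def get(self, key: str):
        if key in self.memory:
            # Fast path: in-memory hit
            return self.memory[key]
        path = os.path.join(self.save_path, key)
        if os.path.exists(path):
            # Slow path: load from disk and warm the in-memory cache
            with open(path, "rb") as f:
                value = pickle.load(f)
            self.memory[key] = value
            return value
        raise KeyError(key)

    def put(self, key: str, value) -> None:
        self.memory[key] = value
        os.makedirs(self.save_path, exist_ok=True)
        with open(os.path.join(self.save_path, key), "wb") as f:
            pickle.dump(value, f)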

For API

saved = mo.savable(foo, params)(a,b,c)

Or

with mo.saved_context(params):
    x = add(1, 2, 3)

The latter is tricky, but maybe possible with a bit of static analysis and swapping out the relevant functions, or by hackily skipping the execution block and running it through an exec (which I think is less error-prone for marimo, but might be harder for a user to debug).

Another option is saving on a cell level. Then you don't have to deal with passing in wrapped values, you can get a saved result through a flag.

akshayka commented 3 months ago

Another option is saving on a cell level. Then you don't have to deal with passing in wrapped values, you can get a saved result through a flag.

Yea I thought about that -- that could be pretty nice. If enabled, notebooks could populate with outputs instantly on startup, which would be pretty great.

saved = mo.savable(foo, params)(a,b,c)

So this would be a different API than mo.save? What are params and how are they different from a, b, and c?

I guess fundamentally there are two different types of caching, which to the user may feel the same but whose implementations are different (content addressed vs path addressed). And I'm wondering if we can have the same API for both. It's okay if we can't, but we should just be clear on that.

By the way, for all of this -- mo.save acts like a cache, in that it can and will invalidate old entries and delete them from disk, right? Which means that you shouldn't use mo.save if you want to permanently save an artifact to disk or elsewhere (e.g. a plot, a numpy array, etc.). If I got that right, maybe the API should be called mo.cache.

dmadisetti commented 3 months ago

So this would be a different API than mo.save? What are params and how are they different from a, b, and c?

params being the keywords like name and save_path. It's a little different. The with block is definitely possible though

I think it's doable, but it might be worth exposing the individual functions (execution_cache, function_cache) as well, because the behavior is going to be subtly different.

Right, so we haven't mentioned deleting at all; if runs are going to be invalidated then that's definitely worth documenting and renaming from save.

I think the execution-path case has less of a reason to invalidate the cache, but a function called many times should likely be a "cache" over a "save".

akshayka commented 3 months ago

params being the keywords like name and save_path

Got it, makes sense.

I think it's doable, but it might be worth exposing the individual functions (execution_cache, function_cache) as well, because the behavior is going to be subtly different.

Ok, good point about the behavior being different — could be save for the execution path case, cache for the one based on runtime values, if we end up adopting the saving/caching behavior you suggest.

Right, so we haven't mentioned deleting at all; if runs are going to be invalidated then that's definitely worth documenting and renaming from save.

Ah — in your spec I assumed "invalidate" meant "delete", but you're right saved values could be invalidated without actually deleting them from disk.

To bring it all together, we've proposed ...

  1. An API that saves to disk (and loads from disk) values based on ancestors + code, ie based on the execution path. I like your spec. When a saved value is "invalidated", we need to decide whether the saved artifact remains on disk.
  2. Maybe an API for caching function values to disk.
  3. Maybe overloading the API from 1 to handle 2 as well, but I see your point now why the two might be better separate.
  4. An option to bake "saving" into the runtime on a cell level — could be controlled via an app config. In this case saved values would likely be deleted from disk upon invalidation.

I'm all for some version of these features in marimo. How do you think we should proceed?

dmadisetti commented 2 months ago

Here's a proposed rollout:

1.) Execution-path first with the with statement, to remove any ambiguity w.r.t. the types of caching (the "function" cache doesn't really make sense in this context)

with mo.experimental.cached_context(name="expensive_calls", ...):
    x = expensive_call()
    y = another_expensive_call(x)
# x and y are "cached", there's a python workaround that can just conditionally skip this block, and populate x y from load
...
def cached_context(
                  pin_marimo_lib : bool = False, # could also go by "major" | "minor" | "patch" | None; default minor here on the safe side
                  pin_sys_libs : bool = False, # major
                  pin_local_libs : bool = True, # patch
                  max_size=None,
                  save_path : str = ".marimo_outputs", # or just outputs
                  *,
                  name : str):
    """
    same as above, just reduced. I think the load and sign fns were premature optimization
    additionally "max_size=None" which will specify the max number of saves before LRU cleanup.
    """

2.a) This will unlock cell level save / cache (basically the same as wrapping the whole block in a with, but can have tighter integration + would need some UI work / a CellConfig attribute)

2.b) Can be done concurrently with or before 2.a: function-level calls. mo.persistent_cache as a drop-in replacement for functools.cache, using the save-to-disk infrastructure that 1) lays down, plus more sophisticated argument checking (see the sketch below).
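
A hedged sketch of what such a drop-in could look like, with deliberately naive pickle-based argument hashing (`persistent_cache_sketch` is illustrative only, not the proposed mo.persistent_cache):

import functools
import hashlib
import os
import pickle

def persistent_cache_sketch(save_path: str = ".marimo_outputs"):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # Naive argument check: hash the pickled qualname + arguments
            key = hashlib.sha256(
                pickle.dumps((fn.__qualname__, args, sorted(kwargs.items())))
            ).hexdigest()
            path = os.path.join(save_path, key)
            if os.path.exists(path):
                with open(path, "rb") as f:
                    return pickle.load(f)
            result = fn(*args, **kwargs)
            os.makedirs(save_path, exist_ok=True)
            with open(path, "wb") as f:
                pickle.dump(result, f)
            return result
        return wrapper
    return decorator

@persistent_cache_sketch()
def add(x, y):
    return x + y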


Future

akshayka commented 2 months ago

@dmadisetti , this looks good to me! We'd just need to make sure to get the naming right so that the difference between cached_context and persistent_cache was clear.

In my mind, we're good to start work on 1) if you're still interested!

dmadisetti commented 2 months ago

@akshayka I want to limit feature creep like with my last feature. Sending you a notebook over Discord (I failed at limiting feature creep, because both content-addressed and execution-path caching are implemented).

Play around with it, and give me your thoughts

Known bugs: