marimo-team / marimo

A reactive notebook for Python — run reproducible experiments, execute as a script, deploy as an app, and version with git.
https://marimo.io
Apache License 2.0

Persistent cache with a polars dataframe #2661

Open AdrienDart opened 1 month ago

AdrienDart commented 1 month ago

Describe the bug

Hi,

I'm trying to save a polars dataframe in cache using the following operation.

import marimo as mo
import polars as pl
from vega_datasets import data

# data.iris() returns a pandas DataFrame; convert it to polars
df = data.iris().pipe(pl.from_pandas)
with mo.persistent_cache('my_cache'):
    df1 = df

I get TypeError("Cannot change data-type for object array."). Sorry, I can't post the whole traceback, but the error is raised at line 217 of hash.py, in data_to_buffer. Is that expected?

A workaround that works is:

df = df.lazy()
with mo.persistent_cache('my_cache'):
    df1 = df.collect()

Thanks,

Adrien

Environment

Marimo 0.9.10

Code to reproduce

See above.

dmadisetti commented 1 month ago

No, this looks like a bug: marimo should detect whether the object is serializable in the way it expects, and this exception is thrown when there is a discrepancy. There's a bit of dataframe-checking logic under the hood, so I think this might be solved by moving that logic to narwhals.

Thanks for the easily reproducible code. In the meantime, you may be able to get around this by defining df in a separate cell.
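
To illustrate the kind of check described above, here is a minimal sketch of a cache-key function that first tests whether an object can be serialized and falls back to a coarser key instead of raising. It is not marimo's hash.py: the names try_content_hash and cache_key are hypothetical, and pickle stands in for whatever buffer conversion the real hashing code performs.

```python
import hashlib
import pickle
from typing import Any, Optional


def try_content_hash(obj: Any) -> Optional[str]:
    """Return a SHA-256 hex digest of obj's pickled bytes, or None if obj cannot be pickled."""
    try:
        payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    except (pickle.PicklingError, TypeError, AttributeError):
        # The object is not serializable in the expected way; report that
        # to the caller instead of letting the exception escape.
        return None
    return hashlib.sha256(payload).hexdigest()


def cache_key(obj: Any) -> str:
    """Build a cache key (hypothetical scheme): content-based when possible, type-based otherwise."""
    digest = try_content_hash(obj)
    if digest is not None:
        return "content:" + digest
    # Fallback for unserializable objects: key on the type name only.
    return "opaque:" + type(obj).__qualname__
```

With this shape, an unserializable input degrades to a less precise key rather than surfacing a TypeError from deep inside the hashing code.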

AdrienDart commented 4 weeks ago

Also, a quick question: I noticed the cached dataframe is saved as a pickle. Could it be saved as parquet instead, for better performance and memory usage? Thanks for your help!

dmadisetti commented 4 weeks ago

Sure, I don't think any given file format should replace pickle, but maybe we'll expose a setting to choose a "loader" type.

Here's the pickle loader for your reference. I don't think it'd be too tricky to implement one for any given storage type:

https://github.com/marimo-team/marimo/blob/main/marimo/_save/loaders/pickle.py

A couple of other ideas were npz, dill, and a remote cache.

If you did want to play with this, the undocumented keyword arg _loader would let you inject a loader instance. You can see how we do this in testing: https://github.com/marimo-team/marimo/blob/45056be4c37ed79e28370222f2e7bd89c017050c/tests/_save/test_cache.py#L49
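
As a rough picture of what a pluggable "loader" could look like, the sketch below defines a small file-backed base class with two storage formats. Everything here is an assumption for illustration: FileLoader, PickleFileLoader, JsonFileLoader, and roundtrip are hypothetical names, and the method set (cache_hit, save, load) is not taken from marimo's loader interface, which is defined in the pickle.py file linked above.

```python
import json
import pickle
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Type


class FileLoader(ABC):
    """Hypothetical base class: stores one file per cache key in a directory."""

    suffix: str = ""

    def __init__(self, directory: str) -> None:
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def path_for(self, key: str) -> Path:
        return self.directory / (key + self.suffix)

    def cache_hit(self, key: str) -> bool:
        return self.path_for(key).exists()

    def save(self, key: str, value: Any) -> None:
        self.path_for(key).write_bytes(self.dumps(value))

    def load(self, key: str) -> Any:
        return self.loads(self.path_for(key).read_bytes())

    @abstractmethod
    def dumps(self, value: Any) -> bytes: ...

    @abstractmethod
    def loads(self, raw: bytes) -> Any: ...


class PickleFileLoader(FileLoader):
    suffix = ".pickle"

    def dumps(self, value: Any) -> bytes:
        return pickle.dumps(value)

    def loads(self, raw: bytes) -> Any:
        return pickle.loads(raw)


class JsonFileLoader(FileLoader):
    suffix = ".json"

    def dumps(self, value: Any) -> bytes:
        return json.dumps(value).encode("utf-8")

    def loads(self, raw: bytes) -> Any:
        return json.loads(raw.decode("utf-8"))


def roundtrip(loader_cls: Type[FileLoader], value: Any) -> Any:
    """Save value with the given loader in a temporary directory and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        loader = loader_cls(tmp)
        loader.save("example", value)
        return loader.load("example")
```

Under this design a new storage type only has to supply dumps and loads. A parquet variant would follow the same pattern, delegating those two methods to the dataframe library's parquet reader and writer.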