Caching (docs / examples / tests)

scipp / sciline

Build scientific pipelines for your data

BSD 3-Clause "New" or "Revised" License


Caching of intermediate results is probably out of scope for Sciline. However, it could be useful to provide helpers (such as decorators) that a user can apply to cache objects that may get reused across multiple compute() calls, for example when downloading or loading a big file.

By making this an explicit wrapper instead of trying to implement a complex and hard to control internal mechanism we:

Force users to think about whether and where they actually need this.
Give full control (which things will be cached).
Keep Sciline simple.

An alternative would be to recommend computing the intermediate result directly, and providing this as an instance-provider to Pipeline. One important requirement (for either solution) would be that it can be turned on or off with ease.

Note that for 0-ary functions a user can simply use, e.g., functools.lru_cache.
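A minimal sketch of the 0-ary case, using only the standard library. The provider name `load_big_file` and the counter `load_count` are made up for illustration; the two direct calls stand in for the provider being invoked by two separate compute() calls.

```python
import functools

load_count = 0

@functools.lru_cache(maxsize=None)
def load_big_file() -> bytes:
    """Hypothetical 0-ary provider standing in for an expensive load."""
    global load_count
    load_count += 1
    return b"x" * 1024

# The function body runs only on the first call; the second call
# returns the cached object.
result_a = load_big_file()
result_b = load_big_file()
```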

For unary (or higher-arity) functions, lru_cache will prevent the repeated call to the function itself, but not to its dependencies. That is, it may still be useful for, e.g., a unary function that takes a filename as input, but not for avoiding computation of an entire expensive branch of the task tree.
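The limitation can be sketched without Sciline. In the example below, `fake_compute` is a hypothetical stand-in for a pipeline run that first resolves the dependency and then calls the cached provider; `get_filename`, `load`, and `calls` are likewise illustrative names.

```python
import functools

calls = {"get_filename": 0, "load": 0}

def get_filename() -> str:
    """Upstream dependency; not cached."""
    calls["get_filename"] += 1
    return "data.h5"

@functools.lru_cache(maxsize=None)
def load(filename: str) -> str:
    """Cached unary provider keyed on its argument."""
    calls["load"] += 1
    return f"contents of {filename}"

def fake_compute() -> str:
    # Stand-in for a pipeline run: the dependency is evaluated first,
    # then its result is passed to the provider.
    return load(get_filename())

fake_compute()
fake_compute()
```

After two runs, `load` has executed once, but `get_filename` has executed twice: the cache is only consulted after the argument has already been computed.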

For now, I would suggest that we:

Add unit tests for the cases discussed above.
Add a documentation page with examples.
Defer the more complex problem of pruning a branch if the node leading to it is cached. We should consider this in the future, but only once we have data supporting the case that it is useful and required.

Caching (docs / examples / tests) #30