pydiverse / pydiverse.pipedag

A data pipeline orchestration library for rapid iterative development with automatic cache invalidation allowing users to focus writing their tasks in pandas, polars, sqlalchemy, ibis, and alike.
https://pydiversepipedag.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
19 stars 3 forks source link

Propose redesign of cache validation options in configuration #170

Closed windiana42 closed 6 months ago

windiana42 commented 6 months ago

Checklist

windiana42 commented 6 months ago

This is an alternative proposal for #166

windiana42 commented 6 months ago

@nicolasmueller @NMAC427 what do you think about the following configuration options for cache validation. In my opinion everything that one could wish for should be choosable this way:

cache_validation
: See [](#section-cache_validation). *Optional*

table_store
: See [](#section-table_store). *Required*
...

(section-cache_validation)=
### Cache Validation options

mode
: Choose a mode of cache invalidation. 

  Supported values:
  - `normal`: Normal cache invalidation.
  - `assert_no_fresh_input`: Same as `ignore_fresh_input` and additionally fail if tasks having
      a cache function would still be executed (change in version or lazy query).
  - `ignore_fresh_input`: Ignore the output of cache functions that help determine the availability of fresh input.
        With `disable_cache_function=False`, it still calls cache functions, so cache invalidation works interchangeably 
        between between `ignore_fresh_input` and `normal`.
  - `force_fresh_input`: Consider all cache function outputs as different and thus make source tasks cache invalid.
  - `force_caches_invalid`: Disable caching and thus force all tasks as cache invalid.
        This option implies `force_fresh_input`.

  (default: `normal`) 

disable_cache_function
: When set to `True`, cache functions are not called. This is not compatible with `mode=normal`. The difference to 
`ignore_fresh_input` is that in case mode is set back to `normal`, the cache becomes invalid if disable_cache_function 
was set to `True` during last run.

  (default: `False`)

ignore_task_version
: When set to `True`, tasks that specify an explicit version for cache invalidation will always be considered cache invalid. 
  This might be useful for instances with short execution time during rapid development cycles when manually bumping version numbers becomes cumbersome.

  (default: `False`)
windiana42 commented 6 months ago

This is what I propose for Flow.run():

    def run(
        self,
        *components: Task | TaskGetItem | Stage,
        config: ConfigContext = None,
        orchestration_engine: OrchestrationEngine = None,
        fail_fast: bool | None = None,
        cache_validation_mode: CacheValidationMode | None = None,
        disable_cache_function: bool | None = None,
        ignore_task_version: bool | None = None,
        **kwargs,
    ) -> Result:
windiana42 commented 6 months ago

everything that one could wish for should be choosable this way

Close but not completely true. I coupled the execution of the auto_version determination run also to the disable_cache_function flag. It behaves very similar in the sense that disable_cache_function=True executes less but prevents cache valid run on next invocation with mode=NORMAL.

windiana42 commented 6 months ago

Tests got massively slower by this PR. I think the combinatorial tests are still valuable. So I suggest to keep them for now. However, a future PR might simply deactivate them or move them to nightly tests only.