pydiverse / pydiverse.pipedag

A data pipeline orchestration library for rapid iterative development with automatic cache invalidation allowing users to focus writing their tasks in pandas, polars, sqlalchemy, ibis, and alike.
https://pydiversepipedag.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
12 stars 2 forks source link

feat(): Added ignore_position_hash option for SQLTableStore #202

Open DominikZuercherQC opened 4 weeks ago

DominikZuercherQC commented 4 weeks ago

Added ignore_position_hashes option to SQLTableStore. If True, the position hashes of tasks are not checked when retrieving the inputs of a task from the cache. This can prevent caching errors when evaluating subgraphs. For this to work a task may never be used more than once per stage.

Checklist

windiana42 commented 1 week ago

@DominikZuercherQC I cannot strictly say it is wrong, but I think SQLTableStore is not the correct level to configure whether to ignore position hash. SQLTableStore manages metadata of caching, but conceptionally, caching is a feature of pipedag orchestration itself.

To find the correct place of configuration, I have a few questions: 1) Do you really think disable_position_hash is a feature that people like to configure per instance? 2) Ignoring position hash is a dangerous feature that can lead to invalid cache recovery. Thus, I rather see it as a debugging mode and a big warning should inform the user that he is at his own risk. Would it work to only enable it as option to Flow.run()? It mainly makes sense in combination with running subflows which are also configured there.

DominikZuercherQC commented 6 days ago

Do you really think disable_position_hash is a feature that people like to configure per instance?

No, I agree that this is probably more of a global choice. So no need to make this configurable per instance.

Ignoring position hash is a dangerous feature that can lead to invalid cache recovery. Thus, I rather see it as a debugging mode and a big warning should inform the user that he is at his own risk. Would it work to only enable it as option to Flow.run()? It mainly makes sense in combination with running subflows which are also configured there.

In our case the position_hash regularly causes issues when trying to rerun subflows. So I would actually like to disable it permanently in our pipeline just to make development easier. I do not see that much potential for danger here as long as one does not use two tasks with the same name in a stage (could we not even add a check for this actually?) or am I missing something. However, I would be totally fine to only have this as an option for Flow.run().