pydiverse / pydiverse.pipedag

A data pipeline orchestration library for rapid iterative development with automatic cache invalidation allowing users to focus writing their tasks in pandas, polars, sqlalchemy, ibis, and alike.
https://pydiversepipedag.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
15 stars 2 forks source link

Copy task inputs to protect against modifications before execution time #137

Open nicolasmueller opened 6 months ago

nicolasmueller commented 6 months ago

Currently task inputs may me modified after the task has been declared. This leads to the unintuitive situation that a task may receive different arguments at runtime than at declaration time. In particular constructions like

a = inputs()
tables: dict[str, Table] = {"dfA": a}
a2 = double_a(tables)
tables["dfA2"] = a2
halfen_tables(tables)

crash at flow runtime, because double_a at runtime depends on the last version of tables which contains "dfA2" already.

Note that a fix for this breaks test_change_bound_argument.

Is the current behavior intended?

Checklist

NMAC427 commented 6 months ago

I think I might have written the test_change_bound_argument test case without properly thinking about what the actual intended behaviour should be. I agree that modifying the arguments after binding them to a task shouldn't have an influence on the task outcome. While I initially had some concerns about the use of deepcopy, I think that it shouldn't lead to any issues, because tasks only support a limited set of input types.

NicolasMuellerQC commented 6 months ago

@NicholasHoernleQC what's your opinion on this :) , i.e. should we copy and freeze task inputs at declare time?

windiana42 commented 5 months ago

I think we can go for the "safe" solution to deep_copy the inputs even though I recommend noone to write code where this matters. Performance slowdown should be negligible compared with what computations we expect to be performed in pipedag. Thus modifying the test is fine.