pydiverse / pydiverse.pipedag

A data pipeline orchestration library for rapid iterative development with automatic cache invalidation, allowing users to focus on writing their tasks in pandas, polars, sqlalchemy, ibis, and the like.
https://pydiversepipedag.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Imperative Materialization #164

Closed windiana42 closed 5 months ago

windiana42 commented 6 months ago

In some cases, it might make sense to imperatively trigger materialization in the middle of a task:

import sqlalchemy as sa
import pydiverse.pipedag as dag
from pydiverse.pipedag import materialize

@materialize(lazy=True, input_type=sa.Table)
def task():
    sql = sa.select(sa.literal(1).label("a"))
    # write the table immediately and get back a reference to it
    tbl = dag.Table(sql).materialize()
    sql2 = sa.select(tbl).limit(1)
    return dag.Table(sql2)

At first sight, this looks like a crazy violation of the principles that make automatic cache invalidation work. On the other hand, it can be implemented if we tolerate a bit more magic. In fact, it would be nice if it were even possible to close the task with a materialize: return dag.Table(sql2).materialize(). It would make no difference, except that any errors during materialization would be reported while the task's stack trace is still open => much easier debugging.
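With the imports from the example above, the closing variant would look like this (a sketch of the proposal, not existing API):

@materialize(lazy=True, input_type=sa.Table)
def task():
    sql = sa.select(sa.literal(1).label("a"))
    tbl = dag.Table(sql).materialize()
    sql2 = sa.select(tbl).limit(1)
    # any error while writing this table surfaces here, inside task()
    return dag.Table(sql2).materialize()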

The magic that needs to happen: if multiple dag.Table().materialize() calls occur inside a task, the situation where the first is cache valid and the second is cache invalid will cause trouble. There are multiple ways out of this trouble (sketched below):

  1. We throw an exception on the second, cache-invalid materialize(). This triggers a rerun of the task, in which the first call is executed as well.
  2. The first materialize() returns a reference to the cached object instead of a fake reference.
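A minimal, self-contained sketch of the two escape routes; CacheEntry, RunState, RerunTask, and write_table are illustrative stand-ins, not pipedag API:

from dataclasses import dataclass

class RerunTask(Exception):
    """Signals the orchestrator to discard partial results and rerun the task."""

@dataclass
class CacheEntry:
    valid: bool
    obj: object = None

@dataclass
class RunState:
    saw_cache_valid: bool = False

def write_table(entry: CacheEntry):
    return entry.obj  # pretend to write the table and return a reference

def materialize_option1(entry: CacheEntry, state: RunState):
    # Option 1: throw on a cache-invalid call that follows a cache-valid
    # one; the triggered rerun then executes both calls for real.
    if entry.valid:
        state.saw_cache_valid = True
        return object()  # a fake reference suffices here
    if state.saw_cache_valid:
        raise RerunTask("earlier materialize() was served from cache")
    return write_table(entry)

def materialize_option2(entry: CacheEntry):
    # Option 2: a cache hit returns a reference to the cached object,
    # so mixed cache validity within one task is harmless.
    if entry.valid:
        return entry.obj
    return write_table(entry)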

Conceptual ideas for the magic:

windiana42 commented 6 months ago

As I am writing this, I am favoring the solution:

  2. The first materialize() returns a reference to the cached object instead of a fake reference.

I think there is not much that has to change for this to happen, not even in the cache invalidation logic. The knowledge of which table depends on which other one (through subsequent materialize() calls) is available at runtime from the reproduced sequence of materialize() calls and does not need to be represented in metadata tables.

windiana42 commented 6 months ago

For the implementation of option 2, a few detailed decisions are necessary:

  1. We support calling tasks without an open RunContext for dataframe tasks that simply take a dataframe and return one. In general, we would like return dag.Table(...) to be identical to return dag.Table(...).materialize(). Thus it is unclear what Table.materialize() should do when no RunContext is open. One idea would be to do nothing and simply return Table.obj. That is quite different from what happens with a RunContext open, but usability is probably maximized by this measure.
  2. With a RunContext open, it would be nice to offer three options for what Table.materialize() returns: a) a reference with the exact same type as the input_type of the task; b) an explicitly given return_as_type object; c) None, without dematerializing after writing the table. There are plenty of ways to design the call interface. One option is Table.materialize(config_context: ConfigContext | None = None, return_as_type=None, return_nothing: bool = False, drop_if_exists: bool = False). I suggest interpreting return_as_type=None as: take input_type from the task if the respective TableHook returns a reference (TableHooks need to declare themselves what they do). If the input_type TableHook does not return a reference, I suggest returning a sqlalchemy.Table reference by default. See the sketch after this list.
  3. With Table.materialize(drop_if_exists=True), we don't really need the debug.materialize_table function any more. I would still keep it for more control over options (debug_suffix, flag_task_debug_tainted); the implementation of both can be shared, though. I would also set table.debug_tainted=True in Table.materialize() if it is executed multiple times for the same table object.
  4. We assume that every Table.materialize() is a dependency of every subsequent Table.materialize() call and of any Table returned by the task. This can be reflected with a private Table property that is also JSON-serialized and thus affects cache invalidation: Table._assumed_dependencies = [Table(...), Table(...), ...]
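Pulling these decisions together, a rough sketch of how Table.materialize() could behave; _RunContext, _Hook, and the store accessors are simplified stand-ins rather than actual pipedag internals:

import sqlalchemy as sa

class _Hook:
    # Stand-in for a TableHook that declares whether it returns a reference.
    returns_reference = True

    def retrieve(self, table, as_type):
        return table.obj  # pretend dematerialization into as_type

class _RunContext:
    # Stand-in for pipedag's RunContext; only what the sketch needs.
    current = None

    def __init__(self, task_input_type=sa.Table):
        self.task_input_type = task_input_type
        self.tables_materialized_in_task = []

    def get_table_hook(self, as_type):
        return _Hook()

    def store_table(self, table, drop_if_exists=False):
        pass  # pretend to write the table to the schema

class Table:
    def __init__(self, obj, name=None):
        self.obj = obj
        self.name = name
        self.debug_tainted = False
        self._materialized = False
        self._assumed_dependencies = []  # JSON-serialized => affects cache invalidation

    def materialize(self, config_context=None, return_as_type=None,
                    return_nothing: bool = False, drop_if_exists: bool = False):
        # config_context follows the signature proposed in decision 2.
        ctx = _RunContext.current
        if ctx is None:
            # Decision 1: without an open RunContext, do nothing and
            # simply hand back the wrapped object.
            return self.obj
        if self._materialized:
            # Decision 3: materializing the same object twice is a
            # debugging maneuver; taint the table.
            self.debug_tainted = True
        # Decision 4: every earlier materialize() in this task is assumed
        # to be a dependency of this one.
        self._assumed_dependencies = list(ctx.tables_materialized_in_task)
        ctx.store_table(self, drop_if_exists=drop_if_exists)
        ctx.tables_materialized_in_task.append(self)
        self._materialized = True
        if return_nothing:
            return None  # Decision 2c: write only, skip dematerialization
        # Decisions 2a/2b: pick the reference type to hand back.
        as_type = return_as_type or ctx.task_input_type
        hook = ctx.get_table_hook(as_type)
        if return_as_type is None and not hook.returns_reference:
            # Fall back to a sqlalchemy.Table reference if the task's
            # input_type hook would copy data instead of referencing it.
            as_type = sa.Table
            hook = ctx.get_table_hook(as_type)
        return hook.retrieve(self, as_type)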