pydiverse / pydiverse.pipedag

A data pipeline orchestration library for rapid iterative development with automatic cache invalidation, allowing users to focus on writing their tasks in pandas, polars, sqlalchemy, ibis, and the like.
https://pydiversepipedag.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Optimize Log initialization for parallel execution and prefect integration #91

Open windiana42 opened 1 year ago

windiana42 commented 1 year ago

Log initialization is a tricky subject. The logging library does not provide satisfactory guidance for various software engineering use cases. As a result, everyone hacks around this gap, and those hacks can interact badly with each other.
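To make the "hacks can interact badly" point concrete, here is a minimal sketch of one common defensive pattern, assuming the logging library in question is Python's standard `logging` module. The names `example_lib` and `init_logging_if_unconfigured` are hypothetical and not part of pipedag. The idea is that library code attaches only a `NullHandler`, and the entry point installs a handler only if nobody else has configured the target logger yet.

```python
import logging

# Hypothetical library-side logger: attaching only a NullHandler means that
# importing the library never changes the application's logging setup.
lib_logger = logging.getLogger("example_lib")
lib_logger.addHandler(logging.NullHandler())


def init_logging_if_unconfigured(logger=None, level=logging.INFO):
    """Install a stream handler only if the logger has no handlers yet.

    `logger` defaults to the root logger. Returns True if this call
    installed a handler, False if an existing configuration was kept.
    """
    target = logger if logger is not None else logging.getLogger()
    if target.handlers:
        # Someone else (an orchestrator, a test runner, the application)
        # already configured this logger; leave its handlers untouched.
        return False
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter(
            "%(asctime)s %(processName)s %(name)s %(levelname)s %(message)s"
        )
    )
    target.addHandler(handler)
    target.setLevel(level)
    return True
```

Calling `init_logging_if_unconfigured()` as late as possible, for example right before a flow starts running, gives other frameworks the chance to configure logging first. Including `%(processName)s` in the format is just one way to tell worker processes apart in parallel runs. Whether this is enough for the prefect case is an open question.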

Ideal world:

Late initialization:

Problems for log initialization that need solving:

windiana42 commented 1 year ago

@NMAC427 I am currently not sure how much effort we want to invest in prefect. It is used in some projects today, but I couldn't find a way to connect runs that are not triggered by an explicit prefect process to the run database that the UI picks up. In addition, prefect 2 seems to be heading in a direction (async/await) that differs from pipedag, so our interests in upcoming features might not be aligned. You can see this in the radar diagram used to display tasks, which looks very ugly when the data pipeline is in fact fairly linear with some branching out in each pipeline stage: https://www.prefect.io/guide/blog/introducing-radar/ However, prefect 2.10 now supports more DAG-like diagrams again: https://docs.prefect.io/2.10.17/

So the investigation of alternative DAG orchestration UIs might influence the importance of this issue. If it is easy to fix prefect compatibility, it might be a no-brainer to do it since we already support it as a backend.