pydiverse / pydiverse.pipedag

A data pipeline orchestration library for rapid iterative development with automatic cache invalidation, allowing users to focus on writing their tasks in pandas, polars, sqlalchemy, ibis, and the like.
https://pydiversepipedag.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Create "virtual" stacktrace which combines wiring time and task execution time #115

Open windiana42 opened 11 months ago

windiana42 commented 11 months ago

Some errors are trivial, such as a materialize task that takes 2 arguments but is called with 3. Such errors would be much easier to diagnose if we provided an alternative stack trace that links wiring time and task execution time.

All we need to do is store the stack trace for each task call during wiring and link it in a smart way with the error stack traces. How best to present both stack traces to users is a research task.
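
To make the idea concrete, here is a minimal sketch using only the standard library. It is not pipedag code: `WiredCall`, `join_tables`, and `build_flow` are hypothetical names invented for illustration. The sketch records the call stack when a task call is wired and, if the task fails at execution time, re-raises with the wiring location appended to the message while chaining the original exception.

```python
import traceback


class WiredCall:
    """Hypothetical stand-in for a task call recorded at wiring time."""

    def __init__(self, fn, args, kwargs):
        self.fn = fn
        self.args = args
        self.kwargs = kwargs
        # Capture where the task was wired. The last frame is this __init__,
        # so drop it to make the trace end at the caller's line.
        self.wiring_stack = traceback.extract_stack()[:-1]

    def run(self):
        """Execute the task; on failure, point back to the wiring location."""
        try:
            return self.fn(*self.args, **self.kwargs)
        except Exception as exc:
            wiring = "".join(traceback.format_list(self.wiring_stack))
            # "from exc" keeps the execution-time traceback as the cause.
            raise RuntimeError(
                f"Task {self.fn.__name__!r} failed: {exc}\n"
                f"Task was wired at (most recent call last):\n{wiring}"
            ) from exc


def join_tables(a, b):
    # Hypothetical task that takes exactly two arguments.
    return a + b


def build_flow():
    # Wiring mistake: three arguments for a two-argument task.
    return WiredCall(join_tables, (1, 2, 3), {})
```

Calling `build_flow().run()` raises a `RuntimeError` whose message names `build_flow` as the wiring location, and whose `__cause__` is the original `TypeError` from execution time. Whether to merge both traces into one message or to print them separately is exactly the presentation question raised above.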

mohahf19 commented 8 months ago
windiana42 commented 5 months ago

The general idea would be to make logging more configurable with this PR. I also see issues with the current setting on GitHub Actions. The logged stack traces are very verbose (and become especially long when the output width is narrow):

│ /home/runner/work/pydiverse.pipedag/pydiverse.pipedag/src/pydiverse/pipedag/materialize/core.py: │
│ 599 in __call__                                                                                  │
│                                                                                                  │
│   596 │   │   │   │   task, bound.args, bound.kwargs                                             │
│   597 │   │   │   )                                                                              │
│   598 │   │   │                                                                                  │
│ ❱ 599 │   │   │   result = self.fn(*args, **kwargs)                                              │
│   600 │   │   │   if task.debug_tainted:                                                         │
│   601 │   │   │   │   raise RuntimeError(                                                        │
│   602 │   │   │   │   │   f"The task {task.name} has been tainted by interactive debugging."     │
│
This step has been truncated due to its large size. Download the full logs from the
menu once the workflow run has completed. 

I would like to keep printing all rendered queries, though. This is very efficient when analyzing problems. Since we leave intermediate outputs (caches) lying around, we like to quickly grab the rendered query for every transformation between two tables. A solution might be to redirect this information to some place where it can easily be looked up. It might even be an interesting idea to keep it in the database, so that an IDE plugin could retrieve the query that generated a table just by clicking on the table. This would be a separate issue, but I would like to reference the idea here because it ties into logging configurability and the tuning of its default settings.
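
One way to redirect rendered queries to a place where they can be looked up is a dedicated logger with its own file handler. The following is a minimal sketch with the standard `logging` module only. The logger name `example.rendered_queries` and the helpers `configure_query_log` and `log_query` are assumptions made up for illustration and do not reflect how pipedag actually emits its logs.

```python
import logging


def configure_query_log(path, logger_name="example.rendered_queries"):
    """Send records of one (hypothetical) logger to a separate file."""
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.INFO)
    # Do not forward to the root logger, so the console/CI log stays short.
    logger.propagate = False
    handler = logging.FileHandler(path, encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logger.addHandler(handler)
    return logger


def log_query(logger, table, sql):
    """Record the rendered query together with the table it produced."""
    logger.info("table=%s\n%s", table, sql)
```

With this setup, searching the file for `table=<name>` finds the query that produced a given table, while the main log keeps only the shorter messages. Storing the same pair in a database table instead of a file would be a variation of `log_query`.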