Create "virtual" stacktrace which combines wiring time and task execution time

windiana42 commented 11 months ago

Some errors are trivial, such as a materialize task that takes 2 arguments but is called with 3. Such errors would be much easier to spot if we provided an alternative stack trace that links wiring time and task execution time.

All we need to do is store the stack trace for each task call during wiring and link it in a smart way with the error stack traces. Finding out how best to present both stack traces to users is a research task.
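A minimal sketch of the idea, to make the discussion concrete. `Task`, `TaskCall`, and `TaskExecutionError` are hypothetical stand-ins invented for illustration and are not the pipedag classes. It assumes that calling a task at wiring time only records the call, and that the function runs later:

```python
import traceback


class TaskExecutionError(RuntimeError):
    """Hypothetical error type that carries the wiring-time stack in its message."""


class Task:
    """Hypothetical stand-in for a task wrapper; not the real pipedag class."""

    def __init__(self, fn):
        self.fn = fn
        self.name = fn.__name__

    def __call__(self, *args, **kwargs):
        # Wiring time: record where the task was called, but do not run it yet.
        # The last frame is this __call__ itself, so drop it.
        wiring_stack = traceback.extract_stack()[:-1]
        return TaskCall(self, args, kwargs, wiring_stack)


class TaskCall:
    """Hypothetical record of one task call made during wiring."""

    def __init__(self, task, args, kwargs, wiring_stack):
        self.task = task
        self.args = args
        self.kwargs = kwargs
        self.wiring_stack = wiring_stack

    def run(self):
        # Execution time: on failure, re-raise with the wiring-time stack attached.
        try:
            return self.task.fn(*self.args, **self.kwargs)
        except Exception as e:
            wiring = "".join(traceback.format_list(self.wiring_stack))
            raise TaskExecutionError(
                f"Task '{self.task.name}' failed: {e!r}\n"
                f"Task was wired at (most recent call last):\n{wiring}"
            ) from e
```

With this, a task called with one argument too many during wiring fails in `run()` with a message that points back to the wiring call site, and the original `TypeError` stays available as `__cause__`.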

mohahf19 commented 8 months ago

For run-time errors, collect the stack from the declaration and print it.
Maybe add a configuration option to define which stack is printed and how the stack traces are printed.
Maybe revisit the fancy log printing (especially together with the configuration).
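One way the configuration could look, as a rough sketch only. `StackTraceMode` and `format_task_error` are hypothetical names, not existing pipedag options, and `wiring_stack` is assumed to be a list of frames captured with `traceback.extract_stack()` at declaration time:

```python
import traceback
from enum import Enum


class StackTraceMode(Enum):
    """Hypothetical setting that selects which stack is printed on errors."""

    WIRING = "wiring"
    EXECUTION = "execution"
    BOTH = "both"


def format_task_error(exc, wiring_stack, mode=StackTraceMode.BOTH):
    """Render an error report according to the configured mode."""
    parts = []
    if mode in (StackTraceMode.WIRING, StackTraceMode.BOTH):
        parts.append("Task declared at (most recent call last):\n")
        parts.extend(traceback.format_list(wiring_stack))
    if mode in (StackTraceMode.EXECUTION, StackTraceMode.BOTH):
        # Full execution-time traceback, including the exception line.
        parts.extend(traceback.format_exception(type(exc), exc, exc.__traceback__))
    else:
        # Wiring-only mode still shows the exception type and message.
        parts.extend(traceback.format_exception_only(type(exc), exc))
    return "".join(parts)
```

Defaulting to `BOTH` is an arbitrary choice in this sketch; which default serves users best is part of the open question.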

windiana42 commented 5 months ago

The general idea would be to make logging more configurable with this PR. I also see issues with GitHub Actions under the current settings: the logged stack traces are very verbose (and especially long when the output width is narrow):

│ /home/runner/work/pydiverse.pipedag/pydiverse.pipedag/src/pydiverse/pipedag/materialize/core.py: │
│ 599 in __call__                                                                                  │
│                                                                                                  │
│   596 │   │   │   │   task, bound.args, bound.kwargs                                             │
│   597 │   │   │   )                                                                              │
│   598 │   │   │                                                                                  │
│ ❱ 599 │   │   │   result = self.fn(*args, **kwargs)                                              │
│   600 │   │   │   if task.debug_tainted:                                                         │
│   601 │   │   │   │   raise RuntimeError(                                                        │
│   602 │   │   │   │   │   f"The task {task.name} has been tainted by interactive debugging."     │
│
This step has been truncated due to its large size. Download the full logs from the
menu once the workflow run has completed.

I would like to keep printing all rendered queries, though. This is super efficient when analyzing problems. Since we leave intermediate outputs (caches) lying around, we like to quickly grab the rendered query for every transformation between two tables. A solution might be to redirect this information to some place where it can easily be looked up. It might even be interesting to keep it in the database, so that an IDE plugin could retrieve the query that generated a table just by clicking on the table. This would be a separate issue, but I would like to reference the idea here because it ties into logging configurability and the tuning of its default settings.
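To illustrate the lookup idea, here is a rough sketch using SQLite from the standard library. The `rendered_queries` table and the functions `store_query` and `lookup_query` are assumptions made up for this example; nothing like this exists in pipedag as far as this issue states:

```python
import sqlite3


def _ensure_schema(conn):
    # Hypothetical metadata table: one row per generated table.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS rendered_queries ("
        "table_name TEXT PRIMARY KEY, query TEXT NOT NULL)"
    )


def store_query(conn, table_name, query):
    """Remember the rendered query that produced `table_name`."""
    _ensure_schema(conn)
    # INSERT OR REPLACE keeps only the latest query per table.
    conn.execute(
        "INSERT OR REPLACE INTO rendered_queries (table_name, query) VALUES (?, ?)",
        (table_name, query),
    )
    conn.commit()


def lookup_query(conn, table_name):
    """Return the stored query for `table_name`, or None if none is stored."""
    _ensure_schema(conn)
    row = conn.execute(
        "SELECT query FROM rendered_queries WHERE table_name = ?",
        (table_name,),
    ).fetchone()
    return row[0] if row else None
```

A tool such as an IDE plugin would then only need to call something like `lookup_query` with the name of the clicked table.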

pydiverse / pydiverse.pipedag

Create "virtual" stacktrace which combines wiring time and task execution time #115