pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Error Message Improvement: Combine/Collect multiple Errors #10475

Open Julian-J-S opened 1 year ago

Julian-J-S commented 1 year ago

Problem description

Problem

Error messages from parallel executions are not combined; only the first error message is displayed.

Reason

Imagine an ETL workflow where a CSV has 10 columns, 5 of which have corrupt values.

Current procedure:

Instead of:

Example

from io import StringIO

import polars as pl

DATA = '''
ints,floats,dates
1,abc,123
hello,1.23,456
2,4.56,2023-01-01
world,def,2023-02-01
'''

df = pl.read_csv(
    source=StringIO(DATA)
).with_columns(
    # 1
    pl.col('ints').cast(pl.Int64),
    # 2
    pl.col('floats').cast(pl.Float64),
    # 3
    pl.col('dates').cast(pl.Date),
)

The individual error messages are:

But what the user sees is actually only the first:

Desired result

Combine the error messages into one.

reswqa commented 1 year ago

I'm not sure if this is easy to achieve.

When we evaluate multiple expressions, we should transform Vec<Result<Series, Err>> to Result<Vec<Series>, Err>.
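
To make the difference concrete, here is a rough Python sketch of the two ways such a list of results can be collapsed. It is only an illustration, not how Polars does it internally: each result is modelled as either a plain value or an `Exception` instance, and the helper names `first_error` and `all_errors` are made up for this example.

```python
def first_error(results):
    """Fail fast: return (values, None), or (None, first exception found).

    This mirrors what collecting an iterator of Result into
    Result<Vec<_>, _> does in Rust: it stops at the first Err.
    """
    values = []
    for item in results:
        if isinstance(item, Exception):
            return None, item
        values.append(item)
    return values, None


def all_errors(results):
    """Collect everything: return (values, []), or (None, every exception)."""
    errors = [item for item in results if isinstance(item, Exception)]
    if errors:
        return None, errors
    return list(results), []


results = [ValueError("bad int"), 5, KeyError("bad date")]
print(first_error(results))  # only the ValueError is reported
print(all_errors(results))   # both exceptions are reported
```

With the fail-fast variant, which error is reported depends on the order in which results are inspected, whereas the collecting variant has to wait for every expression to finish.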

The final result depends on the way we evaluate it:

ritchie46 commented 1 year ago

I don't think we should do that. I haven't seen an interpreter that keeps running once an error is encountered.

This would also add non-trivial complexity and assume we can always continue after an error. That means we must change assumptions and internal state, making some optimizations impossible.

Julian-J-S commented 1 year ago

Maybe there is a small misunderstanding. Let me explain with an example:

df.with_columns(
    # BLOCK A
    calc1 = ...
    calc2 = ...
).with_columns(
    # BLOCK B
    calc3 = calc1...
    ...
)

As far as I understand, the calculations of "BLOCK A" are independent of each other and run in parallel/concurrently. So it should be possible to gather all results (successes and failures) of that block and display all errors that are present, or otherwise continue?

I do NOT expect the interpreter to run "BLOCK B" after an error in "BLOCK A". I would like to see ALL errors happening in a block ("BLOCK A") and then stop (if any error is present).

Here is a common Python program that runs tasks concurrently and returns ALL results (errors + values):

import asyncio

async def foo():
    raise ValueError("Foo ValueError!!")

async def bar():
    return 5

async def baz():
    raise NotImplementedError()

async def main():
    results = await asyncio.gather(
        foo(),
        bar(),
        baz(),
        return_exceptions=True,
    )

    print(results)
    # [ValueError('Foo ValueError!!'), 5, NotImplementedError()]

asyncio.run(main())
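
The same gather-then-report idea, applied to the CSV example from the issue, could look roughly like the sketch below. It uses plain Python lists and built-in converters as a stand-in for the three independent casts, so it does not depend on Polars and says nothing about Polars internals; `CombinedCastError`, `cast_all`, and the `COLUMNS` layout are hypothetical names introduced only for this illustration.

```python
from datetime import date

# Stand-in for the three independent casts in "BLOCK A":
# column name -> (raw string values, converter to apply).
COLUMNS = {
    "ints": (["1", "hello", "2", "world"], int),
    "floats": (["abc", "1.23", "4.56", "def"], float),
    "dates": (["123", "456", "2023-01-01", "2023-02-01"], date.fromisoformat),
}


class CombinedCastError(Exception):
    """Hypothetical error type that carries one error per failed column."""

    def __init__(self, errors):
        self.errors = errors
        lines = [f"  {name}: {err}" for name, err in errors.items()]
        super().__init__(
            f"{len(errors)} column(s) failed to cast:\n" + "\n".join(lines)
        )


def cast_all(columns):
    """Run every cast, then either return all results or raise one combined error."""
    results, errors = {}, {}
    for name, (values, converter) in columns.items():
        try:
            results[name] = [converter(v) for v in values]
        except ValueError as exc:
            # Record the failure and keep going with the other columns.
            errors[name] = exc
    if errors:
        raise CombinedCastError(errors)
    return results


try:
    cast_all(COLUMNS)
except CombinedCastError as exc:
    print(exc)  # lists a failure for ints, floats and dates in one message
```

Nothing after the block runs once any cast has failed, which matches the "report everything in BLOCK A, then stop" behaviour requested above. Note that each column still reports only its first bad value.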