pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Feature Proposal: Introducing `pl.testing.raise_` in Polars #11064

Open tmrusec opened 9 months ago

tmrusec commented 9 months ago

Description

Objective

Introduce a new expression, pl.testing.raise_, that can be used within Polars operations to raise errors based on specified conditions.

Optimization Strategy

Proposed Syntax

pl.testing.raise_(ErrorType, "Error Message")

Example

Imagine you're calculating a derived column based on two other columns. If one of the columns contains a zero that would cause a division error, you'd want to catch it.

col_name = "B"
df.with_columns(
    pl.when(pl.col(col_name) == 0)
    .then(pl.testing.raise_(ZeroDivisionError, f"Column '{col_name}' has a zero."))
    .otherwise(pl.col("A") / pl.col(col_name))
)
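
For comparison, roughly the same guard can be expressed today with an eager, up-front check; the sketch below is only illustrative (the example frame and the "ratio" alias are made up) and loses the main benefit of the proposal, namely having the check live inside the query itself:

import polars as pl

df = pl.DataFrame({"A": [1, 2, 3], "B": [1, 2, 0]})
col_name = "B"

# Eager check outside the query: runs immediately, before the computation.
if (df.get_column(col_name) == 0).any():
    raise ZeroDivisionError(f"Column '{col_name}' has a zero.")

df = df.with_columns((pl.col("A") / pl.col(col_name)).alias("ratio"))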

I think this matches well with Polars' vision of being the lower-level library that other libraries can build on top of. Let me know what you think :)

orlp commented 9 months ago

I've thought about this; the problem is that such a raise operator blocks most optimizations. For example, if you had a filter after this condition, we would no longer be able to reorder and place the filter before the computation.

tmrusec commented 9 months ago

I'm actually not familiar with how the optimizer prioritizes operations.

I was wondering whether it should be a method on LazyFrame or DataFrame instead of an expression. And what if we still prioritize the filter above the raise method?

So it can be:

df = pl.LazyFrame({
  "A": [1,2,3,4,5],
  "B": [1,2,0,4,5],
  "C": [1,2,0,4,5],
})

col_name = "B"

(
    df
    .raise_when(
        pl.col(col_name) == 0,
        ZeroDivisionError,
        f"Column '{col_name}' has a zero.",
    )
    .with_columns(
      pl.col("A") / pl.col(col_name)
    )
    .filter(pl.col("C") != 0)
    .collect()
)

But I don't know whether it would actually work or improve the overall testing use case.

JulianCologne commented 9 months ago

Nice! I recently had something very similar in mind ;)

I was thinking about an assert or expect method on DataFrames and Series as follows:

df.with_columns(
    # ... some calculations ...
).expect(
    pl.col('a') > 0,
    pl.col('b') < 100,
).with_columns(
    # ... some calculations ...
).expect(
    pl.col('c').null_count() == 0,
).with_columns(
    # ... some calculations ...
)

Instead of

df1 = df.with_columns(
    # ... some calculations ...
)

assert (df1.get_column('a') <= 0).sum() == 0
assert (df1.get_column('b') >= 100).sum() == 0

df2 = df1.with_columns(
    # ... some calculations ...
)

assert df2.get_column('c').null_count() == 0

df3 = df2.with_columns(
    # ... some calculations ...
)

I was also thinking about a config option to specify how to handle assert/expect failures like pl.Config.set_assert_mode('fail'|'warn'|'ignore')
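
Until something like this exists, a rough approximation of the proposed expect is possible with an ordinary helper chained via .pipe(); the sketch below is purely hypothetical (the helper name, the use of AssertionError, and the message format are made up) and only covers the "fail" behaviour:

import polars as pl

def expect(df: pl.DataFrame, *checks: pl.Expr) -> pl.DataFrame:
    # Count the rows failing each boolean expression in a single select,
    # then raise on the first check that has any failures.
    counts = df.select((~check).sum().alias(str(i)) for i, check in enumerate(checks))
    for i, check in enumerate(checks):
        n_failed = counts[0, str(i)]
        if n_failed:
            raise AssertionError(f"{n_failed} row(s) failed check: {check}")
    return df

# Chains like the proposed method:
# df.with_columns(...).pipe(expect, pl.col("a") > 0, pl.col("b") < 100).with_columns(...)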

tmrusec commented 9 months ago

@JulianCologne Yes, and the purpose of this method is to take advantage of query optimization. The position of the raise_when or expect method in the sequence of operations would be crucial. If placed early in a chain of operations, it could prevent unnecessary computations on data that would eventually trigger an exception. On the other hand, if placed at the end, it would act as a final check after all transformations. The optimizer would need to respect the position of raise_when to ensure that exceptions are raised at the expected times.

Immediate Evaluation vs Lazy Evaluation

If raise_when were to be executed immediately, it would break the laziness of the evaluation, as it would require an immediate check on the data. To coexist with lazy evaluation, raise_when would need to be integrated into the logical plan. It would represent an operation that, when the plan is executed, checks the data and raises an exception if the condition is met.
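
For what it's worth, a check can already be embedded in a lazy plan today by mapping a plain Python function over a column with map_batches; the sketch below is a rough stand-in for the idea (the frame and column names are made up, and the error will typically surface wrapped in a ComputeError when the plan is collected):

import polars as pl

def _no_zeros(s: pl.Series) -> pl.Series:
    # Runs when the plan is executed, not when it is built.
    if (s == 0).any():
        raise ZeroDivisionError(f"Column '{s.name}' has a zero.")
    return s

lf = pl.LazyFrame({"A": [1, 2, 3], "B": [1, 2, 0]})
out = (
    lf.with_columns(pl.col("B").map_batches(_no_zeros))
    .with_columns((pl.col("A") / pl.col("B")).alias("ratio"))
    .collect()  # the check runs (and raises) here
)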

Optimization Considerations

The optimizer would need to be aware that any subsequent operations might not be executed if the raise_when condition is met. However, some optimizations might still be possible. For instance, if raise_when is checking a condition on a column that is later filtered out, the check can be moved after the filter operation in the optimized plan.

Interaction with Other Operations

Say there's a sort operation after raise_when; if the raise_when condition is met, the sort operation should never be executed. This might mean that raise_when acts as a "barrier" to certain optimizations, limiting the reordering of operations around it.

Performance Implications

Checking conditions on data can be computationally expensive, especially on large datasets. The optimizer would need to consider the cost of these checks when optimizing the query plan.

Consider the following example

I just want a simple way to check whether the columns I'm considering are in a one-hot encoding format for this from_dummies function I made. If a column value is neither 1 nor 0, it should raise a ValueError, and I can do that as simply as using a raise_when operation.

def _coalesce_expr(col_value_pairs):
    return pl.coalesce(
        pl.when(pl.col(col) == 1).then(pl.lit(value))
        for col, value in col_value_pairs
    )

def from_dummies(
    df: pl.DataFrame, cols: list[str], separator: str = "_"
) -> pl.DataFrame:
    col_exprs: dict = {}

    for col in cols:
        name, value = col.rsplit(separator, maxsplit=1)
        col_exprs.setdefault(name, []).append((col, value))

    return (
        df
        .raise_when(
            pl.any_horizontal(pl.col(cols).ne(1).and_(pl.col(cols).ne(0))),
            ValueError,
            "Dummy DataFrame contains multi-assignment(s)"
        )
        .select(
            pl.all().exclude(cols),
            *[
                _coalesce_expr(exprs).alias(name)
                for name, exprs in col_exprs.items()
            ],
        )
    )
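
For illustration, this is how the function above would be used, assuming the proposed raise_when existed (the example frame is made up):

dummies = pl.DataFrame({
    "id": [1, 2, 3],
    "color_red": [1, 0, 0],
    "color_blue": [0, 1, 1],
})

from_dummies(dummies, cols=["color_red", "color_blue"])
# -> columns "id" and "color" with values ["red", "blue", "blue"];
#    raises ValueError if any dummy value is neither 0 nor 1.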

sm-Fifteen commented 7 months ago

I was also thinking about a config option to specify how to handle assert/expect failures like pl.Config.set_assert_mode('fail'|'warn'|'ignore')

* `fail`: print/log problems and crash

* `warn`: print/log problems but continue execution

* `ignore`: completely ignore asserts (full optimization possible)

@JulianCologne: Would your "fail" mode be more of a "fail_eager" or a "fail_lazy" mode? When running checks on your data like this, it's often useful to know just how much of it fails such sanity checks, rather than immediately stopping on the first bad row. I'm thinking something like 500/123456 lines failed the assertion "Column B is an SQL count(*) and can never be zero".

I've thought about this; the problem is that such a raise operator blocks most optimizations. For example, if you had a filter after this condition, we would no longer be able to reorder and place the filter before the computation.

The optimisation barrier concern does sound reasonable, although most other such issues are usually just listed in the docs as suboptimal for most cases and to be used with care. Adding a similar warning to user errors/assertions would probably be good enough to mitigate that.


Personally I would go with a list of expressions that evaluate to booleans, similar to Julian's idea, where rows are tagged separately on whether they pass or fail each check, and the tally is reported as part of the exception raised if any of them have failed.

lazy_df
    .operationA()
    .operationB()
    .assert_every(
        (pl.col("foo") > 0, "Foo must be positive"),
        (pl.col("my_list_cnt") == pl.col("my_list").list.len(), "Reported list length must match its actual length"),
        # What to do with None values is an open question
        (pl.when(pl.col("vehicle_type") == "car").then(pl.col("wheel_count") == 4), "All cars must have exactly 4 wheels"),
        ((pl.col("unitv_x") ** 2 + pl.col("unitv_y") ** 2 + pl.col("unitv_z") ** 2).is_between(0.99, 1.01), "Components of unit vector must add up to length 1."),
        (pl.col("sample_count").n_unique() == 1, "sample_count must be uniform for the entire dataframe")
    )
    .collect()
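
None of assert_every exists today, but the per-check tally itself is just a selection; here is a minimal sketch of that part with current Polars (the frame, the checks, and the use of AssertionError are made up):

import polars as pl

df = pl.DataFrame({"foo": [1, -2, 3, 0], "bar": [10, 20, 30, 40]})

checks = [
    (pl.col("foo") > 0, "foo must be positive"),
    (pl.col("bar") < 100, "bar must be below 100"),
]

# One pass over the data: count how many rows fail each check.
tallies = df.select((~expr).sum().alias(msg) for expr, msg in checks)
failed = {msg: tallies[0, msg] for _, msg in checks}
if any(failed.values()):
    raise AssertionError(f"{failed} out of {df.height} rows failed their assertions")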

JulianCologne commented 7 months ago

@sm-Fifteen so "ignore" would just remove the assertion entirely, allowing full-speed optimization. You might use "fail"/"warn" in testing/staging and then switch the config to "ignore" in production if you only need speed (or keep "fail"/"warn" if your workflow is not stable enough). This way no code change is required other than the Polars config.

One might also consider an additional "threshold" for "fail", specifying an absolute (e.g. 10_000) or percentage (e.g. 2%) failure rate that is allowed before stopping.
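
A minimal sketch of such a threshold with what exists today (the column name, the cap, and the error type are made up):

import polars as pl

df = pl.DataFrame({"a": [1, -1, 2, -3]})

threshold = 1  # absolute cap; use e.g. 0.02 * df.height for a percentage-based cap
n_failed = df.select((~(pl.col("a") > 0)).sum()).item()
if n_failed > threshold:
    raise AssertionError(f"{n_failed} rows failed 'a > 0' (allowed: {threshold})")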

All in all I think we have a very similar idea of how this feature might work ;)

KDruzhkin commented 1 month ago

Wrote #16311 specifically about dropping/forgetting/ignoring values that are used only for assertions.

sm-Fifteen commented 3 weeks ago

Shouldn't we rename this issue by now to something more descriptive, like "Introduce a data validation assertion mechanism"? It's getting cross-linked to and from a few other places, and the current title doesn't really reflect what it's about. I was looking for that issue a few weeks back, and only managed to find it because I had commented in it.