tmrusec opened 9 months ago
I've thought about this; the problem is that such a raise operator blocks most optimizations. For example, if you had a filter after this condition, we would no longer be able to reorder and place the filter before the computation.
I'm actually not familiar with the optimization priority. I was wondering whether it should be one of the methods of LazyFrame or DataFrame instead of an expression. And what if we still prioritized the filter above the raise method? So it could be:
```python
df = pl.LazyFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [1, 2, 0, 4, 5],
    "C": [1, 2, 0, 4, 5],
})

col_name = "B"

(
    df
    .raise_when(
        pl.col(col_name) == 0,
        ZeroDivisionError,
        f"Column '{col_name}' has a zero.",
    )
    .with_columns(pl.col("A") / pl.col(col_name))
    .filter(pl.col("C") != 0)
    .collect()
)
```
But I don't know whether it would work or improve the overall testing use case.
Nice! I recently had something very similar in mind ;)
I was thinking about an `assert` or `expect` method on DataFrames and Series, as follows:
```python
df.with_columns(
    # ... some calculations ...
).expect(
    pl.col('a') > 0,
    pl.col('b') < 100,
).with_columns(
    # ... some calculations ...
).expect(
    pl.col('c').null_count() == 0,
).with_columns(
    # ... some calculations ...
)
```
Instead of
```python
df1 = df.with_columns(
    # ... some calculations ...
)
assert (df1.get_column('a') <= 0).sum() == 0
assert (df1.get_column('b') >= 100).sum() == 0

df2 = df1.with_columns(
    # ... some calculations ...
)
assert df2.get_column('c').null_count() == 0

df3 = df2.with_columns(
    # ... some calculations ...
)
```
I was also thinking about a config option to specify how to handle assert/expect failures, like `pl.Config.set_assert_mode('fail'|'warn'|'ignore')`:

* `fail`: print/log problems and crash
* `warn`: print/log problems but continue execution
* `ignore`: completely ignore asserts (full optimization possible)

@JulianCologne
Yes, and the purpose of this method is to utilize the query optimization. The position of the `raise_when` or `expect` method in the sequence of operations would be crucial. If placed early in a chain of operations, it could prevent unnecessary computations on data that would eventually trigger an exception. On the other hand, if placed at the end, it would act as a final check after all transformations. The optimizer would need to respect the position of `raise_when` to ensure that exceptions are raised at the expected times.

If `raise_when` were executed immediately, it would break the laziness of the evaluation, as it would require an immediate check on the data. To coexist with lazy evaluation, `raise_when` would need to be integrated into the logical plan. It would represent an operation that, when the plan is executed, checks the data and raises an exception if the condition is met.

The optimizer would need to be aware that any subsequent operations might not be executed if the `raise_when` condition is met. However, some optimizations might still be possible. For instance, if `raise_when` checks a condition on a column that is later filtered out, the check can be moved after the filter operation in the optimized plan.

Say there is a sort operation after `raise_when` and the `raise_when` condition is met: the sort should never be executed. This might mean that `raise_when` acts as a "barrier" to certain optimizations, limiting the reordering of operations around it.

Checking conditions on data can be computationally expensive, especially on large datasets. The optimizer would need to consider the cost of these checks when optimizing the query plan.
I just want to check whether the columns I'm considering are in a one-hot encoding format for this `from_dummies` function I made. If a column value is neither 1 nor 0, it should raise a `ValueError`. And I could do that as simply as using a `raise_when` operation.
```python
def _coalesce_expr(col_value_pairs):
    return pl.coalesce(
        pl.when(pl.col(col) == 1).then(pl.lit(value))
        for col, value in col_value_pairs
    )


def from_dummies(
    df: pl.DataFrame, cols: list[str], separator: str = "_"
) -> pl.DataFrame:
    col_exprs: dict = {}
    for col in cols:
        name, value = col.rsplit(separator, maxsplit=1)
        col_exprs.setdefault(name, []).append((col, value))
    return (
        df
        .raise_when(
            pl.any_horizontal(pl.col(cols).ne(1).and_(pl.col(cols).ne(0))),
            ValueError,
            "Dummy DataFrame contains non-binary value(s)",
        )
        .select(
            pl.all().exclude(cols),
            *[
                _coalesce_expr(exprs).alias(name)
                for name, exprs in col_exprs.items()
            ],
        )
    )
```
> I was also thinking about a config option to specify how to handle assert/expect failures like `pl.Config.set_assert_mode('fail'|'warn'|'ignore')`
> * `fail`: print/log problems and crash
> * `warn`: print/log problems but continue execution
> * `ignore`: completely ignore asserts (full optimization possible)
@JulianCologne: Would your "fail" mode be more of a "fail_eager" or a "fail_lazy" mode? When running checks on your data like this, it's often useful to know just how much of it fails such sanity checks, rather than immediately stopping on the first bad row. I'm thinking something like `500/123456 lines failed the assertion "Column B is an SQL count(*) and can never be negative"`.
> I've thought about this, the problem is that such a raise operator blocks most optimizations. For example, if you had a filter after this condition we would not be able to reorder anymore and place the filter before the computation.
The optimisation-barrier concern does sound reasonable, although most other such issues are usually just listed in the docs as suboptimal for most cases and to be used with care. Adding a similar warning for user errors/assertions would probably be good enough to mitigate that.
Personally, I would go with a list of expressions that evaluate to booleans, similar to Julian's idea, where rows get tagged separately on whether they pass or fail each check, and the tally is reported as part of the exception raised if any of them have failed.
```python
(
    lazy_df
    .operationA()
    .operationB()
    .assert_every(
        (pl.col("foo") > 0, "Foo must be positive"),
        (pl.col("my_list_cnt") == pl.col("my_list").list.len(), "Reported list length must match its actual length"),
        # What to do with None values is an open question
        (pl.when(pl.col("vehicle_type") == "car").then(pl.col("wheel_count") == 4), "All cars must have exactly 4 wheels"),
        ((pl.col("unitv_x") ** 2 + pl.col("unitv_y") ** 2 + pl.col("unitv_z") ** 2).is_between(0.99, 1.01), "Components of unit vector must add up to length 1."),
        (pl.col("sample_count").n_unique() == 1, "sample_count must be uniform for the entire dataframe"),
    )
    .collect()
)
```
@sm-Fifteen So "ignore" would just remove this branch entirely for full-speed optimization. You might use "fail"/"warn" in testing/staging and then switch the config to "ignore" in production if you only require speed (or keep "fail"/"warn" if your workflow is not stable enough). This way, no code change is required other than the Polars config.

One might also consider an additional "threshold" for "fail", specifying a total (e.g. 10_000) or percentage (e.g. 2%) failure rate that is allowed before stopping.

All in all, I think we have a very similar idea of how this feature might work ;)
Wrote #16311 specifically about dropping/forgetting/ignoring values that are used only for assertions.
Shouldn't we rename this issue by now to something more descriptive, like "Introduce a data validation assertion mechanism"? It's getting cross-linked to and from a few other places, and the current title doesn't really reflect what it's about. I was looking for that issue a few weeks back, and only managed to find it because I had commented in it.
Description
Objective
Introduce a new expression, `pl.testing.raise_`, that can be used within Polars operations to raise errors based on specified conditions.
Optimization Strategy
Proposed Syntax
`pl.testing.raise_(ErrorType, "Error Message")`
Example
Imagine you're calculating a derived column based on two other columns. If one of the columns has a zero and might cause a division error, you'd want to catch it.
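Concretely, under the proposed syntax that example might read as follows (illustrative pseudocode only; `pl.testing.raise_` does not exist):

```python
# Proposed usage, not runnable today: raise_ would fire only for rows
# where the when() branch is taken.
df.with_columns(
    pl.when(pl.col("B") == 0)
    .then(pl.testing.raise_(ZeroDivisionError, "Column 'B' has a zero."))
    .otherwise(pl.col("A") / pl.col("B"))
    .alias("ratio")
)
```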
I think this matches well with Polars' vision of being the lower-level library that other libraries can build on top of. Let me know what you think :)