pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Ordered Context Expressions #15292

Open OyiboRivers opened 1 month ago

OyiboRivers commented 1 month ago

Description

Hey Polars team,

It's sometimes important to ensure that expressions are evaluated in a particular order to maintain data integrity. However, there is no built-in support for ordered context expressions, so users currently fall back on for loops to enforce the desired evaluation order manually.

For example, when expressions depend on the results of earlier computations, there is no mechanism to specify that those dependencies must be resolved first; the only option is to iterate over the expressions in the required order.

Here is an example where symbols are calculated from formulas that may depend on previously evaluated symbols.

import polars as pl

data = pl.DataFrame({
    'A': [1, 2, 3],
    'B': [10, 20, 30],
    'C': [100, 200, 300],
}).cast(pl.Float64)

calc = pl.DataFrame([
    {'symbol': 'AB', 'formula': 'A/B', 'order': 1},
    {'symbol': 'AC', 'formula': 'A/C', 'order': 1},
    {'symbol': 'ABAC', 'formula': 'AB/AC', 'order': 2},
])

grouped = calc.filter(pl.col('order') > 0).group_by(['order'], maintain_order=True)

for name, group in grouped:
    iterator = group.select(['symbol', 'formula']).rows()
    sql_expr = [f"{formula} AS {symbol}" for symbol, formula in iterator]
    data = data.with_columns(*pl.sql_expr(sql_expr))
print(data)
# ┌─────┬──────┬───────┬─────┬──────┬──────┐
# │ A   ┆ B    ┆ C     ┆ AB  ┆ AC   ┆ ABAC │
# ╞═════╪══════╪═══════╪═════╪══════╪══════╡
# │ 1.0 ┆ 10.0 ┆ 100.0 ┆ 0.1 ┆ 0.01 ┆ 10.0 │
# │ 2.0 ┆ 20.0 ┆ 200.0 ┆ 0.1 ┆ 0.01 ┆ 10.0 │
# │ 3.0 ┆ 30.0 ┆ 300.0 ┆ 0.1 ┆ 0.01 ┆ 10.0 │
# └─────┴──────┴───────┴─────┴──────┴──────┘

Is it possible/practical to implement a group-wise or ordered context for expressions?

Kind regards, Oyibo

mcrumiller commented 1 month ago

If you have expressions that depend on prior expressions, you have two options:

  1. Use iterative with_columns:

    data = data.with_columns(expr_set1).with_columns(expr_set2).with_columns(expr_set3)...
  2. "Re-do" the computation with each successive expression. Note that due to common subexpression elimination (if you're doing this lazily), you will not pay the penalty of recomputing each time:

    
    expr1 = pl.col("a") + pl.col("b")
    expr2 = expr1 / 2
    expr3 = expr2 ** 2
    data = data.lazy().with_columns(
        expr1,  # expr1 appears three times and expr2 twice across
        expr2,  # these expressions, but CSE computes each only once
        expr3,
    ).collect()
OyiboRivers commented 1 month ago

Hi @mcrumiller,

I appreciate your help and the time you've taken to answer my request. Your solutions are valid and solve the provided example. However, I had something different in mind.

Imagine we need to generate a dataset from base/raw data and want to insert derived data that must be computed in a specific sequence. This dataset follows a hierarchical tree structure of relationships, where each node represents a series or dataframe. The number of nodes can vary, from just a few to potentially thousands. The solution should be scalable.

Here's a breakdown of the process:

      Z        level=N
     / \
    Y ABAC     level=2
   /\ / \
  /  AB  AC    level=1
 /  / \ / \
X  B   A   C   level=0

Step 1: Define child-parent relationships. We define the relationships at each node, without knowing the calculation sequence or the number of levels. This is our initial information about the final dataset.

┌───────┬────────┐
│ child ┆ parent │
╞═══════╪════════╡
│ AB    ┆ A      │
│ AB    ┆ B      │
│ AC    ┆ A      │
│ AC    ┆ C      │
│ ABAC  ┆ AB     │
│ ABAC  ┆ AC     │
└───────┴────────┘

Step 2: Unfold child-parent relationships. This step involves unraveling the relationships to organize them by levels.

┌─────────┬─────────┬─────────┐
│ level_0 ┆ level_1 ┆ level_2 │
╞═════════╪═════════╪═════════╡
│ A       ┆ AB      ┆ ABAC    │
│ B       ┆ AB      ┆ ABAC    │
│ A       ┆ AC      ┆ ABAC    │
│ C       ┆ AC      ┆ ABAC    │
└─────────┴─────────┴─────────┘

Step 3: Extract the calculation sequence. By unfolding the tree, we determine the calculation sequence and the level/order of each node, enabling us to generate expressions for evaluation.

┌────────┬───────┐
│ symbol ┆ order │
╞════════╪═══════╡
│ A      ┆ 0     │
│ B      ┆ 0     │
│ C      ┆ 0     │
│ AB     ┆ 1     │
│ AC     ┆ 1     │
│ ABAC   ┆ 2     │
└────────┴───────┘

Note: Steps 1 to 3 are not part of the feature request. They merely illustrate how the calculation sequence is derived.
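For completeness, the level assignment in Steps 1 to 3 can be sketched in plain Python by repeatedly setting each node's level to 1 + max(parent levels) once all of its parents are known (a minimal sketch, not part of the request itself):

```python
# Child-parent edges from Step 1.
edges = [
    ("AB", "A"), ("AB", "B"),
    ("AC", "A"), ("AC", "C"),
    ("ABAC", "AB"), ("ABAC", "AC"),
]

children = {c for c, _ in edges}
parents = {p for _, p in edges}

# Base symbols (never a child) sit at level 0.
level = {s: 0 for s in parents - children}

# Assign level = 1 + max(parent levels) once all parents are resolved.
while len(level) < len(children | parents):
    for child in children:
        ps = [p for c, p in edges if c == child]
        if child not in level and all(p in level for p in ps):
            level[child] = 1 + max(level[p] for p in ps)

order = sorted(level.items(), key=lambda kv: kv[1])
print(order)
```

This reproduces the symbol/order table above for any acyclic child-parent relationship.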

The difficult part is how to evaluate these expressions efficiently. Currently, there is no built-in mechanism to update the context for each level automatically. Evaluating all expressions in a single context fails: the code below raises a ColumnNotFoundError, because every expression is evaluated against the original frame, whose schema does not yet contain the intermediate columns AB and AC.

import polars as pl

# Original data
data = pl.DataFrame({
    'A': [1, 2, 3],
    'B': [10, 20, 30],
    'C': [100, 200, 300],
}).cast(pl.Float64)

# Tree relationship of the dataset.
# Note: The column "order" is actually calculated and not hard coded.
calc = pl.DataFrame([
    {'symbol': 'AB', 'formula': 'A/B', 'order': 1},
    {'symbol': 'AC', 'formula': 'A/C', 'order': 1},
    {'symbol': 'ABAC', 'formula': 'AB/AC', 'order': 2},
])

# get symbols/formulas filtered and sorted by calculation order
iterator = calc.filter(pl.col('order') > 0).sort('order').select(['symbol', 'formula']).rows()

# make expressions
sql_expr = [f"{formula} AS {symbol}" for symbol, formula in iterator]

# evaluate expressions
data = data.lazy().with_columns(*pl.sql_expr(sql_expr)).collect()

# ColumnNotFoundError: AB
# Error originated just after this operation:
# DF ["A", "B", "C"]; PROJECT */3 COLUMNS; SELECTION: "None"

While a for loop can be used as a workaround, it would be great to have a built-in method to handle grouped/ordered expressions.

OyiboRivers commented 1 month ago

Another idea for a workaround was to generate empty columns first and then fill in the values in a subsequent sequential context. Unfortunately, this does not work either: all expressions within a single context, including with_columns_seq, are evaluated against the input frame, so later expressions never see the values produced by earlier ones in the same context.

# make expressions to generate empty columns
create_empty_columns = [
    pl.lit(None).cast(pl.Float64).alias(name)
    for name in calc.get_column('symbol').to_list()]

# make expressions to calculate values
calculate_values = pl.sql_expr(
    [f"{formula} AS {symbol}"
     for symbol, formula in calc.select(['symbol', 'formula']).rows()])

# evaluate expressions
data = (
    data.lazy()
    .with_columns(*create_empty_columns)
    .with_columns_seq(*calculate_values)
    .collect()
)
print(data)
# ┌─────┬──────┬───────┬─────┬──────┬──────┐
# │ A   ┆ B    ┆ C     ┆ AB  ┆ AC   ┆ ABAC │
# ╞═════╪══════╪═══════╪═════╪══════╪══════╡
# │ 1.0 ┆ 10.0 ┆ 100.0 ┆ 0.1 ┆ 0.01 ┆ null │
# │ 2.0 ┆ 20.0 ┆ 200.0 ┆ 0.1 ┆ 0.01 ┆ null │
# │ 3.0 ┆ 30.0 ┆ 300.0 ┆ 0.1 ┆ 0.01 ┆ null │
# └─────┴──────┴───────┴─────┴──────┴──────┘