Closed: rlartiga closed this issue 2 months ago
thanks for the report, can confirm this reproduces
to expedite resolution, it would probably be very helpful if you could narrow down the example so it's as small as possible
@MarcoGorelli I will try
A colleague also told me that the issue started in version 0.19: with 0.18.15, the result with the optimization is the same as without it. I tested this and can confirm.
It took some effort, but here's a small-ish example which reproduces the same issue:
import polars as pl
print(pl.__version__)
holdings = pl.DataFrame({
    'fund_currency': ['CLP', 'CLP'],
    'asset_currency': ['EUR', 'USA'],
})
usd = ["USD"]
eur = ["EUR"]
clp = ['CLP']
factor_query_dict = {}
currency_factor_query_dict = {
    "CURRENCY_EUR_FUND_CLP": pl.col("asset_currency").is_in(eur)
    & pl.col("fund_currency").is_in(clp),
    "CURRENCY_EUR_FUND_USD": pl.col("asset_currency").is_in(eur)
    & pl.col("fund_currency").is_in(usd),
    "CURRENCY_CLP_FUND_CLP": pl.col("asset_currency").is_in(clp)
    & pl.col("fund_currency").is_in(clp),
    "CURRENCY_USD_FUND_USD": pl.col("asset_currency").is_in(usd)
    & pl.col("fund_currency").is_in(usd),
}
factor_holdings = holdings.lazy().with_columns(
    [
        pl.coalesce(
            pl.when(polars_query).then(pl.lit(factor))
            for factor, polars_query in currency_factor_query_dict.items()
        ).alias("currency_factor"),
    ]
)
print(factor_holdings.collect())
print(factor_holdings.collect(comm_subexpr_elim=False))
Output:
0.20.25
shape: (2, 3)
┌───────────────┬────────────────┬─────────────────┐
│ fund_currency ┆ asset_currency ┆ currency_factor │
│ ---           ┆ ---            ┆ ---             │
│ str           ┆ str            ┆ str             │
╞═══════════════╪════════════════╪═════════════════╡
│ CLP           ┆ EUR            ┆ null            │
│ CLP           ┆ USA            ┆ null            │
└───────────────┴────────────────┴─────────────────┘
shape: (2, 3)
┌───────────────┬────────────────┬───────────────────────┐
│ fund_currency ┆ asset_currency ┆ currency_factor       │
│ ---           ┆ ---            ┆ ---                   │
│ str           ┆ str            ┆ str                   │
╞═══════════════╪════════════════╪═══════════════════════╡
│ CLP           ┆ EUR            ┆ CURRENCY_EUR_FUND_CLP │
│ CLP           ┆ USA            ┆ null                  │
└───────────────┴────────────────┴───────────────────────┘
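As a sanity check on which of the two outputs is the right one, here is a plain-Python sketch of what the `coalesce` over `when/then` expressions is expected to compute: for each row, the name of the first rule whose condition holds, otherwise null. It does not use Polars at all; the helper names `rule` and `first_matching_factor` are made up for this illustration, and null inputs are not modelled.

```python
usd, eur, clp = ["USD"], ["EUR"], ["CLP"]


def rule(asset, fund):
    """Build a row predicate: asset currency in `asset` AND fund currency in `fund`."""
    return lambda row: row["asset_currency"] in asset and row["fund_currency"] in fund


# Same rules, in the same order, as currency_factor_query_dict in the repro.
rules = {
    "CURRENCY_EUR_FUND_CLP": rule(eur, clp),
    "CURRENCY_EUR_FUND_USD": rule(eur, usd),
    "CURRENCY_CLP_FUND_CLP": rule(clp, clp),
    "CURRENCY_USD_FUND_USD": rule(usd, usd),
}


def first_matching_factor(row, rules):
    """Mimic coalesce(when(cond).then(name), ...): first matching name, else None."""
    for factor, predicate in rules.items():
        if predicate(row):
            return factor
    return None


rows = [
    {"fund_currency": "CLP", "asset_currency": "EUR"},
    {"fund_currency": "CLP", "asset_currency": "USA"},
]
expected = [first_matching_factor(row, rules) for row in rows]
print(expected)  # ['CURRENCY_EUR_FUND_CLP', None]
```

This agrees with the `comm_subexpr_elim=False` output, so the all-null column from the default `collect()` is the incorrect one.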
going to nerd-snipe tag @ritchie46 on this one
Here's a slightly more minimal version:
import polars as pl
print(pl.__version__)
holdings = pl.DataFrame(
    {
        "fund_currency": ["CLP", "CLP"],
        "asset_currency": ["EUR", "USA"],
    }
)
usd = ["USD"]
eur = ["EUR"]
clp = ["CLP"]
currency_factor_query_dict = [
    pl.col("asset_currency").is_in(eur) & pl.col("fund_currency").is_in(clp),
    pl.col("asset_currency").is_in(eur) & pl.col("fund_currency").is_in(usd),
    pl.col("asset_currency").is_in(clp) & pl.col("fund_currency").is_in(clp),
    pl.col("asset_currency").is_in(usd) & pl.col("fund_currency").is_in(usd),
]
factor_holdings = holdings.lazy().with_columns(
    [
        pl.coalesce(currency_factor_query_dict).alias("currency_factor"),
    ]
)
print(factor_holdings.collect())
print(factor_holdings.collect(comm_subexpr_elim=False))
Output:
0.20.25
shape: (2, 3)
┌───────────────┬────────────────┬─────────────────┐
│ fund_currency ┆ asset_currency ┆ currency_factor │
│ ---           ┆ ---            ┆ ---             │
│ str           ┆ str            ┆ bool            │
╞═══════════════╪════════════════╪═════════════════╡
│ CLP           ┆ EUR            ┆ false           │
│ CLP           ┆ USA            ┆ false           │
└───────────────┴────────────────┴─────────────────┘
shape: (2, 3)
┌───────────────┬────────────────┬─────────────────┐
│ fund_currency ┆ asset_currency ┆ currency_factor │
│ ---           ┆ ---            ┆ ---             │
│ str           ┆ str            ┆ bool            │
╞═══════════════╪════════════════╪═════════════════╡
│ CLP           ┆ EUR            ┆ true            │
│ CLP           ┆ USA            ┆ false           │
└───────────────┴────────────────┴─────────────────┘
Checks
Reproducible example
holdings.csv
Log output
Issue description
When I run the code with the optimizer enabled, one of the created columns contains only null values. If I turn off the optimizer with
.collect(comm_subexpr_elim=False)
or
.collect(no_optimization=True)
as was suggested on Stack Overflow, the column is populated.
Expected behavior
I was expecting the same result with the optimizer turned on.
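For context on why identical results are the reasonable expectation: `comm_subexpr_elim` refers to common subexpression elimination, an optimization that computes a repeated sub-expression once and reuses the value. The toy model below is only a sketch of that idea in plain Python, not how Polars implements it; the tuple-based expression encoding and the names `evaluate`, `evaluate_all`, and `both` are assumptions made up for illustration.

```python
def evaluate(expr, row, cache=None):
    """Evaluate a tuple-encoded expression on one row.

    With a cache, each distinct sub-expression is computed at most once per row.
    """
    if cache is not None and expr in cache:
        return cache[expr]
    op = expr[0]
    if op == "is_in":
        _, column, values = expr
        result = row[column] in values
    elif op == "and":
        _, left, right = expr
        result = evaluate(left, row, cache) and evaluate(right, row, cache)
    else:
        raise ValueError(f"unknown op: {op}")
    if cache is not None:
        cache[expr] = result
    return result


def evaluate_all(predicates, row, share):
    """Evaluate every predicate on a row, optionally sharing sub-expression results."""
    cache = {} if share else None
    return [evaluate(p, row, cache) for p in predicates]


def both(asset, fund):
    """Encode: asset_currency in `asset` AND fund_currency in `fund`."""
    return ("and", ("is_in", "asset_currency", asset), ("is_in", "fund_currency", fund))


eur, usd, clp = ("EUR",), ("USD",), ("CLP",)
predicates = [both(eur, clp), both(eur, usd), both(clp, clp), both(usd, usd)]
rows = [
    {"fund_currency": "CLP", "asset_currency": "EUR"},
    {"fund_currency": "CLP", "asset_currency": "USA"},
]

plain = [evaluate_all(predicates, row, share=False) for row in rows]
shared = [evaluate_all(predicates, row, share=True) for row in rows]
assert plain == shared  # reusing sub-expression results must not change the answer
```

The predicates here repeat several `is_in` sub-expressions, which is the kind of query where this optimization applies, so a difference between the two `collect()` calls points at the optimization rather than at the query.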
Installed versions