pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

LazyFrame.with_context() broadcasting #19090

Open adamgreg opened 1 month ago

adamgreg commented 1 month ago

Reproducible example

import polars as pl

ctx_lf = pl.LazyFrame({'a': [None]})
print(pl.LazyFrame({'b': [1, 2]}).with_context(ctx_lf).select('a', 'b').collect())

Log output

Traceback (most recent call last):
  File "C:\Users\a5128321\AppData\Roaming\JetBrains\PyCharmCE2024.2\scratches\scratch_95.py", line 4, in <module>
    print(pl.LazyFrame({'b': [1, 2]}).with_context(ctx_lf).select('a', 'b').collect())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "P:\venv-311\Lib\site-packages\polars\lazyframe\frame.py", line 2050, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.InvalidOperationError: Series: a, length 1 doesn't match the DataFrame height of 2

If you want this Series to be broadcasted, ensure it is a scalar (for instance by adding '.first()').

Issue description

From Polars 1.7 onwards (reproduced in 1.9.0 and 1.7.0), there has been a change in the broadcasting behaviour of LazyFrame.with_context(). Previously, you could use a single-row LF to provide "default" values for columns missing in the "main" LF, regardless of its height. Now there is an exception about the difference in heights. With version 1.6.0 there is no problem.

I rely upon this behaviour in an internal package I maintain. It has a function that needs to provide default values for input tables that may have missing columns, and to add columns with repeated values taken from another DF. LazyFrame.with_context() has previously worked well for this purpose. Since it's now deprecated, I'm keen to move to an alternative solution, but I'm not sure how. I don't think concat(how="horizontal") will work, because columns may be duplicated and a single row needs to be broadcast.

Thanks for reading. I'm a huge fan of Polars, and have been evangelizing it more than ever since the API stabilized!

Expected behavior

import polars as pl

# Context LazyFrame, used to provide "default" values where a column is missing
ctx_lf = pl.LazyFrame({'a': [3], 'b': [None]})

print(pl.LazyFrame({'a': [1, 2], 'b': [1, 2]}).with_context(ctx_lf).select('a', 'b').collect())
# No exception, result unaffected by the context

print(pl.LazyFrame({'b': [1, 2]}).with_context(ctx_lf).select('a', 'b').collect())
# An "a" column in the result, with the value 3 in both rows

print(pl.LazyFrame({'a': [1, 2]}).with_context(ctx_lf).select('a', 'b').collect())
# A "b" column in the result, with null in both rows

Installed versions

```
--------Version info---------
Polars:              1.9.0
Index type:          UInt32
Platform:            Windows-10-10.0.19045-SP0
Python:              3.11.9 (tags/v3.11.9:de54cf5, Apr 2 2024, 10:12:12) [MSC v.1938 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager
altair
cloudpickle
connectorx
deltalake
fastexcel
fsspec
gevent
great_tables
matplotlib
nest_asyncio         1.6.0
numpy                1.24.4
openpyxl             3.1.2
pandas               1.5.3
pyarrow              10.0.1
pydantic             2.7.1
pyiceberg
sqlalchemy           2.0.34
torch
xlsx2csv             0.8.2
xlsxwriter           3.2.0
```

cmdlineluser commented 1 month ago

It sort of reminds me of an "anti" join - or update() but with DIFFERENCE instead of UNION.

(Not sure if a concat(..., how="anti") would make sense?)

It seems like one would have to manually find the difference in this case and forward fill?

import polars as pl

default = pl.LazyFrame({'a': [3], 'b': [None]})

tests = [
    {'a': [1, 2], 'b': [1, 2]},
    {'b': [1, 2]},
    {'a': [1, 2]}
]

for test in tests:
    lf = pl.LazyFrame(test)
    # Columns present in the defaults but missing from the input frame
    names = sorted(default.collect_schema().keys() - lf.collect_schema().keys())

    result = (
        pl.concat([lf, default.select(names)], how='horizontal')
        # The single default row is padded with nulls; forward-fill broadcasts it
        .with_columns(pl.col(names).forward_fill())
        .select('a', 'b')
        .collect()
    )
    print(result.to_dict(as_series=False))
# {'a': [1, 2], 'b': [1, 2]}
# {'a': [3, 3], 'b': [1, 2]}
# {'a': [1, 2], 'b': [None, None]}
adamgreg commented 1 month ago

Thanks @cmdlineluser, that's very interesting. The real case is complicated a little more by the fact that what is ultimately selected can be arbitrary passed-in expressions that may cut across multiple sources. I think Polars has better support for introspection and relaxed concatenation since I wrote the original code though, so I can probably treat this as an opportunity to simplify!