pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
30.58k stars 1.98k forks source link

assert_series_equal and assert_frame_equal are inconsistent #18389

Open jcmuel opened 3 months ago

jcmuel commented 3 months ago

Checks

Reproducible example

import polars as pl
from polars.testing import assert_series_not_equal, assert_frame_not_equal

def test_assert_series_equal_is_consistent() -> None:
    sr1 = pl.Series([{'a': True}, {'a': None}, None])
    sr2 = pl.Series([{'a': True}, {'a': None}, {'a': None}])
    assert sr1.dtype == sr2.dtype and sr1.dtype == pl.Struct({'a': pl.Boolean})

    # The following assertion passes and confirms that the two frames below differ.
    sr1_in_list = pl.DataFrame([{'c': sr1}])
    sr2_in_list = pl.DataFrame([{'c': sr2}])
    assert_frame_not_equal(sr1_in_list, sr2_in_list)

    # This assertion fails in polars 1.5.0, even though the series differ.
    assert_series_not_equal(sr1, sr2)

Log output

Launching pytest with arguments unit_tests/test_polars/test_testing.py::test_assert_series_equal_is_consistent --no-header --no-summary -q in test/python

============================= test session starts ==============================
collecting ... collected 1 item

unit_tests/test_polars/test_testing.py::test_assert_series_equal_is_consistent 

======================== 1 failed, 4 warnings in 0.17s =========================
FAILED [100%]
test/python/unit_tests/test_polars/test_testing.py:4 (test_assert_series_equal_is_consistent)
def test_assert_series_equal_is_consistent() -> None:
        sr1 = pl.Series([{'a': True}, {'a': None}, None])
        sr2 = pl.Series([{'a': True}, {'a': None}, {'a': None}])
        assert sr1.dtype == sr2.dtype and sr1.dtype == pl.Struct({'a': pl.Boolean})

        # The following assertion passes and confirms that the two frames below differ.
        sr1_in_list = pl.DataFrame([{'c': sr1}])
        sr2_in_list = pl.DataFrame([{'c': sr2}])
        assert_frame_not_equal(sr1_in_list, sr2_in_list)

        # This assertion fails in polars 1.5.0, even though the series differ.
>       assert_series_not_equal(sr1, sr2)

unit_tests/test_polars/test_testing.py:16: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

args = (shape: (3,)
Series: '' [struct[1]]
[
    {true}
    {null}
    null
], shape: (3,)
Series: '' [struct[1]]
[
    {true}
    {null}
    {null}
])
kwargs = {}

    @wraps(function)
    def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
        _rename_keyword_argument(
            old_name, new_name, kwargs, function.__qualname__, version
        )
>       return function(*args, **kwargs)
E       AssertionError: Series are equal

python3.12/site-packages/polars/_utils/deprecation.py:91: AssertionError

Process finished with exit code 1

Issue description

The test first creates two DataFrames. Each of them contains a single column of datatype list of struct and a single row. One list contains a Null element whereas the other one contains a struct with all Null values. All other elements are the same.

This is a contradiction.

Expected behavior

The most consistent expected behavior seems to be to treat Null and a struct with all Null-fields as two different things, in the same way as assert_frame_not_equal is already doing it.

The test should pass.

Installed versions

``` --------Version info--------- Polars: 1.5.0 Index type: UInt32 Platform: macOS-14.6.1-arm64-arm-64bit Python: 3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:54:21) [Clang 16.0.6 ] ----Optional dependencies---- adbc_driver_manager: cloudpickle: connectorx: deltalake: fastexcel: fsspec: gevent: great_tables: hvplot: matplotlib: nest_asyncio: numpy: 1.26.4 openpyxl: pandas: 2.2.1 pyarrow: 15.0.2 pydantic: 2.6.4 pyiceberg: sqlalchemy: torch: xlsx2csv: xlsxwriter: ```
jcmuel commented 3 months ago

In this issue (#18389), I combined the examples from #18119 and #18230, to show that assert_frame_equal and assert_series_equal are inconsistent in the way how they are treating Null structs and structs with all Null fields.

In #18230, @cmdlineluser suggested that the new behavior of list.drop_nulls() should now be considered to be correct, and that function treats Null and {Null} as two different things.

But in #18119, it seems like the conclusion was that assert_series_equal should treat Null and {Null} as the same.

This creates inconsistencies and this bug report identifies one of them.