pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
29.97k stars 1.93k forks source link

Garbage values from pl.concat #19286

Open yiteng-guo opened 1 week ago

yiteng-guo commented 1 week ago

Checks

Reproducible example

lf = pl.scan_parquet(some_big_paruqet_file).filter(some_filter_conditions)
df = pl.concat([df.with_columns(x=x) for x in (0, 0)]).collect()
# df.select("x") now has non-zero values

Log output

No response

Issue description

I'm seeing garbage values from pl.concat non-deterministically when it concats the same dataframe multiple times. It seems to be triggered more frequently if PYTHONMALLOC=malloc is used. Unfortunately, I can't give a repro here since I can't reproduce this consistently using a dummy dataset.

I did run valgrind and got the following (also non-deterministically)

valgrind --tool=memcheck python test.py

==64044== Conditional jump or move depends on uninitialised value(s)
==64044==    at 0x204F205D8: polars_core::chunked_array::ops::append::update_sorted_flag_before_append (in /venvs/polars/polars.abi3.so)
==64044==    by 0x2051F1ADB: <polars_core::series::implementations::SeriesWrap<polars_core::chunked_array::ChunkedArray<polars_core::datatypes::Int64Type>> as polars_core::series::series_trait::SeriesTrait>::append (in /venvs/polars/polars.abi3.so)
==64044==    by 0x20571CF89: polars_core::series::Series::append (in /venvs/polars/polars.abi3.so)
==64044==    by 0x205CEF0A4: polars_core::utils::accumulate_dataframes_vertical_unchecked (in /venvs/polars/polars.abi3.so)
==64044==    by 0x205EDFBDE: polars_mem_engine::executors::stack::StackExec::execute_impl (in /venvs/polars/polars.abi3.so)
==64044==    by 0x205EDEB94: <polars_mem_engine::executors::stack::StackExec as polars_mem_engine::executors::executor::Executor>::execute (in /venvs/polars/polars.abi3.so)
==64044==    by 0x205CCE11E: polars_lazy::frame::LazyFrame::collect (in /venvs/polars/polars.abi3.so)
==64044==    by 0x207C92F3D: polars_python::lazyframe::general::<impl polars_python::lazyframe::PyLazyFrame>::__pymethod_collect__ (in /venvs/polars/polars.abi3.so)
==64044==    by 0x2074BC1E7: pyo3::impl_::trampoline::trampoline (in /venvs/polars/polars.abi3.so)
==64044==    by 0x207C73710: polars_python::lazyframe::general::_::__INVENTORY::trampoline (in /venvs/polars/polars.abi3.so)
==64044==    by 0x49EC31D: method_vectorcall_VARARGS_KEYWORDS (descrobject.c:344)
==64044==    by 0x49D9D9D: UnknownInlinedFun (abstract.h:114)
==64044==    by 0x49D9D9D: UnknownInlinedFun (abstract.h:123)
==64044==    by 0x49D9D9D: UnknownInlinedFun (ceval.c:5967)
==64044==    by 0x49D9D9D: _PyEval_EvalFrameDefault (ceval.c:4266)

I know very little about rust so this is just my random guess. I'm wondering whether array append works if both self and other point to the same array (i.e. concat the same array twice case above)? https://github.com/pola-rs/polars/blob/577cd62d30361a0e03a018e8761aae7dceedd784/crates/polars-core/src/chunked_array/ops/append.rs#L199 Specifically, in this line if chunks vec gets resized in push, will other point to some potentially freed memory region if chunks and other are the same array?

Expected behavior

columns x in the example should have all zero values.

Installed versions

``` In [17]: pl.show_versions() --------Version info--------- Polars: 1.9.0 Index type: UInt32 Python: 3.10.13 ```
ritchie46 commented 1 week ago

Can you create an MRE? There is nothing we can do without a reproduction.