[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
lf = pl.scan_parquet(some_big_paruqet_file).filter(some_filter_conditions)
df = pl.concat([df.with_columns(x=x) for x in (0, 0)]).collect()
# df.select("x") now has non-zero values
Log output
No response
Issue description
I'm seeing garbage values from pl.concat non-deterministically when it concats the same dataframe multiple times. It seems to be triggered more frequently if PYTHONMALLOC=malloc is used. Unfortunately, I can't give a repro here since I can't reproduce this consistently using a dummy dataset.
I did run valgrind and got the following (also non-deterministically)
valgrind --tool=memcheck python test.py
==64044== Conditional jump or move depends on uninitialised value(s)
==64044== at 0x204F205D8: polars_core::chunked_array::ops::append::update_sorted_flag_before_append (in /venvs/polars/polars.abi3.so)
==64044== by 0x2051F1ADB: <polars_core::series::implementations::SeriesWrap<polars_core::chunked_array::ChunkedArray<polars_core::datatypes::Int64Type>> as polars_core::series::series_trait::SeriesTrait>::append (in /venvs/polars/polars.abi3.so)
==64044== by 0x20571CF89: polars_core::series::Series::append (in /venvs/polars/polars.abi3.so)
==64044== by 0x205CEF0A4: polars_core::utils::accumulate_dataframes_vertical_unchecked (in /venvs/polars/polars.abi3.so)
==64044== by 0x205EDFBDE: polars_mem_engine::executors::stack::StackExec::execute_impl (in /venvs/polars/polars.abi3.so)
==64044== by 0x205EDEB94: <polars_mem_engine::executors::stack::StackExec as polars_mem_engine::executors::executor::Executor>::execute (in /venvs/polars/polars.abi3.so)
==64044== by 0x205CCE11E: polars_lazy::frame::LazyFrame::collect (in /venvs/polars/polars.abi3.so)
==64044== by 0x207C92F3D: polars_python::lazyframe::general::<impl polars_python::lazyframe::PyLazyFrame>::__pymethod_collect__ (in /venvs/polars/polars.abi3.so)
==64044== by 0x2074BC1E7: pyo3::impl_::trampoline::trampoline (in /venvs/polars/polars.abi3.so)
==64044== by 0x207C73710: polars_python::lazyframe::general::_::__INVENTORY::trampoline (in /venvs/polars/polars.abi3.so)
==64044== by 0x49EC31D: method_vectorcall_VARARGS_KEYWORDS (descrobject.c:344)
==64044== by 0x49D9D9D: UnknownInlinedFun (abstract.h:114)
==64044== by 0x49D9D9D: UnknownInlinedFun (abstract.h:123)
==64044== by 0x49D9D9D: UnknownInlinedFun (ceval.c:5967)
==64044== by 0x49D9D9D: _PyEval_EvalFrameDefault (ceval.c:4266)
Checks
Reproducible example
Log output
No response
Issue description
I'm seeing garbage values from
pl.concat
non-deterministically when it concats the same dataframe multiple times. It seems to be triggered more frequently ifPYTHONMALLOC=malloc
is used. Unfortunately, I can't give a repro here since I can't reproduce this consistently using a dummy dataset.I did run valgrind and got the following (also non-deterministically)
I know very little about rust so this is just my random guess. I'm wondering whether array append works if both self and other point to the same array (i.e. concat the same array twice case above)? https://github.com/pola-rs/polars/blob/577cd62d30361a0e03a018e8761aae7dceedd784/crates/polars-core/src/chunked_array/ops/append.rs#L199 Specifically, in this line if
chunks
vec gets resized inpush
, willother
point to some potentially freed memory region ifchunks
andother
are the same array?Expected behavior
columns
x
in the example should have all zero values.Installed versions