[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
If I create a dummy dataframe with:
At least one categorical value
At least one Utf8/String value
Some specific string values
I get an OutOfBoundsError when reading it back with scan_ipc and memory_map=False, or a ComputeError with memory_map=True. The variations I've tried, with their outcomes, are below:
import os
os.environ['POLARS_VERBOSE']='1'
import polars as pl
filename='testfile.ipc'
print("Writing, disabling memory map:")
df = pl.DataFrame(
    {
        "Test": pl.Series(["Value"], dtype=pl.Categorical),
        "Test2": pl.Series(["Value Two 205"], dtype=pl.Utf8),
        "Test3": pl.Series(["Value3"], dtype=pl.Utf8),
    }
)
df.write_ipc(filename)
written = pl.scan_ipc(filename, memory_map=False).collect()
print("This fails")
os.remove(filename)
print("Writing with memory map")
df = pl.DataFrame(
    {
        "Test": pl.Series(["Value"], dtype=pl.Categorical),
        "Test2": pl.Series(["Value Two 205"], dtype=pl.Utf8),
        "Test3": pl.Series(["Value3"], dtype=pl.Utf8),
    }
)
df.write_ipc(filename)
written = pl.scan_ipc(filename, memory_map=True).collect()
print("This fails")
os.remove(filename)
print("Writing while commenting the categorical value in the dataframe")
df = pl.DataFrame(
    {
        # "Test": pl.Series(["Value"], dtype=pl.Categorical),
        "Test2": pl.Series(["Value Two 205"], dtype=pl.Utf8),
        "Test3": pl.Series(["Value3"], dtype=pl.Utf8),
    }
)
df.write_ipc(filename)
written = pl.scan_ipc(filename, memory_map=True).collect()
print("This works")
os.remove(filename)
print('Writing with a smaller number value')
df = pl.DataFrame(
    {
        "Test": pl.Series(["Value"], dtype=pl.Categorical),
        "Test2": pl.Series(["Value Two 20"], dtype=pl.Utf8),
        "Test3": pl.Series(["Value3"], dtype=pl.Utf8),
    }
)
df.write_ipc(filename)
written = pl.scan_ipc(filename, memory_map=True).collect()
print("This works")
Log output
Writing, disabling memory map:
executing ipc read sync with row_index = None, n_rows = None, predicate = false for paths ["testfile.ipc"]
---------------------------------------------------------------------------
OutOfBoundsError Traceback (most recent call last)
Cell In[2], line 15
7 df = pl.DataFrame(
8 {
9 "Test": pl.Series(["Value"], dtype=pl.Categorical),
(...)
12 }
13 )
14 df.write_ipc(filename)
---> 15 written = pl.scan_ipc(filename, memory_map=False).collect()
16 print("This fails")
File ~/Documents/Code/polars_bug/.venv/lib/python3.12/site-packages/polars/lazyframe/frame.py:2034, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, engine, background, _eager, **_kwargs)
2032 # Only for testing purposes
2033 callback = _kwargs.get("post_opt_callback", callback)
-> 2034 return wrap_df(ldf.collect(callback))
OutOfBoundsError: view index out of bounds
Got: 0 buffers and index: 0
Writing with memory map
executing ipc read sync with row_index = None, n_rows = None, predicate = false for paths ["testfile.ipc"]
---------------------------------------------------------------------------
ComputeError Traceback (most recent call last)
Cell In[3], line 11
3 df = pl.DataFrame(
4 {
5 "Test": pl.Series(["Value"], dtype=pl.Categorical),
(...)
8 }
9 )
10 df.write_ipc(filename)
---> 11 written = pl.scan_ipc(filename, memory_map=True).collect()
12 print("This fails")
File ~/Documents/Code/polars_bug/.venv/lib/python3.12/site-packages/polars/lazyframe/frame.py:2034, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, engine, background, _eager, **_kwargs)
2032 # Only for testing purposes
2033 callback = _kwargs.get("post_opt_callback", callback)
-> 2034 return wrap_df(ldf.collect(callback))
ComputeError: buffer's length is too small in mmap
Writing while commenting the categorical value in the dataframe
This works
executing ipc read sync with row_index = None, n_rows = None, predicate = false for paths ["testfile.ipc"]
Writing with a smaller number value
This works
executing ipc read sync with row_index = None, n_rows = None, predicate = false for paths ["testfile.ipc"]
Issue description
I haven't been able to narrow down the issue completely, but it appears to be related to specific string values interacting with categoricals. It fails in the same way when I change the categorical column to an Enum.
This seems to be a regression: the same code works for me on version 0.20. I ran this in a clean virtual environment on the latest Polars version.
Expected behavior
Reading the file back should succeed and return the original dataframe, not raise an out-of-bounds error.
Installed versions