pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Out of bounds error for certain string values while writing/reading IPC file #18636

Closed (StijnKas closed this issue 3 weeks ago)

StijnKas commented 1 month ago

Reproducible example

If I create a dummy dataframe with

  1. At least one categorical value
  2. At least one Utf8/String value
  3. Some specific string values

I get an OutOfBoundsError when reading the file back. Below are some variations I've tried, with which ones fail and which succeed:

import os
os.environ['POLARS_VERBOSE']='1'
import polars as pl

filename='testfile.ipc'
print("Writing, disabling memory map:")
df = pl.DataFrame(
    {
        "Test": pl.Series(["Value"], dtype=pl.Categorical),
        "Test2": pl.Series(["Value Two 205"], dtype=pl.Utf8),
        "Test3": pl.Series(["Value3"], dtype=pl.Utf8),
    }
)
df.write_ipc(filename)
written = pl.scan_ipc(filename, memory_map=False).collect()
print("This fails")
os.remove(filename)
print("Writing with memory map")
df = pl.DataFrame(
    {
        "Test": pl.Series(["Value"], dtype=pl.Categorical),
        "Test2": pl.Series(["Value Two 205"], dtype=pl.Utf8),
        "Test3": pl.Series(["Value3"], dtype=pl.Utf8),
    }
)
df.write_ipc(filename)
written = pl.scan_ipc(filename, memory_map=True).collect()
print("This fails")
os.remove(filename)
print("Writing while commenting the categorical value in the dataframe")
df = pl.DataFrame(
    {
        # "Test": pl.Series(["Value"], dtype=pl.Categorical),
        "Test2": pl.Series(["Value Two 205"], dtype=pl.Utf8),
        "Test3": pl.Series(["Value3"], dtype=pl.Utf8),
    }
)
df.write_ipc(filename)
written = pl.scan_ipc(filename, memory_map=True).collect()
print("This works")
os.remove(filename)
print('Writing with a smaller number value')
df = pl.DataFrame(
    {
        "Test": pl.Series(["Value"], dtype=pl.Categorical),
        "Test2": pl.Series(["Value Two 20"], dtype=pl.Utf8),
        "Test3": pl.Series(["Value3"], dtype=pl.Utf8),
    }
)
df.write_ipc(filename)
written = pl.scan_ipc(filename, memory_map=True).collect()
print("This works")

Log output

Writing, disabling memory map:
executing ipc read sync with row_index = None, n_rows = None, predicate = false for paths ["testfile.ipc"]
---------------------------------------------------------------------------
OutOfBoundsError                          Traceback (most recent call last)
Cell In[2], line 15
      7 df = pl.DataFrame(
      8     {
      9         "Test": pl.Series(["Value"], dtype=pl.Categorical),
   (...)
     12     }
     13 )
     14 df.write_ipc(filename)
---> 15 written = pl.scan_ipc(filename, memory_map=False).collect()
     16 print("This fails")

File ~/Documents/Code/polars_bug/.venv/lib/python3.12/site-packages/polars/lazyframe/frame.py:2034, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, engine, background, _eager, **_kwargs)
   2032 # Only for testing purposes
   2033 callback = _kwargs.get("post_opt_callback", callback)
-> 2034 return wrap_df(ldf.collect(callback))

OutOfBoundsError: view index out of bounds

Got: 0 buffers and index: 0

Writing with memory map
executing ipc read sync with row_index = None, n_rows = None, predicate = false for paths ["testfile.ipc"]
---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
Cell In[3], line 11
      3 df = pl.DataFrame(
      4     {
      5         "Test": pl.Series(["Value"], dtype=pl.Categorical),
   (...)
      8     }
      9 )
     10 df.write_ipc(filename)
---> 11 written = pl.scan_ipc(filename, memory_map=True).collect()
     12 print("This fails")

File ~/Documents/Code/polars_bug/.venv/lib/python3.12/site-packages/polars/lazyframe/frame.py:2034, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, engine, background, _eager, **_kwargs)
   2032 # Only for testing purposes
   2033 callback = _kwargs.get("post_opt_callback", callback)
-> 2034 return wrap_df(ldf.collect(callback))

ComputeError: buffer's length is too small in mmap

Writing while commenting the categorical value in the dataframe
This works
executing ipc read sync with row_index = None, n_rows = None, predicate = false for paths ["testfile.ipc"]

Writing with a smaller number value
This works
executing ipc read sync with row_index = None, n_rows = None, predicate = false for paths ["testfile.ipc"]

Issue description

I haven't been able to narrow the issue down completely, but it appears to be related to specific string values interacting with Categorical columns. It fails in the same way when I change the Categorical column to an Enum.
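
A possible lead, not a confirmed diagnosis: the failing value "Value Two 205" is 13 bytes long, while the passing value "Value Two 20" is 12 bytes. In Arrow's string-view layout, strings of at most 12 bytes are stored inline in the 16-byte view, and longer strings point into a separate data buffer. That would be consistent with the "Got: 0 buffers and index: 0" message above. The sketch below only checks the byte lengths of the values in the example. It assumes Polars stores String columns in that layout, and `fits_inline` and `INLINE_LIMIT_BYTES` are hypothetical names used for illustration, not Polars APIs.

```python
# Assumption: String columns use Arrow's string-view layout, where values of
# at most 12 bytes are kept inline in the view and longer values reference a
# separate data buffer.
INLINE_LIMIT_BYTES = 12


def fits_inline(value: str) -> bool:
    """Return True if the UTF-8 encoding of `value` is at most 12 bytes."""
    return len(value.encode("utf-8")) <= INLINE_LIMIT_BYTES


# The string values used in the reproducible example.
for value in ["Value Two 205", "Value Two 20", "Value3"]:
    size = len(value.encode("utf-8"))
    print(f"{value!r}: {size} bytes, inline={fits_inline(value)}")
```

Only "Value Two 205" exceeds the 12-byte limit, so it is the only value in the example that would need a data buffer under this assumption.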

This seems to be a regression: the same code works fine for me on version 0.20. I ran the failing example in a clean virtual environment on the latest Polars version (1.6.0).

Expected behavior

This should not result in an out of bounds error.

Installed versions

```
--------Version info---------
Polars:      1.6.0
Index type:  UInt32
Platform:    macOS-14.6.1-arm64-arm-64bit
Python:      3.12.4 (main, Jun 6 2024, 18:26:44) [Clang 15.0.0 (clang-1500.3.9.4)]
----Optional dependencies----
adbc_driver_manager
altair
cloudpickle
connectorx
deltalake
fastexcel
fsspec
gevent
great_tables
matplotlib
nest_asyncio 1.6.0
numpy
openpyxl
pandas
pyarrow
pydantic
pyiceberg
sqlalchemy
torch
xlsx2csv
xlsxwriter
```

coastalwhite commented 3 weeks ago

A bisect shows that this was caused by #17084.

@ritchie46 could you have a look at this?

ritchie46 commented 3 weeks ago

Yes