pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

support +vL datatype in rust #18388

Open ggggggggg opened 2 months ago

ggggggggg commented 2 months ago

Description

I'm working with large amounts of data (sometimes more than 100 GB) that contain timestreams. Within the timestreams, there are interesting events I would like to look at. I want to have a dataframe containing these events as well as some associated information. pyarrow can create a LargeListViewArray, which is essentially an array of offsets and sizes into one contiguous array, to represent these events without copying data. But polars does not support the +vL (LargeListViewArray) datatype. My request is to support this; I'm hoping it's not too difficult.

Some code showing the structure of my data, the creation of a LargeListViewArray and an attempt to make a polars Series from it:

import numpy as np
import polars as pl
import pyarrow as pa

record_len = 1000 # record length
source_data = np.arange(1000000, dtype=np.int64) # all records exist in this array with unpredictable offsets
inds0 = np.arange(0, len(source_data), 10000) # temporary variable to calculate offsets
offsets = inds0 + np.random.randint(0, 1000, len(inds0)) # mimic unpredictability of record offsets
# arguments are (offsets, sizes, values); every record has the same size here
pa_records = pa.LargeListViewArray.from_arrays(offsets, [record_len]*len(offsets), source_data)

# throws "ComputeError('The datatype "+vL" is still not supported in Rust implementation')"
series = pl.Series("record", pa_records)
ggggggggg commented 2 months ago

I've had some more time to learn about the Apache Arrow Columnar format and the various layouts within it. Based on this, I will expand the argument for supporting the +vL format, which corresponds to the 64-bit variant of the ListView layout. In particular, polars should not just read the format but offer a native datatype for columns with this memory backing.

The ListView layout says "in contrast to the List layout, list lengths are stored explicitly in the sizes buffer instead of inferred. This allows offsets to be out of order. Elements of the child array do not have to be stored in the same order they logically appear in the list elements of the parent array."
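To make the quoted description concrete, here is a minimal sketch that models the layout with plain NumPy arrays. It is not the pyarrow or polars API; the `list_view_element` helper and the sample values are hypothetical and only illustrate how an explicit (offset, size) pair per row addresses a shared child array.

```python
import numpy as np

# Child array: one contiguous buffer holding every element.
values = np.arange(10, dtype=np.int64)

# ListView keeps an explicit offset AND size per list, so the offsets
# do not have to be increasing (a plain List layout infers each length
# from consecutive offsets, which forces them to be ordered).
offsets = np.array([6, 0, 3], dtype=np.int64)
sizes = np.array([3, 2, 3], dtype=np.int64)


def list_view_element(values, offsets, sizes, i):
    """Hypothetical helper: return list i as a zero-copy view into values."""
    start = offsets[i]
    return values[start:start + sizes[i]]


# Materialize the logical lists only for display.
lists = [
    list_view_element(values, offsets, sizes, i).tolist()
    for i in range(len(offsets))
]
print(lists)
```

Each element here is a NumPy slice, so it shares memory with `values` rather than copying it, which is the property the ListView layout is meant to preserve.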

I argue that supporting this layout is very well aligned with the polars philosophy laid out at docs.pola.rs.

  1. Columns based on ListView layout could be sorted by sorting the offsets without copying the underlying data. For a DataFrame with 20,000 rows and an Array column of width 5000, sorting currently takes over 500 ms and allocates 800 MB of memory. In comparison, sorting just the offsets would take less than 1 ms and allocate nearly zero memory. This is a massive improvement based on reducing unnecessary work, which is well aligned with "Optimizes queries to reduce unneeded work/memory allocations." For my applications I would like to use more than 1 million rows with arrays of that size, and polars is currently not well matched to that workload.
  2. Columns based on ListView layout can point to a buffer created by memory-mapping a file. This combination works with pyarrow in my limited testing. This is a powerful method to enable working with larger-than-RAM data, which is well aligned with "Handles datasets much larger than your available RAM."
  3. Supporting a larger subset of the Apache Arrow Columnar format will increase the range of interop, which is one of the primary goals of the Apache Arrow project.
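The first point can be sketched with NumPy alone. This is an illustration of the idea under stated assumptions, not how polars sorts today: the `key` column is a hypothetical sort key, and the arrays mirror the ones from the example in the first comment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared child buffer (8 MB of int64) that is never copied or reordered.
values = np.arange(1_000_000, dtype=np.int64)

# One (offset, size) pair per row, as in the ListView layout.
record_len = 1000
inds0 = np.arange(0, len(values), 10_000, dtype=np.int64)
offsets = inds0 + rng.integers(0, 1000, len(inds0))
sizes = np.full(len(offsets), record_len, dtype=np.int64)

# Hypothetical sort key: some other column of the same DataFrame.
key = rng.permutation(len(offsets))

# Sorting the rows only permutes the small offsets and sizes arrays.
order = np.argsort(key, kind="stable")
sorted_offsets = offsets[order]
sorted_sizes = sizes[order]

# Row 0 of the sorted column is still a zero-copy view into `values`.
first_record = values[sorted_offsets[0]:sorted_offsets[0] + sorted_sizes[0]]
print(sorted_offsets.nbytes + sorted_sizes.nbytes, "bytes reordered;",
      values.nbytes, "bytes left in place")
```

The work scales with the number of rows rather than with the total number of elements, which is where the expected savings over copying fixed-width Array rows would come from.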

It's also now clearer to me that this is probably not a small amount of work, since it requires supporting a new memory layout. How much work is this? I would be interested in sponsoring this work if there are options for sponsoring contributions.