Open ggggggggg opened 2 months ago
I've had some more time to learn about the Apache Arrow Columnar format and the various layouts within it. Based on this, I will expand the argument for supporting the +vL format, which corresponds to the 64-bit variant of the ListView layout. In particular, polars should not just read the format but also offer a native datatype for columns with this memory backing.
The description of the ListView layout says "in contrast to the List layout, list lengths are stored explicitly in the sizes buffer instead of inferred. This allows offsets to be out of order. Elements of the child array do not have to be stored in the same order they logically appear in the list elements of the parent array."
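To make the quoted description concrete, here is a minimal sketch of the layout using plain Python lists. It is a toy model, not the pyarrow or polars API; the names `values`, `offsets`, `sizes`, and `list_view_element` are chosen here for illustration only.

```python
# Toy model of the ListView layout (illustrative only, not a real Arrow API).
values = [10, 11, 12, 20, 21, 30]  # child array: one contiguous buffer
offsets = [3, 0, 5]                # start of each list element; may be out of order
sizes = [2, 3, 1]                  # explicit length of each list element


def list_view_element(i):
    """Return logical list element i as a slice of the child values."""
    start = offsets[i]
    return values[start:start + sizes[i]]


# The logical order of the lists differs from their physical order in `values`.
logical = [list_view_element(i) for i in range(len(offsets))]
print(logical)  # [[20, 21], [10, 11, 12], [30]]
```

Because each element carries its own size, nothing requires the offsets to be increasing, which is the property the rest of this argument relies on.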
I argue that supporting this layout is very well aligned with the polars philosophy laid out at docs.polars.rs.
A ListView layout could be sorted by sorting the offsets without copying the underlying data. For a DataFrame with an Array column of length 5000 and 20,000 rows, the time to sort is over 500 ms and it allocates 800 MB of memory. In comparison, sorting just the offsets would take less than 1 ms and allocate nearly zero memory. This is a massive improvement based on reducing unnecessary work, which is well aligned with "Optimizes queries to reduce unneeded work/memory allocations." For my applications I would like to use more than 1 million rows with arrays of that size, and polars is currently not well matched to that workload.

A ListView layout can point to a buffer created by memory-mapping a file. This combination works with pyarrow in my limited testing. This is a powerful method to enable working with larger-than-RAM data, which is well aligned with "Handles datasets much larger than your available RAM."

It's also now clearer to me that this is probably not a small amount of work, since it requires supporting a new memory layout. How much work is this? I would be interested in sponsoring this work if there are options for sponsoring contributions.
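The sort-by-offsets idea from the first point can be sketched in plain Python. This is only a toy model of the layout, not how polars or pyarrow would implement it, and the `keys` column and all variable names are hypothetical.

```python
# Toy model: sort the rows of a list-view column by a key column
# without touching the child values buffer.
values = [5.0, 1.0, 9.0, 2.0, 7.0, 3.0]  # child buffer, never copied or reordered
offsets = [0, 2, 4]                      # one offset per row
sizes = [2, 2, 2]                        # one size per row
keys = [30, 10, 20]                      # hypothetical sort column, one key per row

# Compute the row permutation from the keys alone.
order = sorted(range(len(keys)), key=keys.__getitem__)

# Only the small offsets and sizes buffers are rewritten.
sorted_offsets = [offsets[i] for i in order]
sorted_sizes = [sizes[i] for i in order]

# Reading the rows back through the permuted offsets gives the sorted view.
rows = [values[o:o + s] for o, s in zip(sorted_offsets, sorted_sizes)]
print(rows)  # [[9.0, 2.0], [7.0, 3.0], [5.0, 1.0]]
```

The work scales with the number of rows rather than with the total number of child elements, which is where the expected saving comes from.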
Description
I'm working with large amounts of data (sometimes more than 100 GB) which contain timestreams. Within the timestreams, there are interesting events I would like to look at. I want to have a dataframe containing these events as well as some associated information. pyarrow can create a LargeListViewArray, which is essentially an array of offsets into one contiguous array, to represent these events without copying data. But polars does not support the +vL or LargeListViewArray datatype. My request is to support this; I'm hoping it's not too difficult.

Some code showing the structure of my data, the creation of a LargeListViewArray, and an attempt to make a polars Series from it: