miranov25 / RootInteractive


Support - Interface for apache arrow (Alice Run3) #176

Open miranov25 opened 2 years ago

miranov25 commented 2 years ago

Options:

Links:

https://arrow.apache.org/docs/python/pandas.html

https://arrow.apache.org/docs/python/getstarted.html

Memory Usage and Zero Copy

When converting from Arrow data structures to pandas objects using various to_pandas methods, one must occasionally be mindful of issues related to performance and memory usage.

Since pandas’s internal data representation is generally different from the Arrow columnar format, zero copy conversions (where no memory allocation or computation is required) are only possible in certain limited cases.

In the worst case scenario, calling to_pandas will result in two versions of the data in memory, one for Arrow and one for pandas, yielding approximately twice the memory footprint. We have implemented some mitigations for this case, particularly when creating large DataFrame objects, that we describe below.

Zero Copy Series Conversions

Zero copy conversions from Array or ChunkedArray to NumPy arrays or pandas Series are possible in certain narrow cases:

The Arrow data is stored in an integer (signed or unsigned int8 through int64) or floating point type (float16 through float64). This includes many numeric types as well as timestamps.

The Arrow data has no null values (since these are represented using bitmaps which are not supported by pandas).

For ChunkedArray, the data consists of a single chunk, i.e. arr.num_chunks == 1. Multiple chunks will always require a copy because of pandas’s contiguousness requirement.

In these scenarios, to_pandas or to_numpy will be zero copy. In all other scenarios, a copy will be required.