Support for Apache Arrow data representations

TL;DR

Many Python libraries have recently integrated Apache Arrow to take advantage of its high-performance, memory-efficient columnar format for handling large datasets. Given how widely Arrow has been adopted across the data science and big data ecosystems, we plan to add Apache Arrow support to GraalPy.

Goals

The primary goal of adding Apache Arrow support to GraalPy is to improve interoperability and performance when working with libraries such as Pandas. List-like structures in GraalPy will be backed by the Apache Arrow format, so they can be handed to those libraries as zero-copy transfers. Data can then pass between GraalPy and Pandas without being duplicated in memory, which matters most for large datasets.
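To make "zero-copy" concrete, here is a minimal sketch that uses only the Python standard library. It does not use Arrow, Pandas, or any GraalPy API; the `array` object merely stands in for an Arrow-style contiguous column (an assumption made for illustration), and `memoryview` plays the role of the consumer that receives the data without duplicating it.

```python
from array import array

# A contiguous buffer of 64-bit floats, standing in for an Arrow-style column.
column = array("d", [1.0, 2.0, 3.0, 4.0])

# memoryview exposes the same memory through the buffer protocol; no bytes are copied.
view = memoryview(column)

# Slicing a memoryview yields another view onto the original buffer.
tail = view[2:]

# Writing through the view changes the original buffer,
# which shows that both names refer to the same memory.
tail[0] = 30.0

# By contrast, list(...) materializes an independent copy.
copied = list(column)
copied[0] = -1.0

print(column[2])  # changed through the view
print(column[0])  # untouched by the write to the copy
```

The write through `tail` is visible in `column`, while the write to `copied` is not. Zero-copy interchange aims for the first behavior: producer and consumer see one buffer rather than two.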

Another key goal is full interoperability with the Java implementation of Apache Arrow. Users will be able to load data in Java, run Python-based analysis on it with Pandas, and return the results to Java, all without copying the underlying memory.

Lastly, by allocating memory off-heap, GraalPy can hold buffers larger than the ~2GB ceiling that the JVM imposes on a single byte[] (array lengths are ints, so the largest possible request is new byte[Integer.MAX_VALUE]). This makes it capable of handling much larger datasets.
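As a rough worked example of that ceiling, the sketch below computes how many fixed-width values fit in one maximal byte[]. The helper names are hypothetical and exist only for this illustration, and real JVMs may cap array lengths slightly below Integer.MAX_VALUE.

```python
# Largest length a Java array can be declared with: Integer.MAX_VALUE = 2**31 - 1.
JAVA_MAX_ARRAY_LENGTH = 2**31 - 1


def max_elements_in_byte_array(element_size_bytes: int) -> int:
    """Number of fixed-width values that fit in one maximal Java byte[]."""
    return JAVA_MAX_ARRAY_LENGTH // element_size_bytes


def fits_in_one_byte_array(num_elements: int, element_size_bytes: int) -> bool:
    """True if a column of this size could be stored in a single byte[]."""
    return num_elements * element_size_bytes <= JAVA_MAX_ARRAY_LENGTH


# A column of 64-bit integers uses 8 bytes per value.
max_int64_values = max_elements_in_byte_array(8)

# 300 million int64 values need 2.4 billion bytes, which exceeds the limit.
large_column_fits = fits_in_one_byte_array(300_000_000, 8)
```

Under these assumptions a single byte[] tops out at about 268 million 64-bit values, so a 300-million-row int64 column already needs memory that is not addressed through an int-indexed array.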

Non-Goals

Replacement of existing data structures. The goal is not to replace all existing data structures. Only specific use cases will benefit from this integration.
Memory optimization. The focus is not on optimizing memory usage or speeding up operations on the existing structures.
Addressing other JVM constraints. While off-heap allocation bypasses the JVM's ~2GB limit on a single byte[], addressing other JVM-related memory constraints is outside the scope of this integration.
