oracle / graalpython

GraalPy – A high-performance embeddable Python 3 runtime for Java
https://www.graalvm.org/python/

Support for Apache Arrow data representations #436

Open fniephaus opened 1 month ago

TL;DR

Many Python libraries have recently integrated Apache Arrow to take advantage of its high-performance, memory-efficient columnar format for handling large datasets. Given the widespread adoption of Arrow in data science and big data ecosystems, we plan to add Apache Arrow support to GraalPy.

Goals

The primary goal of adding Apache Arrow support to GraalPy is to improve interoperability and performance when working with libraries such as Pandas. List-like structures in GraalPy will be backed by the Apache Arrow format, so they can be handed to those libraries as zero-copy data transfers. Data can then be passed between GraalPy and Pandas without duplicating memory, which should significantly improve performance, especially for large datasets.

Another key goal is to facilitate full interoperability with the Java implementation of Apache Arrow. This will enable users to load data in Java, execute Python-based data analysis using Pandas, and return results to Java, all without any memory copies, ensuring smooth, high-performance cross-language workflows.
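
To make "zero-copy" concrete, here is a minimal sketch that uses only JDK classes. It is not GraalPy or Apache Arrow code, and the class and method names (`ZeroCopyViews`, `allocateInts`) are hypothetical. It shows two views over the same off-heap buffer: a value written through one view is visible through the other because no bytes are duplicated. Sharing Arrow buffers between Java and Python would rely on the same principle.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.IntBuffer;

// Hypothetical illustration class; not part of GraalPy or Apache Arrow.
class ZeroCopyViews {

    /** Allocates off-heap (direct) storage for `count` 32-bit integers. */
    static ByteBuffer allocateInts(int count) {
        return ByteBuffer.allocateDirect(count * Integer.BYTES)
                .order(ByteOrder.nativeOrder());
    }

    public static void main(String[] args) {
        ByteBuffer raw = allocateInts(4);

        // Two independent views over the same memory; neither copies data.
        IntBuffer producer = raw.asIntBuffer();
        IntBuffer consumer = raw.asIntBuffer();

        // A write through one view is observed through the other.
        producer.put(0, 42);
        System.out.println(consumer.get(0));
    }
}
```

The two `IntBuffer` objects stand in for a producer and a consumer on different sides of a language boundary. Only the small view objects are created per side, while the underlying memory is allocated once.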

Lastly, by allocating memory off-heap, GraalPy can create buffers larger than the ~2 GB limit that the JVM places on a single byte[] (the largest possible request is new byte[Integer.MAX_VALUE]), making it capable of handling much larger datasets.
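
For context, Java arrays are indexed with `int`, so a single `byte[]` can hold at most `Integer.MAX_VALUE` = 2^31 - 1 = 2,147,483,647 bytes, just under 2 GiB. The sketch below shows one way to address more than that with a `long` index by spreading the data over several direct (off-heap) buffers. It is an illustration under stated assumptions, not how GraalPy or Arrow implement this: the class `ChunkedOffHeapBuffer` and its methods are hypothetical, and a real implementation would more likely use a single long-addressed native allocation.

```java
import java.nio.ByteBuffer;

// Hypothetical helper for illustration; not GraalPy or Apache Arrow code.
class ChunkedOffHeapBuffer {
    private final ByteBuffer[] chunks;
    private final int chunkSize;
    private final long capacity;

    ChunkedOffHeapBuffer(long capacity, int chunkSize) {
        if (capacity < 0 || chunkSize <= 0) {
            throw new IllegalArgumentException("capacity must be >= 0 and chunkSize > 0");
        }
        this.capacity = capacity;
        this.chunkSize = chunkSize;
        // Round up so that a partial last chunk is included.
        int count = Math.toIntExact((capacity + chunkSize - 1) / chunkSize);
        this.chunks = new ByteBuffer[count];
        long remaining = capacity;
        for (int i = 0; i < count; i++) {
            int size = (int) Math.min(chunkSize, remaining);
            chunks[i] = ByteBuffer.allocateDirect(size); // off-heap, zero-filled
            remaining -= size;
        }
    }

    long capacity() {
        return capacity;
    }

    byte get(long index) {
        checkIndex(index);
        return chunks[(int) (index / chunkSize)].get((int) (index % chunkSize));
    }

    void put(long index, byte value) {
        checkIndex(index);
        chunks[(int) (index / chunkSize)].put((int) (index % chunkSize), value);
    }

    private void checkIndex(long index) {
        if (index < 0 || index >= capacity) {
            throw new IndexOutOfBoundsException("index " + index + " out of range");
        }
    }

    public static void main(String[] args) {
        // Small sizes keep the demo cheap; a real chunk could be 1 << 30 bytes.
        ChunkedOffHeapBuffer buffer = new ChunkedOffHeapBuffer(10L, 4);
        buffer.put(9L, (byte) 5);
        System.out.println(buffer.get(9L));
    }
}
```

With a chunk size of `1 << 30`, three chunks would already exceed what one `byte[]` can hold, subject to the direct-memory limit configured for the JVM (`-XX:MaxDirectMemorySize`).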

Non-Goals