uccross / skyhookdm-arrow

Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication. Languages currently supported include C, C++, Java, JavaScript, Python, and Ruby.
https://arrow.apache.org
Apache License 2.0

ARROW-12921: [C++][Dataset] Add RadosParquetFileFormat to Dataset API #146

Closed JayjeetAtGithub closed 3 years ago

JayjeetAtGithub commented 3 years ago

This PR adds a new RadosParquetFileFormat class that inherits from ParquetFileFormat and defers the evaluation of scan operations on a Parquet dataset to a RADOS storage backend. The new file format plugs into the FileSystemDataset API, converts filenames to object IDs using filesystem metadata, and uses the librados C++ library to invoke storage-side functions that scan the files on the Ceph storage nodes (OSDs) using Arrow libraries. The implementation ships with unit and integration tests, which run against a single-node Ceph cluster.
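To illustrate the filename-to-object-ID step, here is a minimal Python sketch. It is not the code from this PR. It assumes the CephFS convention of naming a file's data objects `<inode in hex>.<object index as 8 hex digits>`, and it assumes each file is stored as a single RADOS object, so only index 0 matters. The function names `object_id_from_inode` and `object_id_for_file` are hypothetical.

```python
import os


def object_id_from_inode(inode: int, object_index: int = 0) -> str:
    """Build a CephFS-style object name from an inode number.

    Assumption: data objects are named "<inode hex>.<index as 8 hex digits>".
    """
    return f"{inode:x}.{object_index:08x}"


def object_id_for_file(path: str) -> str:
    """Derive the ID of a file's first object from filesystem metadata (stat)."""
    return object_id_from_inode(os.stat(path).st_ino)


# Inode 0x10000000000 maps to the object name "10000000000.00000000".
print(object_id_from_inode(0x10000000000))
```

On a real CephFS mount, `object_id_for_file` would be called on each Parquet file path in the dataset, and the resulting name would be the target of the storage-side call. On any other filesystem it still returns a string, but that string does not refer to a RADOS object.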

The storage-side code is implemented as a RADOS CLS (object storage class) using Ceph's Object Class SDK. The code lives in cpp/src/arrow/adapters/arrow-rados-cls and is expected to be deployed on the storage nodes (Ceph's OSDs) before operating on tables via the RadosParquetFileFormat implementation. This PR includes a CMake configuration for optionally building this library (the ARROW_CLS CMake option). We have also added Python bindings for the C++ implementation, along with integration tests for them.

github-actions[bot] commented 3 years ago

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}


github-actions[bot] commented 3 years ago

https://issues.apache.org/jira/browse/ARROW-12921