Closed JayjeetAtGithub closed 3 years ago
Thanks for opening a pull request!
If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW
Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.
Then could you also rename the pull request title in the following format?
ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
or
MINOR: [${COMPONENT}] ${SUMMARY}
The implementation includes a new `RadosParquetFileFormat` class that inherits from the `ParquetFileFormat` class to defer the evaluation of scan operations on a Parquet dataset to a RADOS storage backend. This new file format plugs into the `FileSystemDataset` API, converts filenames to object IDs using FS metadata, and uses the librados C++ library to execute storage-side functions that scan the files on the Ceph storage nodes (OSDs) using Arrow libraries. We ship unit and integration tests with our implementation; the tests run against a single-node Ceph cluster.

The storage-side code is implemented as a RADOS CLS (object storage class) using Ceph's Object Class SDK. The code lives in `cpp/src/arrow/adapters/arrow-rados-cls` and is expected to be deployed on the storage nodes (Ceph's OSDs) before operating on tables via the `RadosParquetFileFormat` implementation. This PR includes a CMake configuration for building this library if desired (the `ARROW_CLS` CMake option). We have also added Python bindings for our C++ implementations, along with integration tests for them.
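
To make the idea of deferring a scan to the storage side more concrete, here is a minimal, pure-Python toy model. It is not the code in this PR and does not use Arrow, librados, or Ceph. `ToyObjectStore`, `ScanRequest`, `filename_to_object_id`, and both scan helpers are hypothetical names invented for illustration. The sketch only shows the general shape: a filename is mapped to an object ID, and the projection and filter are either applied on the client after reading the whole object or applied inside the store so that only matching rows are returned.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

Row = Dict[str, int]


@dataclass
class ScanRequest:
    """Hypothetical scan description: projected columns plus a row filter."""
    columns: Sequence[str]
    predicate: Callable[[Row], bool]


class ToyObjectStore:
    """In-memory stand-in for a storage node (assumption, not Ceph)."""

    def __init__(self) -> None:
        self._objects: Dict[str, List[Row]] = {}

    def put(self, object_id: str, rows: List[Row]) -> None:
        self._objects[object_id] = list(rows)

    def read(self, object_id: str) -> List[Row]:
        # Plain read: the whole object is shipped to the caller.
        return list(self._objects[object_id])

    def exec_scan(self, object_id: str, request: ScanRequest) -> List[Row]:
        # Storage-side function: filter and project next to the data,
        # so only the matching, projected rows leave the store.
        return [
            {col: row[col] for col in request.columns}
            for row in self._objects[object_id]
            if request.predicate(row)
        ]


def filename_to_object_id(filename: str, metadata: Dict[str, str]) -> str:
    # Hypothetical lookup; here the "FS metadata" is just a dict.
    return metadata[filename]


def client_side_scan(
    store: ToyObjectStore, object_id: str, request: ScanRequest
) -> Tuple[List[Row], int]:
    """Read everything, then filter and project on the client."""
    rows = store.read(object_id)
    result = [
        {col: row[col] for col in request.columns}
        for row in rows
        if request.predicate(row)
    ]
    return result, len(rows)  # second value: rows transferred


def pushdown_scan(
    store: ToyObjectStore, object_id: str, request: ScanRequest
) -> Tuple[List[Row], int]:
    """Ask the store to evaluate the scan and return only the result."""
    result = store.exec_scan(object_id, request)
    return result, len(result)  # second value: rows transferred


# Usage: ten rows stored under one object, scanned both ways.
store = ToyObjectStore()
metadata = {"data/part-0.parquet": "obj.0001"}  # made-up names
store.put("obj.0001", [{"id": i, "value": i * 10} for i in range(10)])

request = ScanRequest(columns=["id"], predicate=lambda r: r["value"] >= 70)
oid = filename_to_object_id("data/part-0.parquet", metadata)

client_rows, client_transferred = client_side_scan(store, oid, request)
pushed_rows, pushed_transferred = pushdown_scan(store, oid, request)
```

Both paths produce the same rows; the difference in this toy model is only how many rows cross the store boundary. In the PR itself, the storage-side role is played by the RADOS CLS code and the call is made through librados, neither of which is modeled here.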