opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

Library changes for Apache Arrow integration #16691

Open rishabhmaurya opened 2 days ago

rishabhmaurya commented 2 days ago

Description

Library changes for the Apache Arrow integration. The library only exposes POJOs for creating a StreamProducer and registering it with the StreamManager. A StreamProducer in turn exposes a BatchedJob, which creates and fills Arrow vectors in batches while handling client backpressure. The Arrow APIs exposed are therefore kept minimal, limited to:

  api "org.apache.arrow:arrow-vector:${versions.arrow}"
  api "org.apache.arrow:arrow-format:${versions.arrow}"
  api "org.apache.arrow:arrow-memory-core:${versions.arrow}"

The server module will depend on libs:arrow to create and register a StreamProducer for search results, populating vectors according to a well-defined search-result schema.
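
The producer, manager, and batched-job flow described above could be sketched roughly as follows. This is an illustration only, not the API in this PR: the method names (`createJob`, `registerStream`, `getProducer`, `run`, `flush`), the `BatchSink` interface, the `String` ticket standing in for StreamTicket, and the plain `List<Integer>` batches standing in for Arrow vectors are all assumptions made to keep the example self-contained. Backpressure is modeled with a bounded queue, so the producer blocks until the consumer has taken the previous batch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class StreamSketch {

    /** Hypothetical sink: flush() blocks while the consumer is still behind. */
    interface BatchSink {
        void flush(List<Integer> batch) throws InterruptedException;
    }

    /** Hypothetical job: fills batches and hands each one to the sink. */
    interface BatchedJob {
        void run(BatchSink sink) throws InterruptedException;
    }

    /** Hypothetical producer: creates a job on demand. */
    interface StreamProducer {
        BatchedJob createJob();
    }

    /** Hypothetical manager: maps an opaque ticket to a registered producer. */
    static final class StreamManager {
        private final Map<String, StreamProducer> producers = new HashMap<>();

        String registerStream(StreamProducer producer) {
            String ticket = UUID.randomUUID().toString(); // stand-in for StreamTicket
            producers.put(ticket, producer);
            return ticket;
        }

        StreamProducer getProducer(String ticket) {
            return producers.get(ticket);
        }
    }

    /** Example producer: emits 0..total-1 in batches of at most batchSize. */
    static StreamProducer rangeProducer(int total, int batchSize) {
        return () -> sink -> {
            List<Integer> batch = new ArrayList<>(batchSize);
            for (int i = 0; i < total; i++) {
                batch.add(i);
                if (batch.size() == batchSize) {
                    sink.flush(batch); // waits here if the consumer is slow
                    batch = new ArrayList<>(batchSize);
                }
            }
            if (!batch.isEmpty()) {
                sink.flush(batch); // final partial batch
            }
        };
    }

    /** Runs the job on its own thread and collects every batch it emits. */
    static List<List<Integer>> drain(StreamManager manager, String ticket) throws InterruptedException {
        // Capacity 1: the producer can be at most one batch ahead of the consumer.
        BlockingQueue<List<Integer>> queue = new ArrayBlockingQueue<>(1);
        List<Integer> endOfStream = new ArrayList<>(); // sentinel, compared by identity
        BatchedJob job = manager.getProducer(ticket).createJob();

        Thread producerThread = new Thread(() -> {
            try {
                job.run(queue::put);
                queue.put(endOfStream);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producerThread.start();

        List<List<Integer>> received = new ArrayList<>();
        for (List<Integer> batch = queue.take(); batch != endOfStream; batch = queue.take()) {
            received.add(batch);
        }
        producerThread.join();
        return received;
    }

    public static void main(String[] args) throws InterruptedException {
        StreamManager manager = new StreamManager();
        String ticket = manager.registerStream(rangeProducer(5, 2));
        System.out.println(drain(manager, ticket)); // prints [[0, 1], [2, 3], [4]]
    }
}
```

In a real integration, each batch would presumably be a set of Arrow vectors populated against the search-result schema, and the consumer side would live in the Flight module rather than next to the producer.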

A future PR will add a module, modules:arrow-flight-rpc, which will expose the Arrow Flight server and client along with the actual logic for creating a FlightStream. It will be a bulky module in terms of its direct and transitive dependencies.

Sequence Diagram

Related Issues

Resolves https://github.com/opensearch-project/OpenSearch/issues/16679

Check List

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions[bot] commented 2 days ago

Hello! We have added a performance benchmark workflow that runs when a comment is added to the PR. Please refer to https://github.com/opensearch-project/OpenSearch/blob/main/PERFORMANCE_BENCHMARKS.md for how to run benchmarks on pull requests.

github-actions[bot] commented 2 days ago

:x: Gradle check result for c4d07359a1726c5bc80be0839a8b36265489dc77: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] commented 2 days ago

:x: Gradle check result for fe262f287a4cdef73a470cc14ea4cdd40d179ab4: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] commented 2 days ago

:x: Gradle check result for 3295caf5ac06d126e4a07507aa058c17a5f78e68: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] commented 1 day ago

:white_check_mark: Gradle check result for d053e4ac92b521fc29b7aa10fbf7d0179862554e: SUCCESS

codecov[bot] commented 1 day ago

Codecov Report

Attention: Patch coverage is 67.39130% with 15 lines in your changes missing coverage. Please review.

Project coverage is 72.16%. Comparing base (05513df) to head (b7612f6).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...c/main/java/org/opensearch/arrow/StreamTicket.java | 70.45% | 4 Missing and 9 partials :warning: |
| ...main/java/org/opensearch/arrow/StreamProducer.java | 0.00% | 2 Missing :warning: |
Additional details and impacted files

```diff
@@             Coverage Diff              @@
##               main   #16691      +/-   ##
============================================
+ Coverage     72.11%   72.16%   +0.04%
- Complexity    65230    65261      +31
============================================
  Files          5318     5320       +2
  Lines        303915   303961      +46
  Branches      43975    43983       +8
============================================
+ Hits         219180   219364     +184
+ Misses        66813    66624     -189
- Partials      17922    17973      +51
```


github-actions[bot] commented 1 day ago

:white_check_mark: Gradle check result for b7612f6b748a20508c95f12bee5751be39964853: SUCCESS

reta commented 1 day ago

On the first pass, this looks more like an SPI than a library (no implementation is provided). Where would the implementations live? Does it make sense to split it into arrow-spi / arrow-core modules?

rishabhmaurya commented 1 day ago

@reta you're right. Other than StreamTicket, everything else is an interface. I like the idea of naming it arrow-spi. All implementations will be part of modules:arrow-flight-rpc, which is currently part of my feature branch.