Open YANG-DB opened 1 month ago
@penghuo @LantaoJin can you please review and comment? Thanks.
IMO, there are two kinds of cases that would leverage opensearch-hadoop (I haven't done a deep dive; from the description, it sounds like a connector):
In case 1, the target query comes from the Hadoop ecosystem (Hive, Spark, Presto). There seems to be no requirement for enhancement here; this is what the opensearch-hadoop project already does. In case 2, the target query comes from OpenSearch (DSL/SQL/PPL). My question is: what is the user story for this kind of cross-engine query? Why not just run SQL via case 1? What problem does this solution resolve? Have you heard any requests for this from the community so far?
Is your feature request related to a problem? Currently, OpenSearch does not provide a way to perform cross-index or cross-cluster joins using the OpenSearch DSL. This RFC proposes to extend OpenSearch's capabilities by leveraging Spark's MPP engine through the OpenSearch Flint API and Hadoop OpenSearch Library. This will enable users to execute cross-index joins natively via Spark, while abstracting away the underlying complexity.
What solution would you like?
The Problem Statement
The OpenSearch query engine lacks native support for cross-index (or cross-cluster) joins. This limitation hinders scenarios where users need to merge data residing in different indexes (or clusters), forcing them to combine the results manually at the application level or, in the SQL plugin case, to have the OpenSearch coordinating node run both sides of the join.
In large-scale data environments, this becomes a bottleneck for performing analytics or relational-style queries across distributed datasets.
Proposed Solution
We propose using Apache Spark's engine, via the OpenSearch Flint API (using PPL join commands) and the Hadoop OpenSearch integration, to allow users to perform cross-index joins. The solution will:
Abstract the complexity of Spark: Users will be able to write a query in OpenSearch's PPL, but behind the scenes, the query will be translated into a Spark SQL query that can perform the join.
Support for relational joins: The system will allow different types of joins (INNER, LEFT OUTER, etc.) between indexes residing on the same or different clusters.
Leverage Spark’s distributed processing: By utilizing Spark's distributed architecture, the solution will ensure scalability and performance in handling large datasets.
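To make the abstraction concrete, here is a sketch of what such a query could look like. The PPL syntax and the index names (`orders`, `customers`) are illustrative assumptions, not a finalized grammar; the point is that a PPL join command would be translated into an equivalent Spark SQL query executed by the Flint/Spark layer.

```sql
-- Hypothetical PPL cross-index join (syntax and index names are illustrative):
--   source = orders
--   | join left = o right = c ON o.customer_id = c.id customers
--   | fields o.order_id, c.name
--
-- A possible Spark SQL translation run by the Flint/Spark layer:
SELECT o.order_id, c.name
FROM orders o
INNER JOIN customers c
  ON o.customer_id = c.id;
```

The same translation pattern would apply to the other join types (LEFT OUTER, etc.), with the join keyword in the generated Spark SQL changing accordingly.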
Architecture
Points to Consider
Hive - Table
Today, Spark/Flint can query Hive tables through opensearch-hadoop.
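As a sketch of that path, opensearch-hadoop's Hive integration exposes an OpenSearch index as an external Hive table via a storage handler. The exact class name and option keys may vary by opensearch-hadoop version, and the index name, columns, and host below are illustrative assumptions:

```sql
-- Sketch: exposing an OpenSearch index as an external Hive table
-- (storage-handler class and property names may differ by version)
CREATE EXTERNAL TABLE logs (
  ts      TIMESTAMP,
  level   STRING,
  message STRING
)
STORED BY 'org.opensearch.hadoop.hive.OpenSearchStorageHandler'
TBLPROPERTIES (
  'opensearch.resource' = 'logs-index',          -- target index
  'opensearch.nodes'    = 'opensearch-host:9200' -- cluster endpoint
);
```

Once declared, the table can be queried (and joined with other Hive tables) with ordinary HiveQL, which is what makes this path attractive as a building block for the cross-index join proposal.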
Spark - SQL
Spark SQL can read indices directly from Spark, as shown here:
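A minimal sketch of that direct-read path, assuming the data source short name registered by opensearch-hadoop is `opensearch` (the option keys, index name, and host are illustrative and may differ by version):

```sql
-- Sketch: mapping an OpenSearch index to a Spark SQL view via opensearch-hadoop
CREATE TEMPORARY VIEW web_logs
USING opensearch
OPTIONS (
  'opensearch.resource' = 'web-logs',            -- index to read
  'opensearch.nodes'    = 'opensearch-host:9200' -- cluster endpoint
);

-- The view can then be queried, or joined against views backed by other
-- indices or clusters, with plain Spark SQL:
SELECT status, count(*) AS hits
FROM web_logs
GROUP BY status;
```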
Do you have any additional context?