vamsi-amazon commented 9 months ago

SQL/PPL via DSL in Search API.

1. Problem Statement.

Today, OpenSearch supports the SQL and PPL query languages through the plugin endpoints _plugins/sql and _plugins/ppl. However, clients that use the OpenSearch client libraries face limitations, because these libraries do not accommodate plugin endpoints. To increase adoption with minimal disruption, this proposal introduces new SQL and PPL clauses directly into the SearchRequest body. The approach aims to make these languages usable through the Search API, streamlining access and integration for users.

2. Summary

Add a response selector to the _search API (e.g. "result_format":"hit_object" vs "result_format":"datarow") to support either the existing format or datarows. The default remains hit_object.
Add SQL query support to the _search API. If you send a SQL query and don't explicitly specify result_format, the format defaults to datarow.

From the user's perspective, the following example demonstrates a SQL via DSL request and response.

### Request
POST {{baseUrl}}/_search
Content-Type: application/x-ndjson

{
  "sql": {
    "query": "select 1"
  }
}

### Response
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 0,
    "successful": 0,
    "skipped": 0,
    "failed": 0
  },
  "datarows": {
    "schema": [
      {
        "name": "1",
        "type": "integer"
      }
    ],
    "datarows": [
      [
        1
      ]
    ],
    "total": 1,
    "size": 1
  }
}
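
The defaulting rule described in the summary can be sketched as a small function. This is only an illustration under stated assumptions: the function name `resolve_result_format` is hypothetical, the two format names are taken from the summary, and rejecting unknown values is an assumption rather than something the proposal specifies.

```python
# Format names taken from the summary; everything else here is illustrative.
ALLOWED_FORMATS = ("hit_object", "datarow")


def resolve_result_format(body: dict) -> str:
    """Return the response format implied by a _search request body."""
    explicit = body.get("result_format")
    if explicit is not None:
        # Assumption: unknown formats are rejected rather than ignored.
        if explicit not in ALLOWED_FORMATS:
            raise ValueError(f"unsupported result_format: {explicit!r}")
        return explicit
    # No explicit selector: SQL requests default to datarow, everything else to hit_object.
    return "datarow" if "sql" in body else "hit_object"
```

With this rule, a plain DSL request keeps the existing response shape unless the caller opts in to datarows.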

In this doc, we will discuss detailed design, limitations, and development plan.

3. Tenets:

- Minimal disruption to the existing use cases and SQL Plugin APIs.
  - This helps continue existing support for Observability plugins and the JDBC and ODBC drivers.
- Minimal duplication of code and maintenance across different use cases.
- Query execution should uphold the security policies defined in the execution context.
- The new functionality should be supported for both transport and REST high-level clients. Any changes should be made under the transport layer, not in the REST layer.

4. Solution

4.1.Search API

4.1.1. Endpoint

|Category |Method |Path |SQL Support |Description |
|--- |--- |--- |--- |--- |
|Search |GET |/target-index/_search |No Support |The SQL FROM clause specifies the index. |
|Search |GET |/_search |Support | |
|Search |POST |/target-index/_search |No Support |The SQL FROM clause specifies the index. |
|Search |POST |/_search |Support | |
|Scroll |ALL |ALL |No Support |SQL uses LIMIT and OFFSET to retrieve a portion of the rows. The syntax is not aligned with the scroll API. |
|Multi-Search |GET |_msearch |Support | |
|Multi-Search |GET |/target-indices/_msearch |No Support |The SQL FROM clause specifies the indices. |
|Multi-Search |POST |_msearch |Support | |
|Multi-Search |POST |/target-indices/_msearch |No Support |The SQL FROM clause specifies the indices. |

DSL is mainly used through the APIs above, and also through asynchronous search. A validation exception would be thrown whenever a sql block is encountered in one of the unsupported APIs above. The transport actions involved are:

TransportSearchAction
TransportMultiSearchAction
TransportSearchScrollAction
TransportSubmitAsynchronousSearchAction : Since this is just a wrapper over TransportSearchAction, it should be handled automatically.
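
The endpoint rule from the table above can be sketched as a path check. This is a minimal illustration, not the proposed implementation: the function name `validate_sql_endpoint` and the error message are hypothetical, and the body is simplified to a single JSON object even for _msearch.

```python
# Paths from the endpoint table that accept a "sql" block.
SQL_ENDPOINTS = {"/_search", "/_msearch"}


def validate_sql_endpoint(path: str, body: dict) -> None:
    """Raise ValueError when a sql block is sent to an endpoint that does not support it."""
    if "sql" not in body:
        return  # plain DSL requests are unaffected
    normalized = "/" + path.strip("/")
    if normalized not in SQL_ENDPOINTS:
        # Index-scoped and scroll paths end up here, e.g. /accounts/_search or /_search/scroll.
        raise ValueError(f"sql block is not supported on {normalized}")
```

In the proposal this check would live in the transport actions rather than in path handling; the path form is used here only because it mirrors the table.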

4.1.2.URL Parameters

URL parameters are not supported.

4.1.3.Request Body

Sample Request Body:

localhost:9200/_search
{
   "ppl" : {
        "query" : "source = accounts"
    }
}

OR

{
   "sql" : {
        "query" : "select * from accounts"
    }
}

|Field |Type |Description |SQL |
|--- |--- |--- |--- |
|aggs |Object |In the optional `aggs` parameter, you can define any number of aggregations. Each aggregation is defined by its name and one of the types of aggregations that OpenSearch supports. For more information, see Aggregations. |No Support |
|docvalue_fields |Array of objects |The fields that OpenSearch should return using their docvalue forms. Specify a format to return results in a certain format, such as date and time. |No Support |
|fields |Array |The fields to search for in the request. Specify a format to return results in a certain format, such as date and time. |No Support |
|explain |String |Whether to return details about how OpenSearch computed the document's score. Default is false. |No Support |
|from |Integer |The starting index to search from. Default is 0. |No Support |
|indices_boost |Array of objects |Values used to boost the score of specified indexes. Specify in the format of : |No Support |
|min_score |Integer |Specify a score threshold to return only documents above the threshold. |No Support |
|query |Object |The DSL query to use in the request. |No Support |
|seq_no_primary_term |Boolean |Whether to return sequence number and primary term of the last operation of each document hit. |No Support |
|size |Integer |How many results to return. Default is 10. |No Support |
|_source | |Whether to include the `_source` field in the response. |No Support |
|stats |String |Value to associate with the request for additional logging. |No Support |
|terminate_after |Integer |The maximum number of documents OpenSearch should process before terminating the request. Default is 0. |No Support |
|timeout |Time |How long to wait for a response. Default is no timeout. |Support |
|version |Boolean |Whether to include the document version in the response. |No Support |
|sql |Object | |New Field |
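
The table above implies a simple body-level check: when a sql block is present, only a small set of sibling fields is meaningful. The sketch below is an assumption-laden illustration. The helper name `unsupported_fields` is hypothetical, and allowing `result_format` next to `sql` is inferred from the summary rather than from this table.

```python
# Fields that may accompany a "sql" block: "sql" and "timeout" come from the table,
# "result_format" is an assumption based on the summary section.
SQL_COMPATIBLE_FIELDS = {"sql", "timeout", "result_format"}


def unsupported_fields(body: dict) -> list:
    """Return the request-body fields that cannot be combined with a sql block."""
    if "sql" not in body:
        return []  # DSL-only requests are not restricted
    return sorted(key for key in body if key not in SQL_COMPATIBLE_FIELDS)
```

A caller could turn a non-empty result into the validation exception mentioned in section 4.1.1.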

4.1.4.Response

If the query type is SQL, the response format is datarows. The query response includes a datarows section, which contains the schema and the datarows. For example:

{  
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 0,
    "successful": 0,
    "skipped": 0,
    "failed": 0
  },
  "datarows_output" : {
  "schema" : [
    {
      "name" : "firstname",
      "type" : "text"
    },
    {
      "name" : "lastname",
      "type" : "text"
    },
    {
      "name" : "age",
      "type" : "long"
    }
  ],
  "datarows" : [
    [
      "Nanette",
      "Bates",
      28
    ],
    [
      "Amber",
      "Duke",
      32
    ]
  ]
  }
}
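
On the client side, a schema/datarows section like the one above is easy to turn into one record per row. The following is a minimal sketch; the helper name `datarows_to_records` is hypothetical, and the section key is a parameter because this document uses both `datarows` and `datarows_output` for it.

```python
def datarows_to_records(response: dict, key: str = "datarows_output") -> list:
    """Zip each data row with the column names from the schema."""
    section = response[key]
    names = [column["name"] for column in section["schema"]]
    return [dict(zip(names, row)) for row in section["datarows"]]


# The sample response from this section, reduced to the relevant part.
sample = {
    "datarows_output": {
        "schema": [
            {"name": "firstname", "type": "text"},
            {"name": "lastname", "type": "text"},
            {"name": "age", "type": "long"},
        ],
        "datarows": [["Nanette", "Bates", 28], ["Amber", "Duke", 32]],
    }
}
records = datarows_to_records(sample)
# records[0] is {"firstname": "Nanette", "lastname": "Bates", "age": 28}
```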

4.2.Feature Parity - SQL vs DSL

Currently, SQL does not support all DSL queries and aggregations. The following table highlights the key query features that SQL lacks:

|Category |SQL |
|--- |--- |
|Compound |Support |
|Full text queries |Support |
|Geo queries |No Support |
|Shape queries |No Support |
|Joining queries |No Support |
|Span queries |No Support |
|Specialized queries |No Support |
|Term-Level queries |Support |

The following metrics aggregation functions are also missing in SQL:

|Category |SQL |
|--- |--- |
|Geo-bounds |No Support |
|Geo-centroid |No Support |
|Percentiles |No Support |
|Rate |No Support |
|T-test |No Support |

4.3.Performance

SQL queries through the search endpoint should offer performance comparable to that of DSL queries. Users should not experience any degradation in performance. We use the OpenSearch Benchmark framework to compare DSL queries with SQL via Search.

4.4.Client

All OpenSearch clients should support SQL queries and the datarow response. There are three types of clients:

- NodeClient. Changing SearchResponse would automatically update this client. [The Alerting plugin uses NodeClient to execute searches.]
- RestHighLevelClient. Changing SearchResponse would automatically update this client. [Most plugins use this client.]
- Language clients: opensearch-java, opensearch-py, opensearch-go, etc.
  - https://github.com/opensearch-project/opensearch-api-specification/issues/189
    - Switching to OpenAPI. We should wait until the switch happens and then make those changes in opensearch-api-specification.
  - https://github.com/opensearch-project/opensearch-java

4.5.Security

Calls to SQL via _search include index names in the request body, so they have the same access policy considerations as the bulk, mget, and msearch operations.

5. Detailed Design.

5.1 Approach 1: Extend SearchPlugin Interface and Integrate with SQL plugin.

In this approach, we will add a new function to the existing SearchPlugin interface to introduce a new construct called QueryEngineSpec. Plugins can implement the SearchPlugin interface to introduce a query engine that takes over the search request to produce the SearchResponse. QueryEngineSpec defines the name of the spec, which will be the key in the DSL under which the respective query engine request parameters are enclosed. Once OpenSearch-core receives a request with a clause containing a key defined by a QueryEngineSpec, OpenSearch-core creates the Query Engine and transfers the request to the plugin via the Query Engine Object.
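
The dispatch idea behind QueryEngineSpec can be sketched language-agnostically as a registry keyed by the top-level clause name. The sketch below is written in Python for brevity and is not the OpenSearch API: `QueryEngineRegistry`, `register`, and `dispatch` are hypothetical names, and rejecting a request that matches more than one engine is an assumption.

```python
from typing import Callable, Dict

# A query engine receives the clause stored under its key and returns a response dict.
QueryEngine = Callable[[dict], dict]


class QueryEngineRegistry:
    """Maps a spec name (the key in the request body) to the engine that handles it."""

    def __init__(self) -> None:
        self._engines: Dict[str, QueryEngine] = {}

    def register(self, spec_name: str, engine: QueryEngine) -> None:
        if spec_name in self._engines:
            raise ValueError(f"query engine already registered: {spec_name}")
        self._engines[spec_name] = engine

    def dispatch(self, body: dict, default: QueryEngine) -> dict:
        matches = [name for name in self._engines if name in body]
        if len(matches) > 1:
            # Assumption: mixing engine clauses in one request is rejected.
            raise ValueError(f"multiple query engine clauses: {matches}")
        if matches:
            name = matches[0]
            return self._engines[name](body[name])
        # No engine clause present: fall through to the regular DSL path.
        return default(body)
```

A plugin would call `register("sql", ...)` once at startup, and the core would call `dispatch` for every incoming search body.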

We would introduce a new field, data_rows_output, in InternalSearchResponse. The SQL and PPL plugins would populate this field along with the took, shards, and timed_out metadata. For a normal DSL query, the hits object would be populated.

#### 5.2 Approach 2: Leverage Search Pipelines and introduce new SQL/PPL processors.

In the current state, search pipelines offer the following types of processors:

* Request Processors → Transform SearchRequest.
* Response Processors → Transform SearchResponse.
* Search Phase Results Processors → Transform SearchResults between the query and fetch phases.

[The Flow framework](https://github.com/opensearch-project/flow-framework/issues/475) proposes to include a new type of search processor in Pipelines. This search processor would take a SearchRequest and produce a SearchResponse. We can leverage this new Search Processor type to introduce SQLSearchProcessor and PPLSearchProcessor, which take over the Search Request whenever there is an SQL and PPL block in the search request, respectively. Since an SQL request is a blocking operation, we would build something like processResponseAsync, which takes in a SearchResponse Listener. For including SQL/PPL-related request and response bodies, we could leverage the already existing ext clauses feature in SearchResponse and SearchRequest.

![Screenshot 2024-02-13 at 10 12 52 AM](https://github.com/opensearch-project/OpenSearch/assets/99925918/46ec9855-64e1-4095-bc59-6b14cbbe5bc8)

Request Flow

* User sends a _search request with sql/ppl block in ext block.
* SearchSourceBuilder would parse this block and will put the extBuilders in SearchRequest.
* [Need further deep dive] SearchPipelineService identifies these extBuilders and include SQLSearchProcessors and PPLSearchProcessors respectively in the Pipeline. Should we include the SQL and PPL search processors in default pipeline ?
* SQL or PPL SearchProcessor would respond back with the SearchResponse through listener passed from the CORE.

## 5. Task breakdown

|Stage |Task |Effort |Owner |Status |
|--- |--- |--- |--- |--- |
|P0 |Design Alignment |2W | Vamsi Manohar | |
|P0 |OpenSearch Core QueryEngineSpec Changes |2W | | |
|P0 |OpenSearch Core Validation Changes |1W | | |
|P0 |OpenSearch Core UTs/ITS |1W | | |
|P0 |SQL Changes for supporting PPL, UTs, ITS |2W | | |
|P0 |SQL Changes for supporting SQL, UTs, ITS |2W | | |
|P0 |SQL Changes for supporting PPL/SQL Explain |1W | | |
|P0 |Performance benchmark |2W | | |
|P0 |Threat Modelling and Pen testing |2W | | |
|P0 |Java Rest client |2W | | |
|P0 |Java client |2W | | |
|P0 |JavaScript client |4W | |Requires changes only in [opensearch-api-specification](https://github.com/opensearch-project/opensearch-api-specification). Query Params might require separate handling. |
|P1 |Python client | | | |
|P1 |Go client | | | |
|P1 |Ruby client | | | |
|P1 |PHP client | | | |
|P1 |.NET client | | | |
|P1 |Rust client | | | |
|P1 |SQL Feature Parity - Query |8W | | |
|P1 |SQL Feature Parity - Aggregation |8W | | |

## 6. Open Questions and Edge Case Scenarios.

* We should throw validation exception in all the scenarios where the endpoint doesn’t support SQL.
* What should be the validation error when an unsupported request parameter is sent in the request?
* What should be the validation error when an sql block is mixed with unrelated dsl query block?
* How do we support explain APIs in PPL and SQL API?
  * We can leverage existing `explain` query param and send the explain response back in ppl block and sql block.
* For SQL/PPL, can we exclude unrelated parameters that a usual SearchResponse contains, for example: `took`, `_shards`, `hits`, etc.?
  * The current idea is to fill them with `-1` and `null`, hinting to the customer that those fields are invalid and that the entire response will be in either the `ext` block or `ppl` block.
  * Reuse existing search response info to populate these fields.
* https://github.com/opensearch-project/opensearch-java/blob/main/java-client/src/main/java/org/opensearch/client/opensearch/OpenSearchClient.java#L1364
  * Get More understanding of this.
* What is the difference between https://github.com/opensearch-project/opensearch-java/blob/main/java-client/src/main/java/org/opensearch/client/opensearch/core/SearchResponse.java and https://github.com/opensearch-project/opensearch-java/blob/main/java-client/src/main/java/org/opensearch/client/opensearch/core/SearchResponse.java.
* How do we maintain client when there are changes in core?
* What is the difference between RestHighLevelClient, Opensearch-Java, Transport Node clients?
* Can we leverage [SearchExtBuilder](https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/search/SearchExtBuilder.java#L63)(ext block in SearchRequest and SearchResponse) instead of creating a new ppl and sql clause in SearchRequest and SearchResponse.

## 7. POC:

* https://github.com/vamsi-amazon/sql/tree/sql-in-dsl
* https://github.com/vamsi-amazon/opensearch/tree/sql-in-dsl

anirudha commented 9 months ago

Approach 1 (recommended) seems independent for query languages, which are a low-level building block that can be used in option 2 anyway.

vamsi-amazon commented 9 months ago

POC Video: sql_in_dsl-ezgif com-optimize

Draft PRs:

wbeckler commented 9 months ago

Why not modify the client so that it accesses the plugins/sql endpoint?

model-collapse commented 9 months ago

What is the usecase for this approach?

navneet1v commented 9 months ago

@vamsi-amazon have you thought about introducing ppl/sql as another query clause rather than coming up with a new Query Engine concept?

Currently in OpenSearch you can define a query type along with how to parse and convert the query into the appropriate Lucene query clause. I am wondering whether that option has been explored, and what the reason is for not choosing it and instead building a new concept altogether.

This will have many advantages:

1. Users can combine this new ppl/sql query clause with any other complex query.
2. You will get out-of-the-box support for various search features, such as concurrent segment search.
3. Aggregations and other features can be directly supported with this.

peternied commented 9 months ago

[Triage - attendees 1 2 3 4 5] @vamsi-amazon Thanks for filing this RFC looking forward to seeing where this topic lands.

peternied commented 9 months ago

> However, clients using OpenSearch client libraries face limitations, as these libraries do not accommodate plugin endpoints.

Where is this problem explored? If client library support was improved it would benefit the SQL plugin and all other plugins for OpenSearch.

penghuo commented 8 months ago

How does approach 1 work with the OpenSearch client library? Do we plan to upgrade the OpenSearch client library to support the new SQL/PPL query type?

anirudha commented 8 months ago

> Why not modify the client so that it accesses the plugins/sql endpoint?

We already support drivers for JDBC / ODBC and DBAPI. Other OpenSearch clients would need to support the JDBC/ODBC spec and enable access via SQL / PPL; this can be done in the fullness of time. Our goals are not just client user access but also developer access without introducing inter-plugin dependencies. Many of our users still use the DSL and handcraft DSL queries.

The proposal is not to move code but to maintain code modularity by adding a QueryEngineSpec.

> @vamsi-amazon have you thought about introducing ppl/sql as another query clause rather than coming up with a new Query Engine concept?
>
> Currently in OpenSearch you can define a query type along with how to parse and convert the query into the appropriate Lucene query clause. I am wondering whether that option has been explored, and what the reason is for not choosing it and instead building a new concept altogether.
>
> This will have many advantages:
>
> 1. Users can combine this new ppl/sql query clause with any other complex query.
> 2. You will get out-of-the-box support for various search features, such as concurrent segment search.
> 3. Aggregations and other features can be directly supported with this.

SQL is an independent high-level query language, hence 1 doesn't apply; 2 and 3 can still be used.

> How does approach 1 work with the OpenSearch client library? Do we plan to upgrade the OpenSearch client library to support the new SQL/PPL query type?

No, it doesn't need to work out of the box. The SQL drivers (JDBC/ODBC/DBAPI) will continue to work for developers and users.

> What is the usecase for this approach?

Integrating SQL/PPL into OpenSearch as standard languages enhances its utility and accessibility. For users, it promises compatibility with JDBC/ODBC and DBAPI clients, opening up OpenSearch to a wider audience. All features, including dashboards, will eventually support SQL/PPL by default, increasing usability. For developers, incorporating these features into the core simplifies development and avoids plugin dependencies while ensuring backward compatibility, making OpenSearch a more unified platform for querying. This move positions OpenSearch as a leading relevancy-focused SQL engine with advanced capabilities like highlighting and full-text queries.

PPL reference manual https://github.com/opensearch-project/sql/blob/main/docs/user/ppl/index.rst

SQL reference manual https://github.com/opensearch-project/sql/blob/main/docs/user/index.rst

developer docs https://github.com/opensearch-project/sql/blob/main/docs/dev/index.md

Getting started developer guide https://github.com/opensearch-project/sql/blob/main/DEVELOPER_GUIDE.rst

Drivers https://opensearch.org/downloads.html

This approach:

- Streamlines access to SQL and PPL through the standard Search API, enhancing usability.
- Encourages broader adoption by making SQL and PPL features more accessible to users unfamiliar with plugin-specific endpoints.
- Supports extensibility through the QueryEngineSpec, allowing for custom query engine implementations.
- Improves system architecture by leveraging existing interfaces and patterns, promoting a more unified and coherent platform design.
- Enhances the flexibility of OpenSearch by accommodating various query languages within a unified framework.
- Addresses current limitations and gaps in functionality with respect to SQL and PPL usage in the OpenSearch plugins/ecosystem.
- Aims to maintain backward compatibility and minimize disruption to existing workflows and applications.
- Enables OpenSearch plugins such as Alerting to support SQL- and PPL-based alerts.

OpenSearch clients don't need to support SQL / PPL by default; they are supported by the JDBC/ODBC spec'ed drivers and DBAPI. Since this is an optional clause, clients can ignore it. Search pipelines are not a low-level feature on which to implement a fundamental query language.

msfroh commented 8 months ago

> Streamlines access to SQL and PPL through the standard Search API, enhancing usability.

This is the part that bugs me. It's not using the standard Search API.

We want to access SQL/PPL with JDBC/ODBC clients. Sure. The requests are not _search API requests (i.e. SearchRequest). The responses are not _search API responses (i.e. SearchResponse).

Given that we have no interest in SearchRequest and SearchResponse, what does this have to do with the _search API?

For example, could I add a QueryEngineSpec called math, where I send a _search request, like:

localhost:9200/_search
{
   "math" : {
        "query" : "5 * 10 + 3"
    }
}

Then I get back a response like:

{
 "took": 0, 
"timed_out": false, 
"_shards": {
        "total": 0, 
        "successful": 0, 
        "skipped": 0, 
        "failed": 0 
}, 
"hits": {
   "dummy"
}, 
  "math": {
    "answer": 53
  }
}

Is that something we want to support? What things go into the _search API versus their own APIs? Does it make sense to read cluster settings from _search APIs?

There's nothing stopping me from adding a /_math endpoint via a plugin that can support its own API directly:

// Request:
localhost:9200/_math
{
  "expression" : "5 * 10 + 3"
}

// Response:
{
  "answer": 53
}

dblock commented 8 months ago

> Is that something we want to support? What things go into the _search API versus their own APIs?

This is the right question, for which I think we need some tenets. Search is over documents that are stored in indexes.

To me, search is defined as something that 1) parses a query written in some language, 2) evaluates every stored document against that query, 3) matches or doesn't match the document, 4) produces a score for all documents that match, then 5) sorts results by score and 6) returns them.

Is this an acceptable definition @msfroh?

If so, in the case of the math example or settings you're missing 2), 3), 4) and 5), so it doesn't fit under search. In the case of SQL I think it fits that definition where the language to express the query is different.

dblock commented 8 months ago

> We want to access SQL/PPL with JDBC/ODBC clients. Sure. The requests are not _search API requests (i.e. SearchRequest). The responses are not _search API responses (i.e. SearchResponse).

This is confusing to me. I suppose I don't understand internals. I think there should be a SearchRequest and SearchResponse independent of the transport API, aka we need RestSearchRequest < SearchRequest, ODBCSearchRequest < SearchRequest, etc.

msfroh commented 8 months ago

> This is confusing to me. I suppose I don't understand internals. I think there should be a SearchRequest and SearchResponse independent of the transport API, aka we need RestSearchRequest < SearchRequest, ODBCSearchRequest < SearchRequest, etc.

Aha! I like this.

I think this gets into some of the question of "What is the input used to perform an internal operation versus what is the representation sent over the wire?" that touches on the challenge that @VachaShah has encountered on her Protobuf work, made difficult by the fact that business objects have historically defined their own wire format.

I believe the approach in this proposal is "How can we embed a SQL/PPL representation of a search request inside the existing REST _search API?" Maybe instead, it should be "How can the _search API accommodate different representations of a search request?"

This almost feels like we want to support a different Content-Type for the API (albeit with a more significant interpretation of Content-Type versus the existing XContent framework from which the "business objects == serialized objects" evil arises.) Of course, dispatching a request to a /_search endpoint and forking the logic by Content-Type isn't fundamentally different from just hitting a different endpoint.

Of course, once we're on the cluster, we're forking down completely different paths. The existing SearchRequest class is married to query DSL and is remarkably low-level in its specificity. I gather that the SQL/PPL logic goes and does very different "stuff" that may eventually trigger DSL queries of its own. Ultimately, I don't think we could reasonably say that a SQL/PPL request "extends" a SearchRequest without moving, well, everything into the separate RestSearchRequest -- once you've moved the DSL-specific stuff out, there's not much left.

andrross commented 8 months ago

> In the case of SQL I think it fits that definition where the language to express the query is different.

@dblock I think by your definition the took, _shards, hits, etc. fields in the response would be relevant for any search request, but this proposal explicitly calls them "unrelated parameters". If we truly need to exclude those fields then it feels like we're shoehorning something into the API similar to @msfroh's contrived math example.

msfroh commented 8 months ago

I think I see a way that we can handle this, albeit in two steps:

1. On the response side, we add support for the schema/datarows concept as an OpenSearch core feature, where anyone can request it (even if they're doing a DSL query). This is arguably a better response format for a lot of use-cases and it's a good feature. (While Lucene and therefore OpenSearch supports a flexible schema where every doc can have its own set of fields, in practice you tend to return a bunch of docs with the same fields -- otherwise it's really hard to use.)
2. On the request side, we use something like this QueryEngine proposal to process a SearchRequest (including different syntax) and get back a SearchResponse. In this case, from an architecture standpoint, I would perhaps suggest (as in @dblock's message above) we consider it as "different" from a regular REST search request, and we fork off at the REST layer.

That way, we preserve the "SearchResponse returns documents" part that my contrived math example doesn't (though it could send an answer back in a document 😄 ). From a code standpoint, we could split off before trying to parse into a SearchSourceBuilder.

anirudha commented 8 months ago

> > In the case of SQL I think it fits that definition where the language to express the query is different.
>
> @dblock I think by your definition the took, _shards, hits, etc. fields in the response would be relevant for any search request, but this proposal explicitly calls them "unrelated parameters". If we truly need to exclude those fields then it feels like we're shoehorning something into the API similar to @msfroh's contrived math example.

While I agree with @dblock and @msfroh, let's try to standardize. But if we look at the DSL structure today, there is no spec or structure.

The OpenSearch response today is divided into roughly two parts:

- response metadata (shards, hits, etc.)
- response data

We should be able to fill all the response metadata fields, but I propose that the response data format follow the JDBC spec. That is simple to use and understand no matter what aggregation is used.


eg.
{
 "took": 23.1, 
"timed_out": false, 
"_shards": {
        "total": 2, 
        "successful": 2, 
        "skipped": 0, 
        "failed": 0 
}, 
"hits": {
   "324"
}, 
  "ppl": {
    "schema": [...],
    "datarows" : [....]
    }
}

@dblock regarding how you are defining tenets: I agree, we should have a search-first experience, not math, for example.

SQL / PPL support almost all OpenSearch relevancy features in an easy-to-use high-level language: https://github.com/opensearch-project/sql/blob/main/docs/user/beyond/fulltext.rst

This would be the only, or the most powerful, SQL dialect on a search engine that supports all relevancy features in a SQL / piped language.

I agree with @msfroh on the final comment here: https://github.com/opensearch-project/OpenSearch/issues/12434#issuecomment-2000281781

dblock commented 8 months ago

@vamsi-amazon Thoughts on updating the proposal above with the information discussed?

penghuo commented 8 months ago

I agree with @msfroh's final comment on supporting the schema/datarows concept.

We should also align on the scope and launch criteria for supporting SQL in the _search endpoint:

- SQL does not support all DSL queries and aggregations; for instance, SQL does not support geo and shape queries. Are there any concerns about mixing SQL and native DSL queries in a single endpoint?
- Searching on an index (index00001/_search) does not align with the SQL FROM syntax; the user can select any table in the SQL statement.
- Scroll / PIT search does not align with the SQL pagination syntax.
- Not all search URI parameters can be supported, for instance suggest_field.
- Not all search body fields can be used along with SQL, for instance docvalue_fields.

We should also:

- Run a performance benchmark; SQL latency and resource usage should be similar to DSL.
- Align on whether new features introduced in DSL should have a corresponding implementation in SQL.
- Have SQL support existing DSL features. This is not a P0 feature; we can keep improving it.

vamsi-amazon commented 8 months ago

@dblock Got busy with 2.13 release for last couple of days. Will update the proposal with the information discussed.

navneet1v commented 8 months ago

> Maybe instead, it should be "How can the _search API accommodate different representations of a search request?"

This is a truly awesome question and problem statement that we should take a deeper look at if we want to support different query engines.

But here are some thoughts I have after reading the conversation:

1. If we have different representations of what a SearchRequest can be, basically translating the RestSearchRequest into a bunch of engine-specific search requests, aren't we just using OpenSearch's _search as a proxy layer to call different engines (say, SQL hosted on a remote endpoint)?
2. If RestSearchRequest can handle more than one type of engine, then BulkRequest or IndexRequest should also support methods to index/put data into the underlying storage of the query engine.
3. If we do support, let's say, both index and search requests to different engines, what we have built is OpenSearch as a thin distributed system. To avoid getting into this, we need to tie the OpenSearch index in somehow with the index and search requests. Otherwise, what is the point of having data nodes, shards, etc.? All you would need is a machine that just converts one type of payload into an engine-specific payload, aka a query.

dblock commented 8 months ago

@penghuo you have good examples of how SQL doesn't align well with other queries

@navneet1v Does your question boil down to the fact that search needs to be aware of what kind of index it is?

I think both comments are just generic questions of what's constant and what's variable (OO abstractions). For example scroll / pit search does not align with sql pagination syntax - "scrolling" is a common feature over data, aka there should be a set of interfaces that together implement scrolling, but inputs may differ between SQL (e.g. offset and limit) and DSL (e.g. cursor and size), and engines can have different implementations and effectively perform scrolling functions differently, all while data fetching or distributing requests to shards are common.

anirudha commented 7 months ago

Summarizing the recent discussions above and evaluation regarding the integration of SQL into the OpenSearch core.

We explored various strategies, listed in the updated description, for incorporating SQL into the OpenSearch core, ultimately recommending against integration in favor of enhancing client capabilities.

The initial approaches considered involve directly integrating SQL into the _search endpoint or as a new endpoint within the core, each presenting distinct advantages. However, these strategies also face significant drawbacks, including limited parity in DSL query support, non-uniform response structures, increased complexity in core repo, potential build system integration challenges and significantly increased cost in upfront development with no extra customer value over the current approach.

The preferred approach advocates for maintaining the current setup with added transport clients for SQL, presenting a minimal change strategy that ensures compatibility with other plugins.

(Preferred): No Core Integration; Enhance Client Capabilities and Add Cohesion in core dashboards features.

The preferred approach aims to introduce a transport API for SQL and a new transport client library. This method ensures SQL compatibility with other plugins while maintaining the current system's integrity and minimizing changes.

“The most compelling reason I would see for merging the SQL code into core is if we think that long term the SQL query engine might want to integrate directly into the low-level Lucene query engine and there might be benefits to having the query DSL implementation living side-by-side with the SQL implementation. We'd probably want to tease the entire current query DSL engine out of the server module into its own thing, which could then also contain the SQL implementation. However, I don't really see this happening or a need for this anytime soon. SQL depends on and uses the DSL and custom scripting via DSL; it inherits all DSL advances that the core project will make.

If the plan is to keep the SQL implementation as either a plugin or module component in the core repository, then moving the code from one repository to another seems more like an administrative decision than a software architecture one.” (paraphrasing @andrross)

kavilla commented 5 months ago

@dblock https://github.com/opensearch-project/OpenSearch-Dashboards/issues/7081 if you have some time to check this one out too?

dblock commented 5 months ago

@kavilla it looks like this proposal was closed?

opensearch-project / OpenSearch

[RFC] Integrating SQL/PPL query languages into DSL via the _search API #12434
