Closed vamsi-amazon closed 7 months ago
approach 1 (recommended)/ seem independent for query languages. which are a low level building block that can be used in option 2 anyways
Why not modify the client so that it access the plugins/sql endpoint?
What is the usecase for this approach?
@vamsi-amazon have you thought introducing the ppl/sql as a another query clause rather than coming with a new concept Query engine?
Currently in Opensearch you can define a query type along with how to parse and convert the query into apt Lucene query clause. Wondering if we have explored that option and what is the reason for not choosing that option and rather than building a new concept all together?
This will have many advantages:
However, clients using OpenSearch client libraries face limitations, as these libraries do not accommodate plugin endpoints.
Where is this problem explored? If client library support was improved it would benefit the SQL plugin and all other plugins for OpenSearch.
How does approach 1
works with OpenSearch client library? Do we plan to upgrade OpenSearch client library to support new SQL/PPL query type?
Why not modify the client so that it access the plugins/sql endpoint?
We already support driver in JDBC / ODBC and dbapi. other opensearch clients would need to support the jdbc/odbc spec and enable access via SQL / PPL; this can be done in the fullness of time. Our goals are not just client user access / but also developer access without introducing inter-plugin dependencies. Many of our users still use the dsl and hand craft DSL queries.
The proposal is not to move code but maintain code modularity by adding a QueryEngineSpec.
@vamsi-amazon have you thought introducing the ppl/sql as a another query clause rather than coming with a new concept Query engine?
Currently in Opensearch you can define a query type along with how to parse and convert the query into apt Lucene query clause. Wondering if we have explored that option and what is the reason for not choosing that option and rather than building a new concept all together?
This will have many advantages:
- User can fit this new ppl/sql query clause with any other complex query.
- You will get out box support for various features of search like concurrent segment search etc.
- Aggregations and other features can be directly supported with this.
SQL is an independent high-level query language hence 1 doesn't apply ; 2,3 can still be used
How does
approach 1
works with OpenSearch client library? Do we plan to upgrade OpenSearch client library to support new SQL/PPL query type?
no, it doesn't need to work out of the box. SQL drivers JDBC/ODBC/DBAPI will continue to work for the developers and users.
What is the usecase for this approach?
Integrating SQL/PPL into OpenSearch as standard languages enhances its utility and accessibility. For users, it promises compatibility with JDBC/ODBC and DBAPI clients, opening up OpenSearch to a wider audience. All features, including dashboards, will eventually support SQL/PPL by default, increasing usability. For developers, incorporating these features into the core simplifies development, avoids plugin dependencies while ensures backward compatibility, making OpenSearch a more unified platform for querying. This move positions OpenSearch as a leading relevancy-focused SQL engine with advanced capabilities like highlighting and full-text queries.
PPL reference manual https://github.com/opensearch-project/sql/blob/main/docs/user/ppl/index.rst
SQL reference manual https://github.com/opensearch-project/sql/blob/main/docs/user/index.rst
developer docs https://github.com/opensearch-project/sql/blob/main/docs/dev/index.md
Getting started developer guide https://github.com/opensearch-project/sql/blob/main/DEVELOPER_GUIDE.rst
Drivers https://opensearch.org/downloads.html
this approach will/
Opensearch client don't need to support SQL / PPL by default- they are supported by the jdbc/odbc spec'ed drivers and dbapi. Since this is an optional clause clients can ignore it. Search pipelines is not a low level feature to implement a fundamental query language.
Streamlines access to SQL and PPL through the standard Search API, enhancing usability.
This is the part that bugs me. It's not using the standard Search API.
We want to access SQL/PPL with JDBC/ODBC clients. Sure. The requests are not _search
API requests (i.e. SearchRequest
). The responses are not _search
API responses (i.e. SearchResponse
).
Given that we have no interest in SearchRequest
and SearchResponse
, what does this have to do with the _search
API?
For example, could I add a QueryEngineSpec
called math
, where I send a _search
request, like:
localhost:9200/_search
{
"math" : {
"query" : "5 * 10 + 3"
}
}
Then I get back a response like:
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 0,
"successful": 0,
"skipped": 0,
"failed": 0
},
"hits": {
"dummy"
},
"math": {
"answer": 53
}
}
Is that something we want to support? What things go into the _search
API versus their own APIs? Does it make sense to read cluster settings from _search
APIs?
There's nothing stopping me from adding a /_math
endpoint via a plugin that can support its own API directly:
// Request:
localhost:9200/_math
{
"expression" : "5 * 10 + 3"
}
// Response:
{
"answer": 53
}
Is that something we want to support? What things go into the
_search
API versus their own APIs?
This is the right question, for which I think we need some tenets. Search is over documents that are stored in indexes.
To me, search is defined by 1) parses a query written in some language, 2) evaluates every stored document against that query, 3) matches or doesn't match the document, 4) produces a score for all documents that match, then 5) sorts results my score and 6) returns them.
Is this an acceptable definition @msfroh?
If so, in the case of the math example or settings you're missing 2), 3), 4) and 5), so it doesn't fit under search. In the case of SQL I think it fits that definition where the language to express the query is different.
We want to access SQL/PPL with JDBC/ODBC clients. Sure. The requests are not
_search
API requests (i.e.SearchRequest
). The responses are not_search
API responses (i.e.SearchResponse
).
This is confusing to me. I suppose I don't understand internals. I think there should be a SearchRequest
and SearchResponse
independent of the transport API, aka we need RestSearchRequest < SearchRequest
, ODBCSearchRequest < SearchRequest
, etc.
This is confusing to me. I suppose I don't understand internals. I think there should be a SearchRequest and SearchResponse independent of the transport API, aka we need RestSearchRequest < SearchRequest, ODBCSearchRequest < SearchRequest, etc.
Aha! I like this.
I think this gets into some of the question of "What is the input used to perform an internal operation versus what is the representation sent over the wire?" that touches on the challenge that @VachaShah has encountered on her Protobuf work, made difficult by the fact that business objects have historically defined their own wire format.
I believe the approach in this proposal is "How can we embed a SQL/PPL representation of a search request inside the existing REST _search
API?" Maybe instead, it should be "How can the _search
API accommodate different representations of a search request?"
This almost feels like we want to support a different Content-Type
for the API (albeit with a more significant interpretation of Content-Type
versus the existing XContent
framework from which the "business objects == serialized objects" evil arises.) Of course, dispatching a request to a /_search
endpoint and forking the logic by Content-Type
isn't fundamentally different from just hitting a different endpoint.
Of course, once we're on the cluster, we're forking down completely different paths. The existing SearchRequest
class is married to query DSL and is remarkably low-level in its specificity. I gather that the SQL/PPL logic goes and does very different "stuff" that may eventually trigger DSL queries of its own. Ultimately, I don't think we could reasonably say that a SQL/PPL request "extends" a SearchRequest
without moving, well, everything into the separate RestSearchRequest
-- once you've moved the DSL-specific stuff out, there's not much left.
In the case of SQL I think it fits that definition where the language to express the query is different.
@dblock I think by your definition the took, _shards, hits, etc. fields in the response would be relevant for any search request, but this proposal explicitly calls them "unrelated parameters". If we truly need to exclude those fields then it feels like we're shoehorning something into the API similar to @froh's contrived math
example.
I think I see a way that we can handle this, albeit in two steps:
QueryEngine
proposal to process a SearchRequest
(including different syntax) and get back a SearchResponse
. In this case, from an architecture standpoint, I would perhaps suggest (as in @dblock's message above) we consider it as "different" from a regular REST search request, and we fork off at the REST layer.That way, we preserve the "SearchResponse returns documents" part that my contrived math example doesn't (though it could send an answer back in a document 😄 ). From a code standpoint, we could split off before trying to parse into a SearchSourceBuilder
.
In the case of SQL I think it fits that definition where the language to express the query is different.
@dblock I think by your definition the took, _shards, hits, etc. fields in the response would be relevant for any search request, but this proposal explicitly calls them "unrelated parameters". If we truly need to exclude those fields then it feels like we're shoehorning something into the API similar to @froh's contrived
math
example. While I agree with @dblock and @msfroh , lets do this lets to try and standardize; -> but, if we see the DSL structure today; there is to spec or structure.
Opensearch response today is divided in to roughly 2 parts
we should be able to fill all the response metadata fields; but the response data format I propose be jdbc spec'ed. that is easy/simple to use and understand no matter what aggregation is used.
eg.
{
"took": 23.1,
"timed_out": false,
"_shards": {
"total": 2,
"successful": 2,
"skipped": 0,
"failed": 0
},
"hits": {
"324"
},
"ppl": {
"schema": [...],
"datarows" : [....]
}
}
@dblock wrt how you are defining tenets/ i agree, we should have search first experience, not math for example.
SQL /PPL support almost-all opensearch relevancy features in an easy to use high level language https://github.com/opensearch-project/sql/blob/main/docs/user/beyond/fulltext.rst
this would be the only / most powerful SQL dialect that support all relevancy features in a SQL / Piped language which is on a search engine
agree with @msfroh on the final comment here/ https://github.com/opensearch-project/OpenSearch/issues/12434#issuecomment-2000281781
@vamsi-amazon Thoughts on updating the proposal above with the information discussed?
agree with @msfroh on the final comment of support for the schema/datarows concept
we should also align on the scope and launch criteria of support SQL in _search
endpoint
we should also
@dblock Got busy with 2.13 release for last couple of days. Will update the proposal with the information discussed.
Maybe instead, it should be "How can the
_search
API accommodate different representations of a search request?"
This is truly awesome question and problem statement we should take deeper look if we want to support different query engine.
But here are some thoughts I have after reading the conversation:
@penghuo you have good examples of how SQL doesn't align well with other queries
@navneet1v Does your question boil down to the fact that search needs to be aware of what kind of index it is?
I think both comments are just generic questions of what's constant and what's variable (OO abstractions). For example scroll / pit search does not align with sql pagination syntax
- "scrolling" is a common feature over data, aka there should be a set of interfaces that together implement scrolling, but inputs may differ between SQL (e.g. offset and limit) and DSL (e.g. cursor and size), and engines can have different implementations and effectively perform scrolling functions differently, all while data fetching or distributing requests to shards are common.
Summarizing the recent discussions above and evaluation regarding the integration of SQL into the OpenSearch core.
We explored various strategies listed in the updated description/ for incorporating SQL into the OpenSearch core, ultimately recommending against integration in favor of enhancing client capabilities.
The initial approaches considered involve directly integrating SQL into the _search endpoint or as a new endpoint within the core, each presenting distinct advantages. However, these strategies also face significant drawbacks, including limited parity in DSL query support, non-uniform response structures, increased complexity in core repo, potential build system integration challenges and significantly increased cost in upfront development with no extra customer value over the current approach.
The preferred approach advocates for maintaining the current setup with added transport clients for SQL, presenting a minimal change strategy that ensures compatibility with other plugins.
(Preferred): No Core Integration; Enhance Client Capabilities and Add Cohesion in core dashboards features.
The preferred approach is aiming to introduce a transport API for SQL and a new transport client library. This method ensures SQL compatibility with other plugins while maintaining the current system's integrity and minimizing changes.
“ The most compelling reasons I would see for merging the SQL code into core is if we think that long term the SQL query engine might want to integrate directly into the low-level Lucene query engine and there might benefits to having the query DSL implementation living side-by-side with the SQL implementation. We'd probably want to tease the entire current query DSL engine out of the server module into its own thing, which could then also contain the SQL implementation. However, I don't really see this happening or a need for this anytime soon. SQL depends on and uses the DSL and custom scripting via DSL; It inherits all DSL advances that the core project will make.
If the plan is to keep the SQL implementation as either a plugin or module component in the core repository, then moving the code from one repository to another seems more like an administrative decision than a software architecture one. “/paraphrasing @andrross /
@dblock https://github.com/opensearch-project/OpenSearch-Dashboards/issues/7081 if you have some time to check this one out too?
@kavilla it looks like this proposal was closed?
SQL/PPL via DSL in Search API.
1. Problem Statement.
Today, OpenSearch offers support for SQL and PPL query languages through the plugin endpoints
_plugins/sql
and_plugins/ppl
. However, clients using OpenSearch client libraries face limitations, as these libraries do not accommodate with plugin endpoints.To increase adoption with minimal disruption, our proposal introduces new SQL and PPL clauses directly into the SearchRequest body. This approach aims to facilitate the use of these languages through the Search API, streamlining access and integration for users.2. Summary
_search
API (e.g."result_format":"hit_object"
vs"result_format":"datarow"
) to support existing format or datarows. By default we still usehit_object
._search
API. If you send a SQL query and don’t explicitly specifyresult_format
, the format defaults todatarow
.From user perspective, the following example demonstrate SQL vis DSL request and response.
In this doc, we will discuss detailed design, limitations, and development plan.
3. Tenets:
4. Solution
4.1.Search API
4.1.1. Endpoint
DSL is being used in above APIs majorly and also asynchronous search. Validation exception would be thrown whenever a sql block is encountered in above unsupported APIs.
4.1.2.URL Parameters
All URL parameters are not supported.
4.1.3.Request Body
Sample Request Body:
aggs
parameter, you can define any number of aggregations. Each aggregation is defined by its name and one of the types of aggregations that OpenSearch supports. For more information, see Aggregations._source
field in the response.4.1.4.Response
If query type is SQL, response format is datarows. query response include datarows section. datarows section include schema and datarows, for example
4.2.Feature Parity - SQL vs DSL
Currently, SQL does not fully support all DSL query and aggregations. The following table highlight key query features missing support in SQL
and metrics aggregation function missing in SQL
4.3.Performance
SQL queries through the search endpoint should offer performance comparable to that of DSL queries. Users should not experience any degradation in performance. We use OpenSearch benchmark framework to compare DSL query and SQL via Search.
4.4.Client
All OpenSearch client should support SQL query and datarow response. We have three different types of clients.
4.5.Security
Calls to SQL via _search include index names in the request body, so they have the same access policy considerations as the bulk, mget, and msearch operations.
5. Detailed Design.
5.1 Approach 1: Extend SearchPlugin Interface and Integrate with SQL plugin.
In this approach, we will add a new function to the existing SearchPlugin interface to introduce a new construct called
QueryEngineSpec
. Plugins can implement theSearchPlugin
interface to introduce a query engine that takes over the search request to produce theSearchResponse
.QueryEngineSpec
defines the name of the spec, which will be the key in the DSL under which the respective query engine request parameters are enclosed. Once OpenSearch-core receives a request with a clause containing a key defined by aQueryEngineSpec
, OpenSearch-core creates the Query Engine and transfers the request to the plugin via the Query Engine Object.We would be introducing a new field
data_rows_output
in InternalSearchResponse. SQL, PPL Plugins would populate this field and also other meta information oftook
,shards
,timed_out
information. In case of normal DSL query, hits object would be formulated.