opensearch-project / sql

Query your data using familiar SQL or intuitive Piped Processing Language (PPL)
https://opensearch.org/docs/latest/search-plugins/sql/index/
Apache License 2.0
120 stars 139 forks source link

Allow individual shards to be targeted during query execution [FEATURE] #1478

Open akuzin1 opened 1 year ago

akuzin1 commented 1 year ago

Is your feature request related to a problem?

In MySQL one can retrieve partition information about a table, which can later be used to target specific partitions during query execution.

The following is an example of a query that can be used to retrieve partition information of a specific table.

"SELECT DISTINCT partition_name FROM INFORMATION_SCHEMA.PARTITIONS WHERE TABLE_NAME = <table_name> AND TABLE_SCHEMA = <table_schema> " +
            "AND partition_name IS NOT NULL"

Next is an example of a query targeting a specific partition in a table.

SELECT * FROM table PARTITION (partitionName);

When applying this to opensearch, partitions could be treated as the equivalent of shards for our use case.

What solution would you like? It would be great to be able to treat shards in Opensearch as the equivalent to MySQL Partitions and be able to query individual shards.

What alternatives have you considered? We've considered generating splits based on hashing a key or tuple of keys and then modulo that against some fixed number of splits that we want to generate.

Example:
For example for 3 splits:

split 1:
SELECT *
FROM some_unpartitioned_table
WHERE hash(col1, col2, col3) % 10 == 0

split 2:
SELECT *
FROM some_unpartitioned_table
WHERE hash(col1, col2, col3) % 10 == 1

split 3:
SELECT *
FROM some_unpartitioned_table
WHERE hash(col1, col2, col3) % 10 == 2

However, it doesn't seem like there is a hashing function like that available.

Therefore, the above mentioned solution would be a great behavior to add for all sql users, to more closely mimic the behavior and syntax of MySQL.

dai-chen commented 1 year ago

@akuzin1 Thanks for the feature request! This recall an old request previously: https://github.com/opendistro-for-elasticsearch/sql/issues/1151. @acarbonetto is adding OpenSearch meta field support to our SQL engine. I'm thinking can we support this as part of meta field work.

akuzin1 commented 1 year ago

Sounds good, that would be great to see. Is there an estimated timeline for when this feature would be released? Thank you

dai-chen commented 1 year ago

@acarbonetto @Yury-Fridlyand @MaxKsyunz ^^

acarbonetto commented 1 year ago

This will be a dependency for https://github.com/opensearch-project/sql/issues/1441 The API on OpenSearch side is _routing which we should consider using instead of _shard (or both!)

acarbonetto commented 1 year ago

related: https://github.com/opensearch-project/sql/issues/339

akuzin1 commented 1 year ago

Hi! Saw that the label was changed to "Priority-High" which is great to see, so wanted to check in and see if there is an estimated date for delivering of this feature? Thank you.

acarbonetto commented 1 year ago

Hi! Saw that the label was changed to "Priority-High" which is great to see, so wanted to check in and see if there is an estimated date for delivering of this feature? Thank you.

Target release is 2.10 at the moment. Hoping to get this in a little sooner.

acarbonetto commented 1 year ago

Proposal for setting routing field on the search request. There's two ways forward that I'd like to propose. But first a little contexts:

Goal

The objective of the partition/routing shard is to include the routing ID in the SearchRequest builder.

Example:

new SearchRequest()
        .indices(indexName.getIndexNames())
        .routing(routingId.getIndexNames())
        .scroll(scrollTimeout)
        .source(sourceBuilder);

Getting the routing id(s) from the initial query into pushdown can take one of two obvious routes.

Proposal 1: Request Parameter ("routing")

Syntax addition, includes a new json parameter only available in the V2 engine.

{
    "query": "select _id, _index, _routing, int0 from calcs_routing limit 5",
    "routing": "FURNITURE"
}

This includes an opensearch-similar API to target individual or lists of shards. The string will accept a comma-separated list of shard ID targets.

PoC Architecture Change

To get the request into OpenSearchRequest as part of pushdown requires that we create an AbstractPlan, LogicalPlan and PhysicalPlan operator like the Paginate and LogicalPaginate operators. This would allow us to pushdown a routingId string into the OpenSearchIndexScanQueryBuilder during execution. We would create Partition operators in a similar manner as the Pagination operators without much business logic. Reference to how paginate works: https://github.com/opensearch-project/sql/blob/main/docs/dev/Pagination-v2.md#unresolved-query-plan

Considerations: we may consider combining Paginate and Partition into a single set of operators, and call them something like PushDownConfiguration and push all the parameters at once. This would allow us to scale by configuration values without needing to add more operators.

Pros

Cons

Proposal 2: SQL FROM table PARTITION (key, key, ...) syntax

As defined in https://dev.mysql.com/doc/refman/8.0/en/partitioning-selection.html, MySQL (and other SQL engines) allows a query to target a specific partition by id. This also allows for multiple ids in the PARTITION function.

{
    "query": "select _id, _index, _routing, int0 from calcs_routing PARTITON(\"FUNITURE, OFFICE SPACE\") limit 5"
}

PoC Architecture Change

Update the parser syntax to allow for the PARTITION function in the FROM clause.

Ultimately, to get the routing ID(s) into the SearchRequest, we need to add the PARTITON keys to to IndexScan (which extends Table). Then in the IndexScanBuilder, send the routing ID(s) to the SearchRequest.

To accomplish this, we could push the partition keys into the Table (OpenSearchIndexScan) as partition keys during the analysis phase. On pushdown, the partition keys would be part of the IndexScan and could semi-easily push down to the OpenSearchRequest as part of the Index.

OpenSearchIndex already has the concept of an Index, and Routing Keys would have the same architecture:

Pros

Cons

macohen commented 9 months ago

Will there be a new projected release target for this feature, @acarbonetto? I see it was targeted for 2.11 and 2.12 is being released soon. Can this make 2.13, perhaps?