[FEATURE] OpenSearch SQL/PPL multiple indices support

penghuo commented 2 weeks ago

Is your feature request related to a problem? Currently, OpenSearch SQL/PPL support query multiple indices. for instance

case-1, source=acc | stats count() / select count() from acc
case-2, source=accounts, account2 | stats count() / select count() from accounts,account2
case-3, ``source=accounts,account2| stats count(); / select count fromaccounts,account2````

The OpenSearch _search API supports comma-separated lists of indices, wildcard (*) expressions, and table relations in the source and from clauses, following the same conventions. This applies to both OpenSearch SQL and PPL. However, the case-2 grammar contradicts the SQL comma join syntax.

What solution would you like? I propose a nameing conveing to map opensearch index to table relation. You can use a comma-separated list indices, remote index as table relation, for instance index1,remote:test2,test3. Index name could also include wildcard () and question mark (?) to target index that match a pattern: index?, ind. In SQL and PPL, if a table name or relation includes special characters (e.g., ','), it should be enclosed in backticks (`).

tableRelation := indexNameExpression [,indexNameExpression]
indexNameExpression := [-][cluster:]:[localIndexNameExpression]
localIndexNameExpression := [*?\w\d]

LantaoJin commented 2 weeks ago

Copy some context in comment to here:

PPL on OpenSearch supports:

source=accounts, account2
source=`accounts`,`account2`
source=`account*`
source=`accounts, account2`

But PPL on Spark supports the first two currently. I think the third one (wildcard) should be supported in PPL on Spark too. But I would suggest to mark the fourth one as invalid since users will treat the content in backticks as a whole as usual. We have some examples from real, users from CloudWatch provided a table pattern:

`_CWLBasic`.default.`Canary/service.log`

And quote their requests:

So if query is select a.b.Service.Log where service.log is table name I am sure sql will also split everything by dot and consider only Log as table name while preceding stuff as namespace. However any entity enclosed in backticks sql considers as a single entity. So if query is select a.b.’Service.log’ where table name is enclosed in backticks sql will only consider a and b as name space and ignore everything which is within backticks

So I think sql considers everything enclosed in backticks as a single entity. `Canary/service.log` should be one whole.

My suggestion is treating content in backticks as a whole. For instance:

:white_check_mark: source=`catalog`.`namespace`.`index-2024*`, `catalog`.`namespace`.`index-2023-12`
:white_check_mark: source=`catalog`.`namespace`.index-2024*, index-2023-12
:x: source=`catalog`.`namespace`.`index-2024*, index-2023-12`

YANG-DB commented 2 weeks ago

@penghuo I agree with @LantaoJin - we should support content in backticks as a whole

opensearch-project / sql

[FEATURE] OpenSearch SQL/PPL multiple indices support #3099