opensearch-project / sql

Query your data using familiar SQL or intuitive Piped Processing Language (PPL)
https://opensearch.org/docs/latest/search-plugins/sql/index/
Apache License 2.0
121 stars 140 forks source link

[FEATURE] OpenSearch SQL/PPL multiple indices support #3099

Open penghuo opened 1 month ago

penghuo commented 1 month ago

Is your feature request related to a problem? Currently, OpenSearch SQL/PPL support query multiple indices. for instance

The OpenSearch _search API supports comma-separated lists of indices, wildcard (*) expressions, and table relations in the source and from clauses, following the same conventions. This applies to both OpenSearch SQL and PPL. However, the case-2 grammar contradicts the SQL comma join syntax.

Related issue, https://github.com/opensearch-project/sql/issues/51 Early discussion, https://github.com/opensearch-project/opensearch-spark/pull/775

What solution would you like? I propose a nameing conveing to map opensearch index to table relation. You can use a comma-separated list indices, remote index as table relation, for instance index1,remote:test2,test3. Index name could also include wildcard () and question mark (?) to target index that match a pattern: index?, ind. In SQL and PPL, if a table name or relation includes special characters (e.g., ','), it should be enclosed in backticks (`).

tableRelation := indexNameExpression [,indexNameExpression]
indexNameExpression := [-][cluster:]:[localIndexNameExpression]
localIndexNameExpression := [*?\w\d]
LantaoJin commented 1 month ago

Copy some context in comment to here:

PPL on OpenSearch supports:

  1. source=accounts, account2
  2. source=`accounts`,`account2`
  3. source=`account*`
  4. source=`accounts, account2`

But PPL on Spark supports the first two currently. I think the third one (wildcard) should be supported in PPL on Spark too. But I would suggest to mark the fourth one as invalid since users will treat the content in backticks as a whole as usual. We have some examples from real, users from CloudWatch provided a table pattern:

`_CWLBasic`.default.`Canary/service.log`

And quote their requests:

So if query is select a.b.Service.Log where service.log is table name I am sure sql will also split everything by dot and consider only Log as table name while preceding stuff as namespace. However any entity enclosed in backticks sql considers as a single entity. So if query is select a.b.’Service.log’ where table name is enclosed in backticks sql will only consider a and b as name space and ignore everything which is within backticks

So I think sql considers everything enclosed in backticks as a single entity. `Canary/service.log` should be one whole.

My suggestion is treating content in backticks as a whole. For instance:

YANG-DB commented 1 month ago

@penghuo I agree with @LantaoJin - we should support content in backticks as a whole