opensearch-project / sql

Query your data using familiar SQL or intuitive Piped Processing Language (PPL)
https://opensearch.org/docs/latest/search-plugins/sql/index/
Apache License 2.0
116 stars 134 forks source link

JOIN support improvement #49

Open dai-chen opened 3 years ago

dai-chen commented 3 years ago

Is your feature request related to a problem? Please describe.

There is limited support for JOIN queries which needs improvement. In particular, post processing such as filtering and aggregation after inner/outer join is most wanted. This capability is required by PPL lookup command.

Here are the feature requests from community:

  1. https://github.com/opendistro-for-elasticsearch/sql/issues/505: Left join
  2. https://github.com/opendistro-for-elasticsearch/sql/issues/351: Multi-join
  3. https://github.com/opendistro-for-elasticsearch/sql/issues/301: Aggregate after join
  4. https://github.com/opendistro-for-elasticsearch/sql/issues/289: Select function after join
  5. https://github.com/opendistro-for-elasticsearch/sql/issues/236: Aggregate after join
  6. https://github.com/opendistro-for-elasticsearch/sql/issues/221: Comma join
  7. https://github.com/opendistro-for-elasticsearch/sql/issues/124: Left join with WHERE
  8. https://github.com/opendistro-for-elasticsearch/sql/issues/110: Aggregate after join
  9. https://github.com/opendistro-for-elasticsearch/sql/issues/987: Multi-join

Describe the solution you'd like N/A

Describe alternatives you've considered N/A

Additional context Add any other context or screenshots about the feature request here.

chloe-zh commented 3 years ago

JOIN with nested loops join algorithm in new engine PoC: https://github.com/chloe-zh/opensearch-sql/tree/poc/join

dai-chen commented 1 year ago

Inquiry: https://github.com/opensearch-project/sql/issues/857

dai-chen commented 1 year ago

Requirements in observability: https://github.com/opensearch-project/sql/issues/892

dai-chen commented 1 year ago

Another option we can explore is to reuse legacy JOIN implementation in v2 engine directly by:

  1. ExprValueRow wraps an ExprValue into legacy Row interface.
  2. LegacyToV2Adapter adapts legacy join operator to behave like a V2 operator.
  3. V2ToLegacyAdapter adapts V2 operator for integrating with legacy join operator.

Untitled Diagram

The main blocker to make this happen is v2 engine doesn't support symbol resolution very well. For example, SELECT t1.name FROM index t1 ..., alias t1 will be removed after analysis. There is no alias in DSL query and value environment. Finally the value is named t1.name by project operator. This will cause problem in analysis and execution in JOIN case which has 2 or more table aliased (See the column names and null value in result below). Anyway, this is something to improve in v2 engine itself instead of problem in this quick workaround.

Test branch: https://github.com/dai-chen/sql-1/tree/test-legacy-join-port

PUT account-test/_doc/1
{ 
  "name": "John",
  "balance": 12000
}

PUT account-test/_doc/2
{ 
  "name": "John",
  "balance": 10000
}

PUT account-test/_doc/3
{ 
  "name": "Allen",
  "balance": 25000
}

PUT customer-test/_doc/1
{ 
  "name": "John",
  "age": 25,
  "city": "Seattle"
}

PUT customer-test/_doc/2
{ 
  "name": "Allen",
  "age": 30,
  "city": "Portland"
}
POST _plugins/_sql
{
  "query": """
    SELECT t1.name, t1.balance, t2.age, t2.city
    FROM account-test t1
    JOIN customer-test t2
    ON t1.name = t2.name
  """
}
{
  "schema": [
    {
      "name": "name",
      "type": "text"
    },
    {
      "name": "balance",
      "type": "long"
    },
    {
      "name": "age",
      "type": "long"
    },
    {
      "name": "city",
      "type": "text"
    }
  ],
  "datarows": [
    [
      "John",
      10000,
      25,
      "Seattle"
    ],
    [
      "John",
      12000,
      25,
      "Seattle"
    ],
    [
      "Allen",
      25000,
      30,
      "Portland"
    ]
  ],
  "total": 3,
  "size": 3,
  "status": 200
}

POST _plugins/_sql
{
  "query": """
    SELECT
     t2.city,
     AVG(t1.balance) AS avgBal
    FROM account-test t1
    JOIN customer-test t2
    ON t1.name = t2.name
    GROUP BY t2.city
    ORDER BY avgBal
  """
}
{
  "schema": [
    {
      "name": "city",
      "type": "text"
    },
    {
      "name": "AVG(t1.balance)",
      "alias": "avgBal",
      "type": "double"
    }
  ],
  "datarows": [
    [
      null,
      11000.0
    ],
    [
      null,
      25000.0
    ]
  ],
  "total": 2,
  "size": 2,
  "status": 200
}
dai-chen commented 1 month ago

https://github.com/opensearch-project/sql/issues/2873