JOIN support improvement

dai-chen commented 3 years ago

Is your feature request related to a problem? Please describe.

There is limited support for JOIN queries which needs improvement. In particular, post processing such as filtering and aggregation after inner/outer join is most wanted. This capability is required by PPL lookup command.

Here are the feature requests from community:

https://github.com/opendistro-for-elasticsearch/sql/issues/505: Left join
https://github.com/opendistro-for-elasticsearch/sql/issues/351: Multi-join
https://github.com/opendistro-for-elasticsearch/sql/issues/301: Aggregate after join
https://github.com/opendistro-for-elasticsearch/sql/issues/289: Select function after join
https://github.com/opendistro-for-elasticsearch/sql/issues/236: Aggregate after join
https://github.com/opendistro-for-elasticsearch/sql/issues/221: Comma join
https://github.com/opendistro-for-elasticsearch/sql/issues/124: Left join with WHERE
https://github.com/opendistro-for-elasticsearch/sql/issues/110: Aggregate after join
https://github.com/opendistro-for-elasticsearch/sql/issues/987: Multi-join

Describe the solution you'd like N/A

Describe alternatives you've considered N/A

Additional context Add any other context or screenshots about the feature request here.

chloe-zh commented 3 years ago

JOIN with nested loops join algorithm in new engine PoC: https://github.com/chloe-zh/opensearch-sql/tree/poc/join

dai-chen commented 1 year ago

Inquiry: https://github.com/opensearch-project/sql/issues/857

dai-chen commented 1 year ago

Requirements in observability: https://github.com/opensearch-project/sql/issues/892

dai-chen commented 1 year ago

Another option we can explore is to reuse legacy JOIN implementation in v2 engine directly by:

ExprValueRow wraps an ExprValue into legacy Row interface.
LegacyToV2Adapter adapts legacy join operator to behave like a V2 operator.
V2ToLegacyAdapter adapts V2 operator for integrating with legacy join operator.

Untitled Diagram

Pros:
- Minimum efforts to get JOIN work in v2 engine to a) support lookup in PPL; b) fix most of the issue above.
- Legacy code was basically functional tested and performance tested.
Cons:
- May have unknown issue with certain data type or functions.
- Still single node implementation

The main blocker to make this happen is v2 engine doesn't support symbol resolution very well. For example, SELECT t1.name FROM index t1 ..., alias t1 will be removed after analysis. There is no alias in DSL query and value environment. Finally the value is named t1.name by project operator. This will cause problem in analysis and execution in JOIN case which has 2 or more table aliased (See the column names and null value in result below). Anyway, this is something to improve in v2 engine itself instead of problem in this quick workaround.

Test branch: https://github.com/dai-chen/sql-1/tree/test-legacy-join-port

PUT account-test/_doc/1
{ 
  "name": "John",
  "balance": 12000
}

PUT account-test/_doc/2
{ 
  "name": "John",
  "balance": 10000
}

PUT account-test/_doc/3
{ 
  "name": "Allen",
  "balance": 25000
}

PUT customer-test/_doc/1
{ 
  "name": "John",
  "age": 25,
  "city": "Seattle"
}

PUT customer-test/_doc/2
{ 
  "name": "Allen",
  "age": 30,
  "city": "Portland"
}

POST _plugins/_sql
{
  "query": """
    SELECT t1.name, t1.balance, t2.age, t2.city
    FROM account-test t1
    JOIN customer-test t2
    ON t1.name = t2.name
  """
}
{
  "schema": [
    {
      "name": "name",
      "type": "text"
    },
    {
      "name": "balance",
      "type": "long"
    },
    {
      "name": "age",
      "type": "long"
    },
    {
      "name": "city",
      "type": "text"
    }
  ],
  "datarows": [
    [
      "John",
      10000,
      25,
      "Seattle"
    ],
    [
      "John",
      12000,
      25,
      "Seattle"
    ],
    [
      "Allen",
      25000,
      30,
      "Portland"
    ]
  ],
  "total": 3,
  "size": 3,
  "status": 200
}

POST _plugins/_sql
{
  "query": """
    SELECT
     t2.city,
     AVG(t1.balance) AS avgBal
    FROM account-test t1
    JOIN customer-test t2
    ON t1.name = t2.name
    GROUP BY t2.city
    ORDER BY avgBal
  """
}
{
  "schema": [
    {
      "name": "city",
      "type": "text"
    },
    {
      "name": "AVG(t1.balance)",
      "alias": "avgBal",
      "type": "double"
    }
  ],
  "datarows": [
    [
      null,
      11000.0
    ],
    [
      null,
      25000.0
    ]
  ],
  "total": 2,
  "size": 2,
  "status": 200
}

dai-chen commented 1 month ago

https://github.com/opensearch-project/sql/issues/2873

opensearch-project / sql

JOIN support improvement #49