Open dai-chen opened 3 years ago
JOIN with nested loops join algorithm in new engine PoC: https://github.com/chloe-zh/opensearch-sql/tree/poc/join
Requirements in observability: https://github.com/opensearch-project/sql/issues/892
Another option we can explore is to reuse legacy JOIN implementation in v2 engine directly by:
ExprValueRow
wraps an ExprValue
into legacy Row
interface.LegacyToV2Adapter
adapts legacy join operator to behave like a V2 operator.V2ToLegacyAdapter
adapts V2 operator for integrating with legacy join operator.The main blocker to make this happen is v2 engine doesn't support symbol resolution very well. For example, SELECT t1.name FROM index t1 ...
, alias t1
will be removed after analysis. There is no alias in DSL query and value environment. Finally the value is named t1.name
by project operator. This will cause problem in analysis and execution in JOIN case which has 2 or more table aliased (See the column names and null value in result below). Anyway, this is something to improve in v2 engine itself instead of problem in this quick workaround.
Test branch: https://github.com/dai-chen/sql-1/tree/test-legacy-join-port
PUT account-test/_doc/1
{
"name": "John",
"balance": 12000
}
PUT account-test/_doc/2
{
"name": "John",
"balance": 10000
}
PUT account-test/_doc/3
{
"name": "Allen",
"balance": 25000
}
PUT customer-test/_doc/1
{
"name": "John",
"age": 25,
"city": "Seattle"
}
PUT customer-test/_doc/2
{
"name": "Allen",
"age": 30,
"city": "Portland"
}
POST _plugins/_sql
{
"query": """
SELECT t1.name, t1.balance, t2.age, t2.city
FROM account-test t1
JOIN customer-test t2
ON t1.name = t2.name
"""
}
{
"schema": [
{
"name": "name",
"type": "text"
},
{
"name": "balance",
"type": "long"
},
{
"name": "age",
"type": "long"
},
{
"name": "city",
"type": "text"
}
],
"datarows": [
[
"John",
10000,
25,
"Seattle"
],
[
"John",
12000,
25,
"Seattle"
],
[
"Allen",
25000,
30,
"Portland"
]
],
"total": 3,
"size": 3,
"status": 200
}
POST _plugins/_sql
{
"query": """
SELECT
t2.city,
AVG(t1.balance) AS avgBal
FROM account-test t1
JOIN customer-test t2
ON t1.name = t2.name
GROUP BY t2.city
ORDER BY avgBal
"""
}
{
"schema": [
{
"name": "city",
"type": "text"
},
{
"name": "AVG(t1.balance)",
"alias": "avgBal",
"type": "double"
}
],
"datarows": [
[
null,
11000.0
],
[
null,
25000.0
]
],
"total": 2,
"size": 2,
"status": 200
}
Is your feature request related to a problem? Please describe.
There is limited support for JOIN queries which needs improvement. In particular, post processing such as filtering and aggregation after inner/outer join is most wanted. This capability is required by PPL lookup command.
Here are the feature requests from community:
Describe the solution you'd like N/A
Describe alternatives you've considered N/A
Additional context Add any other context or screenshots about the feature request here.