opendistro-for-elasticsearch / sql

🔍 Open Distro SQL Plugin
https://opendistro.github.io/for-elasticsearch/features/SQL%20Support.html
Apache License 2.0
621 stars 186 forks source link

Feature request: support group by field, and filter with left join index's attribute #301

Open seraphjiang opened 4 years ago

seraphjiang commented 4 years ago

Problem

I want to be able to find list of issue, filter by attribute from another left join table.

I have two index, issue and issue_detail, they share the same key issueId. issue_detail has additional attribute - priority i'd like to use to filter out result.

see below example

SQL Query

select i.issueId from [issue] i 
  left join [issue_detail] d 
  on i.issueId = d.issueId 
where d.priority=2
group by i.issueId

OD-SQL Query

GET _opendistro/_sql?format=csv
{
  "query": "select i.issueId from [issue] i left join [issue_detail] d on i.issueId = d.issueId where d.priority=2 group by i.issueId"
}

Actual result:

i.issueId
00003
00002
00001
00004

Expect result:

i.issueId
00002
00004

Testing Data

POST _bulk
{ "index" : { "_index" : "issue", "_id" : "1" } }
{ "issueId" : "00001" }
{ "index" : { "_index" : "issue", "_id" : "2" } }
{ "issueId" : "00002" }
{ "index" : { "_index" : "issue", "_id" : "3" } }
{ "issueId" : "00003" }
{ "index" : { "_index" : "issue", "_id" : "4" } }
{ "issueId" : "00004" }

POST _bulk
{ "index" : { "_index" : "issue_detail", "_id" : "1" } }
{ "issueId":"00001", "priority" : 1 }
{ "index" : { "_index" : "issue_detail", "_id" : "2" } }
{ "issueId":"00002", "priority" : 2 }
{ "index" : { "_index" : "issue_detail", "_id" : "3" } }
{ "issueId":"00003", "priority" : 3 }
{ "index" : { "_index" : "issue_detail", "_id" : "4" } }
{ "issueId":"00004", "priority" : 2 }
chloe-zh commented 4 years ago

Thanks for the report! I have reproduced this output and DSL was printed as follows:

{
  "Physical Plan" : {
    "Project [ columns=[i.issueId] ]" : {
      "Top [ count=200 ]" : {
        "BlockHashJoin[ conditions=( i.issueId = d.issueId ), type=LEFT_OUTER_JOIN, blockSize=[FixedBlockSize with size=10000] ]" : {
          "Scroll [ issue_detail as d, pageSize=10000 ]" : {
            "request" : {
              "size" : 200,
              "query" : {
                "bool" : {
                  "filter" : [
                    {
                      "bool" : {
                        "adjust_pure_negative" : true,
                        "must" : [
                          {
                            "term" : {
                              "priority" : {
                                "boost" : 1,
                                "value" : 2
                              }
                            }
                          }
                        ],
                        "boost" : 1
                      }
                    }
                  ],
                  "adjust_pure_negative" : true,
                  "boost" : 1
                }
              },
              "from" : 0
            }
          },
          "Scroll [ issue as i, pageSize=10000 ]" : {
            "request" : {
              "size" : 200,
              "from" : 0,
              "_source" : {
                "excludes" : [ ],
                "includes" : [
                  "issueId"
                ]
              }
            }
          },
          "useTermsFilterOptimization" : false
        }
      }
    }
  },
  "description" : "Hash Join algorithm builds hash table based on result of first query, and then probes hash table to find matched rows for each row returned by second query",
  "Logical Plan" : {
    "Project [ columns=[i.issueId] ]" : {
      "Top [ count=200 ]" : {
        "Join [ conditions=( i.issueId = d.issueId ) type=LEFT_OUTER_JOIN ]" : {
          "Group" : [
            {
              "Project [ columns=[i.issueId] ]" : {
                "TableScan" : {
                  "tableAlias" : "i",
                  "tableName" : "issue"
                }
              }
            },
            {
              "Project [ columns=[d.issueId] ]" : {
                "Filter [ conditions=[AND ( AND priority EQ 2 ) ] ]" : {
                  "TableScan" : {
                    "tableAlias" : "d",
                    "tableName" : "issue_detail"
                  }
                }
              }
            }
          ]
        }
      }
    }
  }
}

As we can see in the physical plan, the filter was pushed into only the left table, which causes problem to the LEFT JOIN queries with filters. Currently we do not support filter and aggregate on the result of LEFT JOINs due to the limitation of ES. Related issues: #245 #236 #289 #124 #110