opensearch-project / opensearch-spark

Spark Accelerator framework; it enables secondary indices on remote data stores.
Apache License 2.0

[FEATURE] Support aggregate functions in Eval expressions #755

Open LantaoJin opened 2 days ago

LantaoJin commented 2 days ago

What is the bug? Aggregate functions can work in a select clause even without a group by, as long as all items in the select list are aggregate functions. Here are some examples:

select max(a) from t
select max(a), min(a), count(a) from t

But the following queries should throw exceptions:

select a, max(a) from t
select a, max(a), min(a), count(a) from t
select *, max(a), min(a), count(a) from t

They can work with a group by:

select a, max(a) from t group by b
select a, max(a), min(a), count(a) from t group by b
select *, max(a), min(a), count(a) from t group by b
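The rule the SQL examples above illustrate can be sketched as a small validity check. This is a hypothetical helper for illustration only (not code from this repository), using a toy string-based classification of select items:

```python
# Hypothetical sketch (not repository code) of the rule illustrated above:
# a select list that contains aggregate functions is valid only if every
# item is an aggregate, or a GROUP BY clause is present.
AGG_FUNCTIONS = {"max", "min", "count", "sum", "avg"}

def is_aggregate(item: str) -> bool:
    # Treat "max(a)" etc. as aggregates; plain columns and "*" are not.
    return "(" in item and item.split("(", 1)[0] in AGG_FUNCTIONS

def select_list_is_valid(select_items: list, has_group_by: bool) -> bool:
    has_agg = any(is_aggregate(i) for i in select_items)
    return (not has_agg) or has_group_by or all(is_aggregate(i) for i in select_items)

print(select_list_is_valid(["max(a)", "min(a)"], has_group_by=False))  # True
print(select_list_is_valid(["a", "max(a)"], has_group_by=False))       # False
print(select_list_is_valid(["a", "max(a)"], has_group_by=True))        # True
```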

Similarly, aggregate functions in PPL could work in the eval command, because an eval expression is equivalent to adding a projection to the existing project list. Here are some examples:

source=t | eval m = max(a) | fields m
source=t | eval m = max(a), n = count(a) | fields m, n
source=t | eval m = max(a) | eval n = count(a) | fields m, n

But the following PPL queries should throw exceptions:

source=t | eval m = max(a) -- equivalent to selecting all fields plus m
source=t | eval m = max(a), n = count(a) | fields a, m, n
source=t | eval m = max(a) | eval n = count(a) | fields a, m, n

How can one reproduce the bug? The following query fails with a syntax error:

source=t | eval m = max(a) | fields m
LantaoJin commented 1 day ago

I found the current draft PR still has some problems. Furthermore, it is rather tricky to provide this enhancement. As we know, the eval command adds new columns to a plan by adding a Spark Project node. More specifically, one comma-separated eval expression adds Project(*, 'a, 'b, 'c, child), while multiple eval commands add

Project(*, 'a, child)
+- Project(*, 'b, child)
     +- Project(*, 'c, child)

This plan will be optimized to Project(*, 'a, 'b, 'c, child) by the Spark optimizer. Note that the * we add to each project list is there to make sure attribute references can be resolved, since the plan we build is an unresolved logical plan that will be analyzed later. For example, the attribute reference x cannot be resolved if we remove the * from the Project nodes added by the eval commands:

Project('x, child)
+- Project('a, child)
    +- Project('b, child)
         +- Project('c, child)
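The collapsing behavior described above can be modeled with a few toy case classes. This is a minimal model for illustration (not Catalyst code), mirroring the effect of Spark's CollapseProject optimizer rule on nested Project nodes that each start with *:

```python
# Minimal model (not Catalyst code) of how the nested Project nodes produced by
# consecutive eval commands collapse into a single Project, mirroring the effect
# of Spark's CollapseProject optimizer rule on the plans shown above.
from dataclasses import dataclass
from typing import Union

@dataclass
class Relation:
    name: str

@dataclass
class Project:
    exprs: list                        # e.g. ["*", "'a"]: "*" plus the expression eval added
    child: Union["Project", Relation]

def collapse(plan):
    # Splice a child Project's added expressions into its parent
    # when both project lists start with "*".
    if isinstance(plan, Project) and isinstance(plan.child, Project):
        child = collapse(plan.child)
        if plan.exprs[:1] == ["*"] and child.exprs[:1] == ["*"]:
            return Project(["*"] + plan.exprs[1:] + child.exprs[1:], child.child)
        return Project(plan.exprs, child)
    return plan

nested = Project(["*", "'a"], Project(["*", "'b"], Project(["*", "'c"], Relation("t"))))
print(collapse(nested))
# Project(exprs=['*', "'a", "'b", "'c"], child=Relation(name='t'))
```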

Now, this feature aims to support aggregate functions in eval expressions. For example, the command:

| eval max_a = max(a) | eval min_a = min(a) | eval count_a = count(a) | ...

will build a plan like this:

Project(*, 'max(a), child)
+- Project(*, 'min(a), child)
     +- Project(*, 'count(a), child)

Unfortunately, the Spark analyzer resolves these Project nodes by translating them to Aggregate nodes:

Aggregate([], [*, 'max(a)], child)
+- Aggregate([], [*, 'min(a)], child)
     +- Aggregate([], [*, 'count(a)], child)

Then resolution fails because the aggregate expression list [*, 'max(a)] contains a non-aggregate expression (*) while the grouping expression list [] is empty.
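The failing check can be sketched as follows. This is a hypothetical simplification for illustration (not the real Spark analyzer): every expression in an Aggregate must either be an aggregate function or appear in the grouping list, so Aggregate([], [*, 'max(a)]) cannot resolve:

```python
# Hypothetical sketch (not the real Spark analyzer) of the check described
# above: each expression in an Aggregate must be an aggregate function or
# appear in the grouping list, so Aggregate([], [*, 'max(a)]) fails.
AGG_FUNCTIONS = {"max", "min", "count", "sum", "avg"}

def is_aggregate(expr: str) -> bool:
    return "(" in expr and expr.split("(", 1)[0] in AGG_FUNCTIONS

def check_aggregate(grouping: list, exprs: list) -> None:
    for e in exprs:
        if not is_aggregate(e) and e not in grouping:
            raise ValueError(
                f"expression '{e}' is neither aggregated nor in the grouping list")

check_aggregate([], ["max(a)", "min(a)"])  # resolves: all aggregates
check_aggregate(["b"], ["b", "max(a)"])    # resolves: b is in the grouping list
try:
    check_aggregate([], ["*", "max(a)"])   # fails: * with an empty grouping list
except ValueError as e:
    print(e)  # expression '*' is neither aggregated nor in the grouping list
```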

So the only successful case is putting a single comma-separated eval expression in the second-to-last position, immediately before fields:

| eval max_a = max(a), min_a = min(a), count_a = count(a) | fields max_a, min_a, count_a

This design is tricky and fragile, especially since the command can simply be rewritten as

| stats max(a) as max_a, min(a) as min_a, count(a) as count_a
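The suggested rewrite is mechanical: each eval assignment `name = agg(col)` maps to `agg(col) as name` in a stats command. A hypothetical helper (not repository code) sketching that transformation:

```python
import re

# Hypothetical rewrite (not repository code): turn a chain of all-aggregate
# eval assignments into the equivalent stats command suggested above.
ASSIGN = re.compile(r"(\w+)\s*=\s*(\w+\(\w+\))")

def eval_to_stats(ppl: str) -> str:
    pairs = ASSIGN.findall(ppl)  # [(alias, "agg(col)"), ...]
    return "| stats " + ", ".join(f"{expr} as {name}" for name, expr in pairs)

print(eval_to_stats("| eval max_a = max(a), min_a = min(a), count_a = count(a)"))
# | stats max(a) as max_a, min(a) as min_a, count(a) as count_a
```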

So I think we should not continue this feature, since the same result can be achieved with stats. @YANG-DB @penghuo what do you think?

YANG-DB commented 19 hours ago

@LantaoJin thanks for this deep analysis - I agree that if we can support the same functionality in stats, we can ignore this issue for now - just please document this so that we have it in a common place and readers won't need to search our issues to understand why ...

LantaoJin commented 18 hours ago

@LantaoJin thanks for this deep analysis - I agree that if we can support the same functionality in stats, we can ignore this issue for now - just please document this so that we have it in a common place and readers won't need to search our issues to understand why ...

Sure, I will give a doc PR later.