LantaoJin opened this issue 2 days ago
I found the current draft PR still has some problems, and moreover, providing this enhancement is somewhat tricky.
As we know, the `eval` command adds new columns to a plan by adding a Spark `Project` node. More specifically, one comma-separated eval expression adds `Project(*, 'a, 'b, 'c, child)`, while multiple eval expressions add

```
Project(*, 'a, child)
+- Project(*, 'b, child)
   +- Project(*, 'c, child)
```

This plan will be optimized to `Project(*, 'a, 'b, 'c, child)` by the Spark optimizer.
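The collapsing step above can be modeled with a toy sketch (plain Python, not actual Catalyst code; `Project`, `eval_command`, and `collapse_projects` are illustrative names, and the merged column order in this toy rule may differ from Spark's):

```python
# Minimal sketch: each `eval` wraps the child plan in a Project that keeps
# all existing columns ('*') plus the new one. A simplified analogue of
# Spark's CollapseProject rule then merges adjacent Projects into one.
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Project:
    exprs: Tuple[str, ...]   # projection list; "*" keeps the child's columns
    child: object            # child plan node (another Project or a relation)

def eval_command(child, new_col):
    """Model of one eval expression: Project(*, new_col, child)."""
    return Project(("*", new_col), child)

def collapse_projects(plan):
    """Merge a Project over a Project when the parent keeps the child's
    whole output via '*' (a simplified CollapseProject)."""
    if isinstance(plan, Project) and "*" in plan.exprs:
        child = collapse_projects(plan.child)
        if isinstance(child, Project):
            merged = child.exprs + tuple(e for e in plan.exprs if e != "*")
            return Project(merged, child.child)
    return plan

# | eval c = ... | eval b = ... | eval a = ...  (innermost eval first)
plan = eval_command(eval_command(eval_command("table", "'c"), "'b"), "'a")
print(collapse_projects(plan))  # one Project over the base relation
```

Three nested single-column Projects collapse into one Project carrying all three new columns, which is the optimized shape shown above.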
Note that the `*` we add at the front of each projection list ensures attribute references can be resolved, since the plan we build is an unresolved logical plan that will be analyzed later. For example, the attribute reference `x` cannot be resolved if we remove the `*` from the `Project` nodes added by eval commands:

```
Project('x, child)
+- Project('a, child)
   +- Project('b, child)
      +- Project('c, child)
```
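Why `*` matters can be sketched with a toy output-column propagation (illustrative names only; real attribute resolution in the Spark analyzer is far richer):

```python
# Toy model: compute each Project's output columns bottom-to-top.
# "*" forwards the child's entire output; any other entry contributes
# only that one column. A reference resolves only if it survives into
# the child's output.
def outputs(projections, base_cols):
    cols = set(base_cols)
    for exprs in projections:       # bottom-to-top projection lists
        nxt = set()
        for e in exprs:
            nxt |= cols if e == "*" else {e}
        cols = nxt
    return cols

base = {"x"}                                        # base relation has column x
with_star = outputs([("*", "a"), ("*", "b"), ("*", "c")], base)
without   = outputs([("a",), ("b",), ("c",)], base)
print("x" in with_star)   # True:  Project('x, ...) on top can resolve x
print("x" in without)     # False: x was dropped, resolution fails
```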
Now, this feature aims to support aggregate functions in `eval` expressions. For example, the command

```
| eval max_a = max(a) | eval min_a = min(a) | eval count_a = count(a) | ...
```

will build a plan like this:

```
Project(*, 'max(a), child)
+- Project(*, 'min(a), child)
   +- Project(*, 'count(a), child)
```

Unfortunately, the Spark analyzer will resolve these `Project` nodes by translating them to `Aggregate`:

```
Aggregate([], [*, 'max(a)], child)
+- Aggregate([], [*, 'min(a)], child)
   +- Aggregate([], [*, 'count(a)], child)
```
Then the resolution fails, because the aggregate expression list `[*, 'max(a)]` contains a non-aggregate expression (`*`) while the grouping expression list `[]` is empty.
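The failing check can be sketched as follows (a hedged toy model; `check_aggregate` and `AGG_FUNCS` are illustrative names, not Spark's actual API):

```python
# Toy version of the analyzer's complaint: in an Aggregate, every
# expression must either be an aggregate function or appear in the
# grouping list. With an empty grouping list, '*' satisfies neither.
AGG_FUNCS = {"max", "min", "count", "sum", "avg"}

def is_aggregate(expr):
    return any(expr.startswith(f + "(") for f in AGG_FUNCS)

def check_aggregate(grouping, exprs):
    for e in exprs:
        if not is_aggregate(e) and e not in grouping:
            raise ValueError(
                f"expression '{e}' is neither an aggregate function "
                f"nor present in the grouping list {grouping}")

check_aggregate([], ["max(a)", "min(a)"])   # all aggregates: passes

try:
    check_aggregate([], ["*", "max(a)"])    # mixes '*' with an aggregate
except ValueError as e:
    print("analysis error:", e)
```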
So the only case that works is putting a single comma-separated eval expression in the second-to-last position:

```
| eval max_a = max(a), min_a = min(a), count_a = count(a) | fields max_a, min_a, count_a
```

This design is tricky and fragile, and it can always be rewritten as

```
| stats max(a) as max_a, min(a) as min_a, count(a) as count_a
```

So I think we should not continue with this feature, since the same functionality is available through `stats`.
@YANG-DB @penghuo what do you think?
@LantaoJin thanks for this deep analysis - I agree that if we can support the same functionality in `stats`, we can ignore this issue for now - just please document this so that we have it in a common place and readers won't need to search our issues to understand why ...
Sure, I will give a doc PR later.
What is the bug?

Aggregate functions can work in a `select` clause even when there is no `group by`, as long as all items in the `select` list are aggregate functions. Here are some examples:

But the following queries should throw exceptions:

They could work with a `group by`.

Similarly, aggregate functions in PPL can work in the `eval` command, because an eval expression is equivalent to adding a projection to the existing project list. Here are some examples:

But the following PPL queries should throw exceptions:
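The SQL rule stated above can be illustrated with Python's stdlib `sqlite3` rather than Spark (an assumption for portability; note that SQLite is more lenient than standard SQL about mixing bare columns with aggregates, so only the passing case is shown):

```python
# A SELECT with no GROUP BY succeeds when every item in the select list
# is an aggregate function.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])

row = conn.execute(
    "SELECT max(a) AS max_a, min(a) AS min_a, count(a) AS count_a FROM t"
).fetchone()
print(row)  # (3, 1, 3)
```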
How can one reproduce the bug?

The query fails with a syntax error: