Closed: qianheng-aws closed this PR 2 days ago
High-level question: why did the solution change to adding aliases instead of distinguishing by struct_col.field1
and struct_col2.field1
? The second solution seems more graceful.
source=table | flatten struct as subfield
seems quite confusing.
I'd like to change the PPL design to
source=table | flatten struct rename origin_subfield1 as new_subfield1, origin_subfield2 as new_subfield2, ...
even in the alias solution.
@qianheng-aws Please keep this PR in DRAFT until the PPL design review has passed.
High-level question: why did the solution change to adding aliases instead of distinguishing by
struct_col.field1
and struct_col2.field1
? The second solution seems more graceful.
We cannot do that in the parsing phase. That is to say, we don't know the actual fields inside the struct field at parse time. So I gave up on that solution and chose to support an alias syntax instead of generating aliases automatically in our parser; users can still use aliases to avoid duplicate column names in their PPL.
And as said in the issue https://github.com/opensearch-project/opensearch-spark/issues/911#issuecomment-2478550019, it's actually a common issue for async-query, not just for flatten. I think supporting aliases is a more appropriate way to address such issues.
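To make the duplicate-column problem concrete, here is a minimal sketch in plain Python (not the plugin's code; `flatten` and the row layout are hypothetical stand-ins): flattening two struct columns that share a field name collides, while user-supplied aliases keep both values.

```python
# Illustrative sketch only: flattening two structs that both contain
# "field1" produces duplicate column names unless aliases are supplied.
def flatten(row, struct_col, aliases=None):
    """Replace `struct_col` with its subfields; `aliases`, if given,
    renames the subfields in declaration order (mirroring the proposed
    alias syntax)."""
    out = {k: v for k, v in row.items() if k != struct_col}
    subfields = row[struct_col]
    names = aliases if aliases else list(subfields)
    for name, value in zip(names, subfields.values()):
        out[name] = value
    return out

row = {"struct_col": {"field1": 1}, "struct_col2": {"field1": 2}}

# Without aliases, the second flatten overwrites the first "field1".
collided = flatten(flatten(row, "struct_col"), "struct_col2")
assert collided == {"field1": 2}  # the first field1 is silently lost

# With aliases, both values survive under distinct names.
resolved = flatten(
    flatten(row, "struct_col", aliases=["f1_a"]),
    "struct_col2", aliases=["f1_b"],
)
assert resolved == {"f1_a": 1, "f1_b": 2}
```

The dict model is a toy, but the collision it shows is exactly the duplicate-column-name issue the alias syntax is meant to avoid.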
source=table | flatten struct as subfield
seems quite confusing. I'd like to change the PPL design to source=table | flatten struct rename origin_subfield1 as new_subfield1, origin_subfield2 as new_subfield2, ...
even in the alias solution.
Then we need to figure out another way to pass that field mapping to the Spark operator Generate
. Unfortunately, Spark doesn't have such a mechanism; it only supports aliasing the original operator's output in sequence.
source=table | flatten struct as subfield
seems quite confusing. I'd like to change the PPL design to source=table | flatten struct rename origin_subfield1 as new_subfield1, origin_subfield2 as new_subfield2, ...
even in the alias solution. Then we need to figure out another way to pass that field mapping to the Spark operator
Generate
. Unfortunately, Spark doesn't have such a mechanism; it only supports aliasing the original operator's output in sequence.
OK, I see now. So for the column struct_col STRUCT<field1: STRUCT<subfield: STRING>, field2: INT>
case, we must add an alias sequence such as | flatten struct_col as (field1_1, field2_2)
.
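The positional constraint discussed above can be sketched in a few lines of plain Python (illustrative only; the function name is hypothetical): since Spark's Generate only lets aliases be applied to the generator's output columns in order, the alias list must match the flattened columns one-to-one by position, not by name.

```python
# Sketch of the positional aliasing constraint: aliases apply to the
# flattened output columns in sequence, not via a name-based mapping.
def apply_sequential_aliases(output_columns, aliases):
    if len(aliases) != len(output_columns):
        raise ValueError("alias count must match the flattened column count")
    return dict(zip(output_columns, aliases))

# struct_col STRUCT<field1: ..., field2: INT> flattens to two columns,
# so `| flatten struct_col as (field1_1, field2_2)` supplies exactly two
# aliases, matched by position.
mapping = apply_sequential_aliases(["field1", "field2"], ["field1_1", "field2_2"])
assert mapping == {"field1": "field1_1", "field2": "field2_2"}
```

This is why a `rename origin as new` form would need extra plumbing: there is no place to hand Spark a name-keyed mapping, only an ordered alias list.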
The basic approach works for me. Only some minor comments. @YANG-DB please take a look.
Description
Support flatten with alias, e.g. source=table | flatten struct_col as (field1_1, field2_2)
Related Issues
Resolve #911
We can use aliases to avoid duplicate column names in the final result.
Check List
--signoff
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.