opensearch-project / opensearch-spark

Spark Accelerator framework; it enables secondary indices on remote data stores.
Apache License 2.0

Support flatten with alias #927

Closed. qianheng-aws closed this pull request 2 days ago.

qianheng-aws commented 4 days ago

Description

Support flatten with an alias, e.g.:

source=table | flatten coor as (altitude, latitude, longitude)

source=table | flatten struct as subfield 

Related Issues

Resolve #911

We can use aliases to avoid duplicate column names in the final result.
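
For illustration, here is a rough Spark-side sketch (hypothetical schema, not this plugin's actual implementation) of the duplicate-column problem and how the proposed alias syntax avoids it:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("flatten-alias-sketch").getOrCreate()
import spark.implicits._

// Two struct columns that happen to share the same inner field names (_1, _2).
val df = Seq(((1, "a"), (2, "b"))).toDF("s1", "s2")

// Flattening both without aliases yields duplicate column names, so a later
// reference like noAlias.select("_1") fails with an ambiguity error.
val noAlias = df.select(col("s1.*"), col("s2.*"))

// With aliases, as the proposed PPL would express it
// (source=table | flatten s1 as (id1, name1) | flatten s2 as (id2, name2)),
// every output column stays unique.
val withAlias = df.select(
  col("s1._1").as("id1"), col("s1._2").as("name1"),
  col("s2._1").as("id2"), col("s2._2").as("name2"))
withAlias.printSchema()
```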

Check List

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

LantaoJin commented 4 days ago

High-level question: why did the solution change to adding aliases instead of distinguishing by struct_col.field1 and struct_col2.field1? The second solution seems more graceful.

LantaoJin commented 4 days ago

source=table | flatten struct as subfield seems quite confusing. I'd like to change the PPL design to source=table | flatten struct rename origin_subfield1 as new_subfield1, origin_subfield2 as new_subfield2, ... even in the alias solution.

LantaoJin commented 4 days ago

@qianheng-aws Please keep this PR as a draft until the PPL design review has passed.

qianheng-aws commented 4 days ago

High-level question: why did the solution change to adding aliases instead of distinguishing by struct_col.field1 and struct_col2.field1? The second solution seems more graceful.

We cannot do that in the parsing phase. That is to say, we don't know the actual fields inside the struct at that point. So I gave up on that solution and chose to support an alias syntax instead of generating aliases automatically inside our parser; users can still use aliases to avoid duplicate column names in their PPL.

And as mentioned in the issue https://github.com/opensearch-project/opensearch-spark/issues/911#issuecomment-2478550019, it's actually a common issue for async-query, not just for flatten. I think supporting aliases is a more appropriate way to address such issues.
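
To make the parse-time limitation concrete, a small sketch (hypothetical, using the schema API rather than the PPL parser): the inner field names of a struct only become visible once the table is resolved against the catalog, i.e. at analysis time, after parsing.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

val spark = SparkSession.builder().appName("schema-at-analysis").getOrCreate()

// The PPL parser only sees the text "flatten coor"; the inner field names
// (e.g. altitude, latitude, longitude) come from the resolved table schema.
val df = spark.table("table")
df.schema("coor").dataType match {
  case s: StructType => println(s.fieldNames.mkString(", "))
  case other         => println(s"coor is not a struct: $other")
}
```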

qianheng-aws commented 4 days ago

source=table | flatten struct as subfield seems quite confusing. I'd like to change the PPL design to source=table | flatten struct rename origin_subfield1 as new_subfield1, origin_subfield2 as new_subfield2, ... even in the alias solution.

Then we would need to figure out another way to pass that field mapping to the Spark Generate operator. Unfortunately, Spark doesn't have such a mechanism; it only supports aliasing the operator's output columns in sequence.
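
For context, a minimal sketch (hypothetical table and column names, and it assumes struct_col has exactly two fields) of the positional aliasing Spark does offer: in SQL, LATERAL VIEW ... AS assigns aliases to the generator's output columns strictly in order, so there is no hook for a name-based mapping like rename origin_subfield1 as new_subfield1.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("generator-aliases").getOrCreate()

// inline(array(struct_col)) expands struct_col's fields into columns; the AS
// list renames those columns in sequence and must match their count and order.
val flattened = spark.sql(
  """
    |SELECT t.*, g.*
    |FROM `table` t
    |LATERAL VIEW inline(array(t.struct_col)) g AS new_subfield1, new_subfield2
    |""".stripMargin)
flattened.printSchema()
```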

LantaoJin commented 4 days ago

source=table | flatten struct as subfield seems quite confusing. I'd like to change the PPL design to source=table | flatten struct rename origin_subfield1 as new_subfield1, origin_subfield2 as new_subfield2, ... even in the alias solution.

Then we would need to figure out another way to pass that field mapping to the Spark Generate operator. Unfortunately, Spark doesn't have such a mechanism; it only supports aliasing the operator's output columns in sequence.

OK, I see now. So for a column such as struct_col STRUCT<field1: STRUCT<subfield: STRING>, field2: INT>, we must add an alias sequence such as | flatten struct_col as (field1_1, field2_2).
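
To spell out that mapping for this example, a small sketch (it assumes flatten expands just one level, so field1 remains a struct):

```scala
import org.apache.spark.sql.types._

// The example column type: struct_col STRUCT<field1: STRUCT<subfield: STRING>, field2: INT>
val structColType = StructType(Seq(
  StructField("field1", StructType(Seq(StructField("subfield", StringType)))),
  StructField("field2", IntegerType)))

// Aliases from "| flatten struct_col as (field1_1, field2_2)" apply in sequence:
// field1 -> field1_1 (still a struct), field2 -> field2_2 (int).
structColType.fields.map(_.name).zip(Seq("field1_1", "field2_2")).foreach {
  case (original, alias) => println(s"$original -> $alias")
}
```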

LantaoJin commented 4 days ago

The basic approach works for me; I only have some minor comments. @YANG-DB please take a look.