Open JamesRTaylor opened 4 years ago
@JamesRTaylor I think there're multiple steps to this one:
1) Prune unnest fields in UNNEST node. This is similar to pruning symbols for single-level nesting. For multi-level nesting, this is more like pushing down dereferences. for example, convert
Project ( p1 := f(A.B.C), p2 := m(K) )
unnest(arr -> {A, E, K})
S(arr)
to
Project(p1 := f(A_B_C), p2 := m(K))
unnest(arr' -> {A_B_C, K})
project(arr' -> transform(arr, el -> row(el.A.B.C, el.K)))
S(arr)
Here E is completely pruned, and A.B.C is extracted from A.
2) Push this array transformation as far down as possible in the plan node so that they can reach TableScan. We might be able to represent it through a new subclass of LambdaExpression to simplify things.
3) Add support for generic function calls in ConnectorExpression. (If this is too involved, may be we can start with creating a connector expression for the new subclass as mentioned above.) Then the ORC/Parquet reader would be able to prune the fields from the array rows.
I feel 3) won't be impactful enough without 1) and 2) for all query shapes. I've a WIP patch for 1) but needs some more work. But my hunch is, that patch, if checked-in alone, may degrade performance since UNNEST operator avoids data copy as much as possible already.
@martint what are your thoughts on this?
Thanks for the write-up, @phd3. I was thinking (1) and (2) plus using a ConnectorExpression that would contain pruned type definitions. I'd guess that the main improvement would come from the Parquet/ORC reader only reading A_B_C instead of all of A (very similar to the recent improvements for other nested data situations). Would it work to use the Variable class with a type that includes only the referenced sub fields?
I'd love to take a look at your patch. I could benchmark it on some production queries on our end if it's far enough along so we get an idea of potential impact.
The current plan to support this is the transformation you described in (1), plus the ability to push down functions (e.g., transform
) and lambda expressions to the connector. Before we can do that, we need to land some of the refactorings we've been making to the way functions are represented in the plan and add some APIs to be able to communicate and obtain information about them between the connector and the engine.
@JamesRTaylor sure. I need to understand the case of NULLs in unnest. i.e. make sure that the number of rows output by unnest(a)
is exactly the same as after pruning a. I'll create a WIP PR by next week.
any update?
Any update on this one?
2024 and most advanced query execution engines still can't read nested data efficiently 🤷♂️
Quick update. I had a stable build, but was continuously refactoring on top of Martin's recent AST/IR changes so I opted to wait on completion of those. I'm not expecting large changes after the latest major PR, so I'm fixing up the new classes and unit tests based on the new IR-friendly syntax.
@martint @Desmeister This feature can be a serious game changer for our main use-case.
We have a very large Iceberg table (~40TB of compressed ingested data per day).
The table's schema includes many structured columns - including multi-level nesting (e.g: ARRAY(ROW(col_a,col_b,..,ARRAY(ROW(col_1,col_2,col_3,...))))
).
Some of the ROW
structs includes hundreds of nested columns, and in total there are a few thousands of nested columns in the table.
When running simple analytical queries with UNNEST
operations against this table, which include only a tiny subset of the nested columns, Trino simply can't handle it - queries always fail with Query exceeded per-node memory limit..
exception.
Enabling fault-tolerant execution for those queries doesn't help as well and they fail after a while.
The only workaround we've come up with so far is 'tricking' Trino into thinking the table only has the subset of the columns required by the query. We do this by creating a Hive external table with the modified schema on top of the Iceberg table's data folder. Running the same queries against the external table is super-fast! However, this workaround has too many caveats and obviously cannot be used in production.
Do you happen to know if this feature is planned to be merged any time soon?
@Desmeister Can you provide ETA for PR?
When CROSS JOIN UNNEST is used to access some columns within an array of rows, the entire row is projected as opposed to only the subset of columns based on the references. For example:
Only
a.x
should be projected, but currently all columns ofa
are projected. For very wide rows, this can be expensive.