Open mdesmet opened 1 year ago
cc @martint @kasiafi
@mdesmet can you present this in a DBT-agnostic manner? some people involved in this project aren't fluent with DBT, so it would help them understand your ask better. Especially that not everyone knows how to expand the templating language being used in the current issue description.
I think the issue quite self explanatory. Natural join is a standard SQL feature and is well described in the SQL specification, so there’s little room for interpretation.
Yes, but there is a code snippet that looks as attempting to show why and where it's beneficial over what we support today.
where it's beneficial over what we support today.
Other than for language completeness, I don't think there's much benefit to natural joins. In general, they should be avoided, as they can inadvertently start producing unexpected results if, for instance, a new column is added to one of the underlying tables.
The only case I can think of where it makes sense is if you're joining the table to itself, as in the query above.
For dedupe specifically the simplest impl is:
SELECT * FROM table EXCEPT ALL SELECT DISTINCT * FROM table;
To be honest I've never had a case where a NATURAL JOIN couldn't be converted to something more appropriate to task at hand. The only motivation to add it (in my opinion) would be for SQL spec coverage.
The macro is some kind of templated function that can be applied within another query. That means that the context in which this query snippet will be used is unknown and only known when the query is executed, in other words the exact columns of the input relation
is unknown within the bounds of this templated function.
As you can see there is also a partition_by and order_by input that is used to select which record to keep in case of duplicates.
The EXCEPT ALL
solution can't be used as when a partition_by is given, the additional columns of the input would not be identifiable. It would work though if no partition_by is supplied.
This can only be solved with a NATURAL JOIN
as the NATURAL JOIN
defers the matched columns to the engine instead of requiring the user to input them.
In the case of the query above, why do we need a Join? Can't we just do a window and then filter, and then distinct?
Can't we just do a window and then filter, and then distinct?
The disadvantage with that approach it leaks the row number in the result set. It is something the caller of this macro has to work around to remove this column compared to the NATURAL JOIN. It is obviously better for performance compared to the self join.
Currently NATURAL JOIN is not supported in Trino
https://github.com/trinodb/trino/blob/f09633641822f659c1a9e3ddbefa3c5937dffcdc/core/trino-main/src/main/java/io/trino/sql/analyzer/StatementAnalyzer.java#L2843
The NATURAL JOIN is used in ETL frameworks to implement deduplication.
For example see this dbt snippet:
I think the natural join could be implemented through rewriting the query with a ON expression on all common columns.