trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
9.88k stars 2.86k forks source link

Add support for `NATURAL JOIN` #14455

Open mdesmet opened 1 year ago

mdesmet commented 1 year ago

Currently NATURAL JOIN is not supported in Trino

https://github.com/trinodb/trino/blob/f09633641822f659c1a9e3ddbefa3c5937dffcdc/core/trino-main/src/main/java/io/trino/sql/analyzer/StatementAnalyzer.java#L2843

The NATURAL JOIN is used in ETL frameworks to implement deduplication.

For example see this dbt snippet:

{%- macro default__deduplicate(relation, partition_by, order_by) -%}

    with row_numbered as (
        select
            _inner.*,
            row_number() over (
                partition by {{ partition_by }}
                order by {{ order_by }}
            ) as rn
        from {{ relation }} as _inner
    )

    select
        distinct data.*
    from {{ relation }} as data
    {#
    -- Not all DBs will support natural joins but the ones that do include:
    -- Oracle, MySQL, SQLite, Redshift, Teradata, Materialize, Databricks
    -- Apache Spark, SingleStore, Vertica
    -- Those that do not appear to support natural joins include:
    -- SQLServer, Trino, Presto, Rockset, Athena
    #}
    natural join row_numbered
    where row_numbered.rn = 1

{%- endmacro -%}

I think the natural join could be implemented through rewriting the query with a ON expression on all common columns.

findepi commented 1 year ago

cc @martint @kasiafi

@mdesmet can you present this in a DBT-agnostic manner? some people involved in this project aren't fluent with DBT, so it would help them understand your ask better. Especially that not everyone knows how to expand the templating language being used in the current issue description.

martint commented 1 year ago

I think the issue quite self explanatory. Natural join is a standard SQL feature and is well described in the SQL specification, so there’s little room for interpretation.

findepi commented 1 year ago

Yes, but there is a code snippet that looks as attempting to show why and where it's beneficial over what we support today.

martint commented 1 year ago

where it's beneficial over what we support today.

Other than for language completeness, I don't think there's much benefit to natural joins. In general, they should be avoided, as they can inadvertently start producing unexpected results if, for instance, a new column is added to one of the underlying tables.

The only case I can think of where it makes sense is if you're joining the table to itself, as in the query above.

hashhar commented 1 year ago

For dedupe specifically the simplest impl is:

SELECT * FROM table EXCEPT ALL SELECT DISTINCT * FROM table;

To be honest I've never had a case where a NATURAL JOIN couldn't be converted to something more appropriate to task at hand. The only motivation to add it (in my opinion) would be for SQL spec coverage.

mdesmet commented 1 year ago

The macro is some kind of templated function that can be applied within another query. That means that the context in which this query snippet will be used is unknown and only known when the query is executed, in other words the exact columns of the input relation is unknown within the bounds of this templated function.

As you can see there is also a partition_by and order_by input that is used to select which record to keep in case of duplicates.

The EXCEPT ALL solution can't be used as when a partition_by is given, the additional columns of the input would not be identifiable. It would work though if no partition_by is supplied.

This can only be solved with a NATURAL JOIN as the NATURAL JOIN defers the matched columns to the engine instead of requiring the user to input them.

findepi commented 1 year ago

In the case of the query above, why do we need a Join? Can't we just do a window and then filter, and then distinct?

mdesmet commented 1 year ago

Can't we just do a window and then filter, and then distinct?

The disadvantage with that approach it leaks the row number in the result set. It is something the caller of this macro has to work around to remove this column compared to the NATURAL JOIN. It is obviously better for performance compared to the self join.