pacificclimate / pdp_util

A package supplying numerous apps for running PCIC's data server
GNU General Public License v3.0
0 stars 0 forks source link

Natural joins in some queries are likely invalid now #49

Closed rod-glover closed 1 year ago

rod-glover commented 1 year ago

Natural joins use all columns that are common between the two tables being joined.

Since we added column publish to meta_station, it's now common between meta_station AND meta_network and might be being used as a join condition! Not good. These joins need to be made with explicit join conditions.

More generally, natural joins -- as opposed to joins with ON conditions -- are not robust to future changes in the database. Even though it's hard to imagine in some cases (e.g., Network join Variable), all natural joins should be replaced with joins with specific ON conditions. This is not hard, and there are not that many such cases.

Known cases:

agg.py

    rv = (
        sesh.query(Variable)
        .join(Network)
        .filter(Network.name == network)
        .filter(climo_filt)
    )

pcds_index.py

        query = (
            sesh.query(Network.name, Network.long_name)
            .join(Variable)
            .join(VarsPerHistory)
            .distinct()
            .order_by(Network.name)
        )
            query = (
                sesh.query(Station.native_id, History.station_name)
                .join(History)
                .join(Network)
                .join(VarsPerHistory)
                .join(Variable)
                .filter(Network.name == network_name)
                .filter(
                    or_(
                        Variable.cell_method.contains("within"),
                        Variable.cell_method.contains("over"),
                    )
                )
                .distinct()
                .order_by(Station.native_id)
            )
            query = (
                sesh.query(Station.native_id, History.station_name)
                .join(History)
                .join(Network)
                .join(VarsPerHistory)
                .join(Variable)
                .filter(Network.name == network_name)
                .filter(
                    not_(
                        or_(
                            Variable.cell_method.contains("within"),
                            Variable.cell_method.contains("over"),
                        )
                    )
                )
                .distinct()
                .order_by(Station.native_id)
            )

raster.py

            for Table in joins:
                q = q.join(Table)

where joins is a tuple of tables, without ON clauses.

util.py

    q = sesh.query(*to_select).join(Station).join(Network)
rod-glover commented 1 year ago

Except according to SQLAlchemy documentation, .join is clever, and infers the ON clause for the join using only foreign keys.

In the above calling form, Select.join() is called upon to infer the “on clause” automatically. This calling form will ultimately raise an error if either there are no ForeignKeyConstraint setup between the two mapped Table constructs, or if there are multiple ForeignKeyConstraint linakges between them such that the appropriate constraint to use is ambiguous.

This means that any other common columns are ignored, and that SQLAlchemy will raise an error if it cannot infer a suitable ON using foreign keys. Just what the doctor ordered.

rod-glover commented 1 year ago

Closing this because very likely unnecessary. Will leave the branch intact.