mabel-dev / opteryx

🦖 A SQL-on-everything Query Engine you can execute over multiple databases and file formats. Query your data, where it lives.
https://opteryx.dev
Apache License 2.0
66 stars 11 forks source link

🪲 Predicate pushdowns into JOINs don't appear to be working #1436

Open joocer opened 9 months ago

joocer commented 9 months ago

TPCH query 08

select
    nation,
    o_year,
    sum(amount) as sum_profit
from
    (
        select
            n_name as nation,
            year(o_orderdate) as o_year,
            l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity as amount
        from
            data.tpch.part,
            data.tpch.supplier,
            data.tpch.lineitem,
            data.tpch.partsupp,
            data.tpch.orders,
            data.tpch.nation
        where
            s_suppkey = l_suppkey
            and ps_suppkey = l_suppkey
            and ps_partkey = l_partkey
            and p_partkey = l_partkey
            and o_orderkey = l_orderkey
            and s_nationkey = n_nationkey
            and p_name like '%plum%'
    ) as profit
group by
    nation,
    o_year
order by
    nation,
    o_year desc;

pre-optimization

└─ EXIT
   └─ ORDER BY (nation, o_year DESC)
      └─ PROJECT (nation, o_year, SUM(amount))
         └─ AGGREGATE (SUM(amount)) GROUP BY (nation, o_year)
            └─ SUBQUERY AS profit
               └─ PROJECT (nation, YEAR(o_orderdate), l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity)
                  └─ FILTER (s_suppkey = l_suppkey AND ps_suppkey = l_suppkey AND ps_partkey = l_partkey AND p_partkey = l_partkey AND o_orderkey = l_orderkey AND s_nationkey = n_nationkey AND p_name LIKE '%plum%')
                     └─ CROSS JOIN
                        ├─ SCAN (data.tpch.part)
                        ├─ SCAN (data.tpch.supplier)
                        ├─ SCAN (data.tpch.lineitem)
                        ├─ SCAN (data.tpch.partsupp)
                        ├─ SCAN (data.tpch.orders)
                        └─ SCAN (data.tpch.nation)

post optimization

└─ EXIT
   └─ ORDER BY (nation, o_year)
      └─ PROJECT (SUM(amount), nation, o_year)
         └─ AGGREGATE (SUM(amount)) GROUP BY (nation, o_year)
            └─ SUBQUERY AS profit
               └─ PROJECT (YEAR(o_orderdate), l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity, nation)
                  └─ FILTER (p_name LIKE '%plum%')
                     └─ FILTER (s_nationkey = n_nationkey)
                        └─ FILTER (o_orderkey = l_orderkey)
                           └─ FILTER (p_partkey = l_partkey)
                              └─ FILTER (ps_partkey = l_partkey)
                                 └─ FILTER (ps_suppkey = l_suppkey)
                                    └─ FILTER (s_suppkey = l_suppkey)
                                       └─ CROSS JOIN
                                          ├─ SCAN (data.tpch.part) [p_partkey, p_name]
                                          ├─ SCAN (data.tpch.supplier) [s_suppkey, s_nationkey]
                                          ├─ SCAN (data.tpch.lineitem) [l_orderkey, l_partkey, l_suppkey, l_quantity, l_extendedprice, l_discount]
                                          ├─ SCAN (data.tpch.partsupp) [ps_partkey, ps_suppkey, ps_supplycost]
                                          ├─ SCAN (data.tpch.orders) [o_orderkey, o_orderdate]
                                          └─ SCAN (data.tpch.nation) [n_nationkey, n_name]

the optimizer has pushed the projections to the SCANs, but has not pushed the predicates into the JOIN, maybe because it's a 6-way cross join.

Note this query ultimately fails after running for a few minutes because of the YEAR function call not handling null entries.

joocer commented 1 week ago

a simpler example:

SELECT * FROM $planets, $satellites WHERE mass = gm

creates this plan:

   └─ FILTER (mass = gm)
      └─ CROSS JOIN 
         ├─ READ ($planets) [id, name, mass, diameter, density, gravity, escapeVelocity, rotationPeriod, lengthOfDay, distanceFromSun, perihelion, aphelion, orbitalPeriod, orbitalVelocity, orbitalInclination, orbitalEccentricity, obliquityToOrbit, meanTemperature, surfacePressure, numberOfMoons]
         └─ READ ($satellites) [id, planetId, name, gm, radius, density, magnitude, albedo]