prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0
15.92k stars 5.33k forks source link

Predicate pushdown may result in query failure #13009

Open mbasmanova opened 5 years ago

mbasmanova commented 5 years ago

Projection IF (o.orderstatus = 'P', t.x[3], NULL) as y combined with o.orderstatus = 'P' AND y IS NOT NULL filter on top of a join causes pushdown of x[3] IS NOT NULL predicate into t. This may trigger Array subscript out of bounds error if t has records where x has fewer than 3 elements. In case these records don't survive the join, the overall query should pass, but it fails.

To reproduce, add the following to com.facebook.presto.hive.TestHiveIntegrationSmokeTest (or execute queries manually):

    @Test
    public void test()
    {
        assertQuerySucceeds(getQueryRunner().getDefaultSession(),
                "CREATE TABLE test AS " +
                "SELECT\n" +
                "    l.linenumber,\n" +
                "    o.orderkey,\n" +
                "    o.orderstatus,\n" +
                "    IF (o.orderstatus = 'P', ARRAY[1, 2, 4], ARRAY[1]) AS x\n" +
                "FROM lineitem l, orders o\n" +
                "WHERE l.orderkey = o.orderkey");

        assertQuerySucceeds(getQueryRunner().getDefaultSession(),
                "SELECT *\n" +
                "FROM (\n" +
                "    SELECT\n" +
                "        o.orderstatus,\n" +
                "        IF (o.orderstatus = 'P', x[3], NULL) AS y\n" +
                "    FROM test t, orders o\n" +
                "    WHERE t.orderkey = o.orderkey\n" +
                ")\n" +
                "WHERE orderstatus = 'P' AND y IS NOT NULL");

        assertUpdate("DROP TABLE test");
    }

The stacktrace:

com.facebook.presto.spi.PrestoException: Array subscript out of bounds
    at com.facebook.presto.operator.scalar.ArraySubscriptOperator.checkIndex(ArraySubscriptOperator.java:166)
    at com.facebook.presto.operator.scalar.ArraySubscriptOperator.longSubscript(ArraySubscriptOperator.java:95)
    at com.facebook.presto.$gen.PageFilter_20190625_013659_979.filter(Unknown Source)
    at com.facebook.presto.$gen.PageFilter_20190625_013659_979.filter(Unknown Source)
    at com.facebook.presto.operator.project.DictionaryAwarePageFilter.filter(DictionaryAwarePageFilter.java:83)
    at com.facebook.presto.operator.project.PageProcessor.createWorkProcessor(PageProcessor.java:115)
    at com.facebook.presto.operator.project.PageProcessor.process(PageProcessor.java:101)
    at com.facebook.presto.operator.ScanFilterAndProjectOperator.processPageSource(ScanFilterAndProjectOperator.java:287)
    at com.facebook.presto.operator.ScanFilterAndProjectOperator.getOutput(ScanFilterAndProjectOperator.java:231)

com.facebook.presto.sql.planner.optimizations.PredicatePushDown.Rewriter#processInnerJoin already makes sure not to push down non-deterministic predicates. It may need to block pushdown of predicates that may generate an error or wrap these in try.

CC: @rongrong @highker @wenleix @arhimondr

P.S. The following fix introduced in 0.221 seems to expose this issue (e.g. queries that used to pass are now failing). It affects queries that use varchar columns. E.g. it would happen if type of orderstatus column in the above repro was changed to varchar.

aa6e60648d9d7d8ddedec70e500e89575d177707 is the first bad commit
commit aa6e60648d9d7d8ddedec70e500e89575d177707
Author: Andrii Rosa <andriirosa@fb.com>
Date:   Wed Apr 24 14:21:27 2019 -0400

    Fix equality inference for VARCHAR predicates
rongrong commented 5 years ago

I'll take a look.

wenleix commented 5 years ago

See https://github.com/prestodb/presto/pull/12724 for the PR. Previously, constant pushdown doesn't work for VARCHAR since Presto treats VARCHAR and VARCHAR(x) as different types: https://github.com/prestodb/presto/pull/12656#issuecomment-486070691

We can revert aa6e60648d9d7d8ddedec70e500e89575d177707 if the fix turns out to be diffiuclt. However, would this also be reproduced, say if the query is like the following shape?

SELECT *
FROM (
    SELECT
        o.int_col,
        IF (o.int_col = 42, x[3], NULL) AS y
    FROM test t, orders o
    WHERE t.orderkey = o.orderkey
)
WHERE int_col = 42 AND y IS NOT NULL
mbasmanova commented 5 years ago

@wenleix Wenlei, as you pointed out #12724 only exposed the existing problem. Reverting it helps with some queries, but there can still be queries that incorrectly fail.

rongrong commented 5 years ago

@mbasmanova Seems like the test case you created would fail in 0.220 as well. And the same test fails in 0.219 with "Table hive.tpch.lineitem does not exist". If this is a long existing issue, does it still qualify for a release blocker? If "release blocker" only means to fix regression I think reverting problematic commit is a valid option if fixing is hard. I'll still look into understanding and fixing the root cause in the meanwhile.

mbasmanova commented 5 years ago

@rongrong @highker I don't think this issue qualifies as a release blocker because it seems to exist for a long time. I also don't think we should revert #12724. I suggest we remove release-blocker label and work on a fix on a timeline that makes sense.

mbasmanova commented 5 years ago

CC: @rschlussel @aweisberg

rschlussel commented 5 years ago

Two suggestions of similar solutions we were talking about at lunch today were

  1. push down predicates but also keep the original predicate where it was. The predicate that gets pushed down would be the original filter + any rows that get an error (so sort of like the predicate wrapped in a try) and then after the join we'd have the strict predicate that would throw an exception on errors
  2. Push down predicate, but keep in any rows that get an error along with keeping track of the errors themselves. if after the join is done the error rows are still there, then surface the exception

Both of these would require some work on the execution side. The advantage of 1 is that it's easier to implement. The disadvantage is that it could cause a performance regression to do the filter twice if the predicate is expensive to compute and adds more overhead for the 99% of queries that won't have any errors.

kaikalur commented 5 years ago

Simpler self-contained test case:

WITH T1 AS (
    SELECT
        *
    FROM (
        VALUES
            ('P', 1),
            ('N', 2)
    ) T(x, z)
),
test AS (
    SELECT
        *,
        IF (o.x = 'P', ARRAY[1, 2, 4], ARRAY[1]) AS a
    FROM T1 o
)
SELECT
    *
FROM (
    SELECT
        o.x,
        IF (o.x = 'P', t.a[3], NULL) AS y
    FROM test t,
        T1 o
    WHERE
        t.z = o.z
)
WHERE
    x = 'P'
    AND y IS NOT NULL;
kaikalur commented 5 years ago

As I'm getting a hang of the bug, I see that the culprit is the second part of the final predicate:

y is not null

or any other predicate on y. So to keep things simple, I removed all the NULL and ARRAY and here is a better test.

WITH T1 AS (
    SELECT
        *
    FROM (
        VALUES
            (1),
            (2)
    ) T(k)
),
T2 AS (
    SELECT
        *,
        IF (T.x = 10, 1, 0) AS a
    FROM (
        VALUES
            (10, 1),
            (20, 2)
    ) T(x, k)
)
SELECT
    *
FROM (
    SELECT
        T2.k,
        IF (T2.k = 1, 1/T2.a, 10) AS y
    FROM T1
    LEFT JOIN T2
        ON T1.k = T2.k
)
WHERE
    k = 1
    AND y > 0 -- commenting out this line also make the bug go away
;
kaikalur commented 5 years ago

The fix is to do more pushdown! I think the issue is with k=1 not being pushed down. If we pushdown k = 1 (for outer join, depending on the side we may have to weaken it as: k is null or k = 1), it works fine - tested it manually.

So the idea would be to apply any (derived) conditions on join keys that are pushed down first and then the others - inside out.

rongrong commented 5 years ago

So the idea would be to apply any (derived) conditions on join keys that are pushed down first and then the others - inside out.

I agree. It comes down to the order of the filters pushed down to T2 and whether the pushed down filters need to respect any ordering. If SQL spec doesn't say anything about the relative orders of filters, then this might not be a "bug". Otherwise it's safer to evaluate the constant filter before evaluating the expression where the constant is folded in.

arhimondr commented 5 years ago

If SQL spec doesn't say anything about the relative orders of filters, then this might not be a "bug"

I'm pretty sure the SQL spec does not require filters to be executed in the order specified in the query. That allows SQL engines to employ dynamic filter reordering optimizations that @prestodb/aria team is currently working on. @mbasmanova could you please clarify if I'm wrong?

oerling commented 5 years ago

SQL explicitly denies any guarantees concerning execution order or execution of anything at all. SQL does specify that overflows and divisions by zero and the like get signaled to the user if these are executed. Whether these are executed depends on query plan.

Most engines have the query optimizer decide the order and thus errors in filters are or are not masked depending on filter order. This may shift without warning, however, as data cardinalities evolve.

For systems that do adaptive reordering, it is desirable to make filters delay signalling an error until all filters are evaluated. So a false or null always masks an error regardless of evaluation order. This is not per se required by the SQL spec but makes for a more consistent experience.

When filters are pushed down to a non-Presto engine there can be no guarantees of error behavior.

From: Andrii Rosa notifications@github.com Sent: Friday, August 23, 2019 7:55 AM To: prestodb/presto presto@noreply.github.com Cc: oerling erling@xs4all.nl; Team mention team_mention@noreply.github.com Subject: Re: [prestodb/presto] Predicate pushdown may result in query failure (#13009)

If SQL spec doesn't say anything about the relative orders of filters, then this might not be a "bug"

I'm pretty sure the SQL spec does not require filters to be executed in the order specified in the query. That allows SQL engines to employ dynamic filter reordering optimizations that @prestodb/aria https://github.com/orgs/prestodb/teams/aria team is currently working on. @mbasmanova https://github.com/mbasmanova could you please clarify if I'm wrong?

— You are receiving this because you are on a team that was mentioned. Reply to this email directly, view it on GitHub https://github.com/prestodb/presto/issues/13009?email_source=notifications&email_token=AKPPPT63BHD4P2JLIKWFYMDQF722TA5CNFSM4H3DUE3KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5AOFHQ#issuecomment-524346014 , or mute the thread https://github.com/notifications/unsubscribe-auth/AKPPPT65GMBICPKAUODFB7TQF722TANCNFSM4H3DUE3A . https://github.com/notifications/beacon/AKPPPT5QMR7LPVCVND2DKQ3QF722TA5CNFSM4H3DUE3KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5AOFHQ.gif

kaikalur commented 5 years ago

Yeah - looks like the following statement from page 77 in the same doc seem to say that we have to apply any literal comparisons before the result of a join is calculated. Basically page 74-81 give guidance on dependencies. I will print it out and read it again 🙂

If <join condition> is specified, AP is an equality AND-component of the <search condition>, one comparand
of AP is a column reference CR, and the other comparand of AP is a <literal>, then let CRC be the counterparts
of CR in R. Let {} denote the empty set. {} ↦ {CRC} is a known functional dependency in R if any of the following conditions is true:
— INNER is specified.
— If LEFT is specified and CR is a column reference to a column in T1.
— If RIGHT is specified and CR is a column reference to a column in T2.
NOTE 59 — An SQL-implementation may also choose to recognize {} -> {CRC} as a known functional dependency if the other
comparand is a deterministic expression containing no column references.
kaikalur commented 5 years ago

Hmm, this may just be a code bug too. For example in my final simplest test case, if in the inner most select if I select T1.k instead of T2.k, the bug doesn't happen. Looking at the code, I see in PredicatePushdown.processInnerJoin, we pushdown to the right using not(in(leftVariables)). I wonder why left fields is different in an inner join.