ossc-db / pg_hint_plan

Extension adding support for optimizer hints in PostgreSQL
Other
696 stars 103 forks source link

WA for issue with ignoring hints after catch any exception #100

Closed WinterUnicorn closed 1 year ago

WinterUnicorn commented 2 years ago

We have an issues related to the invalid value of plpgsql_recurse_level counter in the PG13 and current master branch (Postgres 14).

The issue appears when we catch and handle any error inside any function/procedure. In this case the counter "plpgsql_recurse_level" becomes invalid because the method "pg_hint_plan_plpgsql_stmt_end" was not called. Therefore after catch any error until the end of the transaction we have the following issues:

PG13: pg_hint_plan does not process new hints for new requests PG14 (master): pg_hint_plan applies old hints from old queries to the new queries.

We investigated this issue and we were able to fix it using two options:

1) Change logic of counting nested levels and use ExecutorRun/ExecutorFinish hooks instead of pg_hint_plan_plpgsql_stmt_beg/pg_hint_plan_plpgsql_stmt_end hooks. Like in the original postgres code in the pg_stat_statement.c

2) Add flag to ignore plpgsql_recurse_level and parse hints for every statement independently of the level.

Option 1 - will change the output for one of the regression tests and require more code changes. Option 2 - we lost the ability to "push" a hint into the first statement inside function from the outside. This is can be WA by moving hint inside function itself.

This pull request is implementation of the "option 2" because in this case with disabled flag we don't change behaviour of old regression tests if someone hooked up with it. We add GUC property to ingore plpgsql_recurse_level

Here is cases to reproduce issues:

For PG13 branch + Postgresql 13

-- PG13
begin;
create or replace function pg_temp.func_with_exception(p_var varchar) returns numeric language plpgsql as 
$$
begin 
    return p_var::numeric;
exception when others then null;
    return -1;
end;
$$;

select pg_temp.func_with_exception('abc');
explain (COSTS false) /*+MergeJoin(d1 d2) */ with dual as (select 'x' as dummy) select * from dual d1, dual d2 where d1.dummy = d2.dummy;
-- Hint will be ignored
drop function pg_temp.func_with_exception(varchar);
rollback;

For master branch + Postgresql 14

-- PG14 case
begin;
select set_config('compute_query_id','off', true);

create or replace function pg_temp.test_bug_hint(p_var varchar) returns  numeric language plpgsql as 
$$
declare 
begin 
    return p_var::numeric;
exception when others then return 1;
end;
$$;

select pg_temp.test_bug_hint('x');
-- Counter of nested level was changed to 1 due to invalid tracking during exception

explain with test /*+ MergeJoin(t1 t2) */ as (select 'x' val) select t1.val from test t1, test t2 where t1.val  = t2.val;
explain with test2 as (select 'y'  val) select t1.val from test t1, test t2 where t1.val <> t2.val;
-- MergeJoin will be applied to the both queries because current_hint_str wasn't cleanup due to the current nested level logic
rollback;
Lerm commented 1 year ago

This issue is still valid for both PG14 and PG15.

Updated test case to reproduce the issue on PG14 and PG15 (fixed table name / condition in the original TC):

create or replace function pg_temp.test_bug_hint(p_var varchar) returns numeric language plpgsql as 
$$
begin 
 return p_var::numeric;
exception when others then return 1;
end;
$$;

begin;
select set_config('compute_query_id','off', true);

-- Show plan without hints
explain with test as (select 'z' val) select t1.val from test t1, test t2 where t1.val = t2.val; 
-- Invoke function which throws and catch exception inside
select pg_temp.test_bug_hint('x');
-- Counter of nested level was changed to 1 due to invalid tracking during exception

-- Show lan with MergeJoin hint
explain with test /*+ MergeJoin(t1 t2) */ as (select 'x' val) select t1.val from test t1, test t2 where t1.val = t2.val;
-- Next query is expected to have the same plan as the first one, but MergeJoin hint from the last query is erroneously applied to it
explain with test as (select 'y' val) select t1.val from test t1, test t2 where t1.val = t2.val;
-- MergeJoin is applied to both queries because current_hint_str wasn't cleared due to the current nested level logic
rollback;

Sample output on PG15 with 'debug_print' set to 'verbose' and 'message_level' to 'warning':

test=# \i test.sql
CREATE FUNCTION
BEGIN
psql:test.sql:10: WARNING:  hints in comment="(none)", query="select set_config('compute_query_id','off', true);", debug_query_string="select set_config('compute_query_id','off', true);"
psql:test.sql:10: WARNING:  pg_hint_plan[qno=0x0]: planner: no valid hint
 set_config
------------
 off
(1 row)

psql:test.sql:13: WARNING:  hints in comment="(none)", query="explain with test as (select 'z' val) select t1.val from test t1, test t2 where t1.val = t2.val;", debug_query_string="explain with test as (select 'z' val) select t1.val from test t1, test t2 where t1.val = t2.val;"
psql:test.sql:13: WARNING:  pg_hint_plan[qno=0x1]: planner: no valid hint
                          QUERY PLAN
--------------------------------------------------------------
 Nested Loop  (cost=0.01..0.06 rows=1 width=32)
   Join Filter: (t1.val = t2.val)
   CTE test
     ->  Result  (cost=0.00..0.01 rows=1 width=32)
   ->  CTE Scan on test t1  (cost=0.00..0.02 rows=1 width=32)
   ->  CTE Scan on test t2  (cost=0.00..0.02 rows=1 width=32)
(6 rows)

psql:test.sql:15: WARNING:  hints in comment="(none)", query="select pg_temp.test_bug_hint('x');", debug_query_string="select pg_temp.test_bug_hint('x');"
psql:test.sql:15: WARNING:  pg_hint_plan[qno=0x2]: planner: no valid hint
psql:test.sql:15: WARNING:  hints in comment="(none)", query="p_var::numeric", debug_query_string="select pg_temp.test_bug_hint('x');"
psql:test.sql:15: WARNING:  pg_hint_plan[qno=0x3]: planner: no valid hint
psql:test.sql:15: WARNING:  hints in comment="(none)", query="1", debug_query_string="select pg_temp.test_bug_hint('x');"
psql:test.sql:15: WARNING:  pg_hint_plan[qno=0x4]: planner: no valid hint
 test_bug_hint
---------------
             1
(1 row)

psql:test.sql:19: WARNING:  hints in comment="MergeJoin(t1 t2) ", query="explain with test /*+ MergeJoin(t1 t2) */ as (select 'x' val) select t1.val from test t1, test t2 where t1.val = t2.val;", debug_query_string="explain with test /*+ MergeJoin(t1 t2) */ as (select 'x' val) select t1.val from test t1, test t2 where t1.val = t2.val;"
psql:test.sql:19: WARNING:  pg_hint_plan[qno=0x5]: planner
psql:test.sql:19: WARNING:  pg_hint_plan[qno=0x5]: HintStateDump: {used hints:MergeJoin(t1 t2)}, {not used hints:(none)}, {duplicate hints:(none)}, {error hints:(none)}
                             QUERY PLAN
--------------------------------------------------------------------
 Merge Join  (cost=0.07..0.10 rows=1 width=32)
   Merge Cond: (t1.val = t2.val)
   CTE test
     ->  Result  (cost=0.00..0.01 rows=1 width=32)
   ->  Sort  (cost=0.03..0.04 rows=1 width=32)
         Sort Key: t1.val
         ->  CTE Scan on test t1  (cost=0.00..0.02 rows=1 width=32)
   ->  Sort  (cost=0.03..0.04 rows=1 width=32)
         Sort Key: t2.val
         ->  CTE Scan on test t2  (cost=0.00..0.02 rows=1 width=32)
(10 rows)

psql:test.sql:21: WARNING:  hints in comment="(none)", query="explain with test as (select 'y' val) select t1.val from test t1, test t2 where t1.val = t2.val;", debug_query_string="explain with test as (select 'y' val) select t1.val from test t1, test t2 where t1.val = t2.val;"
psql:test.sql:21: WARNING:  pg_hint_plan[qno=0x6]: planner
psql:test.sql:21: WARNING:  pg_hint_plan[qno=0x6]: HintStateDump: {used hints:MergeJoin(t1 t2)}, {not used hints:(none)}, {duplicate hints:(none)}, {error hints:(none)}
                             QUERY PLAN
--------------------------------------------------------------------
 Merge Join  (cost=0.07..0.10 rows=1 width=32)
   Merge Cond: (t1.val = t2.val)
   CTE test
     ->  Result  (cost=0.00..0.01 rows=1 width=32)
   ->  Sort  (cost=0.03..0.04 rows=1 width=32)
         Sort Key: t1.val
         ->  CTE Scan on test t1  (cost=0.00..0.02 rows=1 width=32)
   ->  Sort  (cost=0.03..0.04 rows=1 width=32)
         Sort Key: t2.val
         ->  CTE Scan on test t2  (cost=0.00..0.02 rows=1 width=32)
(10 rows)

ROLLBACK

Expected result: query 1 and 3 should have the same execution plan (as they both have no hints specified). Actual result: query 2 and 3 have the same execution plan (as hint from query 2 has leaked to the query 3). While query 3 has no hints specified, but its 'HintStateDump' output shows 'used hints:MergeJoin(t1 t2)' message.

michaelpq commented 1 year ago

This issue is still valid for both PG14 and PG15.

Yes, we are looking into that. I am surprised that the proposed patch introduces a GUC. This looks like a bug, and giving the option to users to hide it does not make much sense if it is just plain broken.

MasahikoSawada commented 1 year ago

This looks like a bug, and giving the option to users to hide it does not make much sense if it is just plain broken.

+1

ISTM the root cause of this problem is that stmt_end() callback is not called if an error raised during execution is caught. I guess It could be a problem in Postgres core.

michaelpq commented 1 year ago

ISTM the root cause of this problem is that stmt_end() callback is not called if an error raised during execution is caught. I guess It could be a problem in Postgres core.

Ohh. Interesting..

MasahikoSawada commented 1 year ago

I'll ask this on -hackers.

michaelpq commented 1 year ago

The question has been asked by @MasahikoSawada on the upstream lists as of https://www.postgresql.org/message-id/CAD21AoDXeTOR-j1dfSD9H-nf60C8m2DvFOzh%2BC516WuPv9MAzg%40mail.gmail.com.

And the answer is "you are doing it wrong, so do something by yourself". So the path to take is clear.

Lerm commented 1 year ago

So, the problem is that the 'pg_hint_plan_plpgsql_stmt_end' is not invoked in the case of an error - and it seems that such behavior was faced before and the original commit b752a9574cfe8b6d21ac2f7a4958e5e9eab71f3e was designed to be a 'safenet' at the end of transaction. However, it doesn't cover situation in which the exception is intercepted and transaction continue its execution. Thinking of the way to extend this fix for the case described in current issue - maybe we could extend 'plpgsql_query_erase_callback' to cover this case as well? Using something like this:

void plpgsql_query_erase_callback(ResourceReleasePhase phase,
                                  bool isCommit,
                                  bool isTopLevel,
                                  void *arg)
{
    if (phase != RESOURCE_RELEASE_AFTER_LOCKS)
        return;
    if (isTopLevel)
        {
        /* Cancel plpgsql nest level*/
        plpgsql_recurse_level = 0;
    }
    else if (!isCommit && plpgsql_recurse_level > 0)
    {
        /* PLpgSQL doesn't invoke stmt_end in case of exception */
        plpgsql_recurse_level--;
    }
}

This change pass both existing regress testing and the simple test case shown in previous comments. However, I'm unsure that it will work correctly in case of deep nesting (i.e. if we need to decrease level not by one) - need to check such cases.

Also, could we clearly specify - why there is a need to tracking for nesting level in the 'pg_hint_plan' in the first place? Looking at the regression tests -- it seems to be covered by A13 set of tests, but the entire logic is not fully clear for me. The original PR was making two proposals:

However, both these variants change outputs of some tests in the A13 section. I was assuming that tests just need to be updated, but looking on the actual plans with 'auto_explain' - they seems to be really incorrect with either of this change, so it seems that there are some corner cases which really relies on the current behavior.

michaelpq commented 1 year ago

This change pass both existing regress testing and the simple test case shown in previous comments. However, I'm unsure that it will work correctly in case of deep nesting (i.e. if we need to decrease level not by one) - need to check such cases.

Yeah, I was wondering whether we could do something like what you are suggesting here and extend the clear callback to lower by one rank when we are in a subtransaction context. Now, I have to admit that registering a resowner when loading the module sounds pretty much wrong to me, because we live with the workload of going through it for all queries. I have more experience with keeping these short-lived, TBH, and I am really wondering if it would be better to register that stuff only when we know that going to recurse to get the query hints from the inner queries.

michaelpq commented 1 year ago

Also, could we clearly specify - why there is a need to tracking for nesting level in the 'pg_hint_plan' in the first place? Looking at the regression tests -- it seems to be covered by A13 set of tests, but the entire logic is not fully clear for me.

This part is equally unclear to me. I have been playing around with that and I have noticed that it is fairly easy to cause the counter to turn negative when using a plpgsql function that triggers its own commits or rollback, like that:

create table test_hint_tab (a int);
create procedure test_hint_transaction()
language plpgsql
as $$
begin
    for i in 0..9 LOOP
        insert into test_hint_tab (a) values (i);
        if i % 2 = 0 then
            commit;
        else
            rollback;
        end if;
    end loop;
end;
$$;

Coupled with internal queries that use their own hints, it would be easy to cause various incorrect behaviors.

This comes down to the fact that we go through the erase callback twice even on a simple commit (end of transaction and portal drop), so I'd like to agree with you about the fact that the approach using a resowner callback should be kicked out into orbit. Presumably, a safe design should be that we should always decrement plpgsql_recurse_level only when it is strictly positive, meaning that the final result should have assertions looking after that.

michaelpq commented 1 year ago

So, I have spent more time on that, and it looks like I have been able to get down to a simple solution by tweaking the resowner callback. In the long-term, we should really get out of this set of problems by using something else for the cleanup of the recursion level.. I have added a large set of regression tests to cover all the cases I could think about, with a mix of nested transactions, exceptions and functions using commit/rollback internally.