reorg / pg_repack

Reorganize tables in PostgreSQL databases with minimal locks
BSD 3-Clause "New" or "Revised" License
1.89k stars 176 forks source link

Endless recursive call of repack_cleanup_callback() when lock failed #419

Open JohnHien opened 2 months ago

JohnHien commented 2 months ago

Hi all, I used pg_repack 1.5.0 to repack a normal table and I encountered endless recursive call of repack_cleanup_callback() if I failed to lock the table, finally I got a coredump file.

Here are my SQL and shell commands. I'm sorry that you cannot reproduce it in community version of PG and pg_repack because I used it in a cloud database Alibaba Cloud PolarDB-PG. In this db there is a kind of quasi-superuser which has more privileges than normal user but less privileges than real superuser, whose name is POLAR_SUPERUSER. If the user doesn't have the privilege of the table, we can reproduce the error case because we cannot lock the table.

Create normal table and extension with real superuser:

postgres=# create table test(a int primary key);
CREATE TABLE
postgres=# create extension pg_repack ;
CREATE EXTENSION

Create a POLAR_SUPERUSER, grant the privileges of schema repack to it. But it doesn't have the privilege of the normal table.

postgres=# create user test_user with polar_superuser;
CREATE ROLE
postgres=# grant all on schema repack to test_user ;
GRANT
postgres=# grant all on all tables in schema repack to test_user ;
GRANT

Then repack the table with the user created in last step and we can get the error messages.

pg_repack --dbname=postgres --user test_user -k --table=test

The error messages and coredump stack are as follows.

INFO: repacking table "public.test"
ERROR:  permission denied for table test
ERROR: query failed: ERROR:  current transaction is aborted, commands ignored until end of transaction block
DETAIL: query was: RESET lock_timeout
ERROR:  permission denied for table test
ERROR: query failed: ERROR:  current transaction is aborted, commands ignored until end of transaction block
……
……
ERROR: query failed: ERROR:  current transaction is aborted, commands ignored until end of transaction block
DETAIL: query was: RESET lock_timeout
ERROR:  permission denied for table test
ERROR: query failed: ERROR:  current transaction is aborted, commands ignored until end of transaction block
DETAIL: query was: RESET lock_timeout
ERROR:  permission denied for table test
ERROR: query failed: ERROR:  current transaction is aborted, commands ignored until end of transaction block
DETAIL: query was: RESET lock_timeout
Segmentation fault (core dumped)
(gdb) bt
#0  0x00007f1cddc74e46 in pg_snprintf (str=<error reading variable: Cannot access memory at address 0x7ffd1840df78>,
    count=<error reading variable: Cannot access memory at address 0x7ffd1840df70>,
    fmt=<error reading variable: Cannot access memory at address 0x7ffd1840df68>) at snprintf.c:203
#1  0x00007f1cddc6dee8 in getnameinfo_unix (sa=0x11c0678, salen=110, node=0x0, nodelen=0,
    service=0x7ffd1840e150 "\364\221\307\335\034\177", servicelen=1025, flags=2) at ip.c:256
#2  0x00007f1cddc6db7e in pg_getnameinfo_all (addr=0x11c0678, salen=110, node=0x0, nodelen=0,
    service=0x7ffd1840e150 "\364\221\307\335\034\177", servicelen=1025, flags=2) at ip.c:122
#3  0x00007f1cddc4a60c in emitHostIdentityInfo (conn=0x11c0410, host_addr=0x7ffd1840e7f0 "") at fe-connect.c:2023
#4  0x00007f1cddc4c0a7 in PQconnectPoll (conn=0x11c0410) at fe-connect.c:2938
#5  0x00007f1cddc4b083 in connectDBStart (conn=0x11c0410) at fe-connect.c:2408
#6  0x00007f1cddc48244 in PQconnectStartParams (keywords=0x7ffd1840ed10, values=0x7ffd1840ed40, expand_dbname=1)
    at fe-connect.c:857
#7  0x00007f1cddc4807c in PQconnectdbParams (keywords=0x7ffd1840ed10, values=0x7ffd1840ed40, expand_dbname=1)
    at fe-connect.c:707
#8  0x000000000040ab85 in pgut_connect (dbname=0x11b6270 "postgres", host=0x0, port=0x0, username=0x11b6290 "test_user",
    password=0x0, prompt=DEFAULT, elevel=21) at pgut/pgut.c:532
#9  0x000000000040d1fd in reconnect (elevel=21) at pgut/pgut-fe.c:167
#10 0x0000000000408a81 in repack_cleanup_callback (fatal=true, userdata=0x7ffd18c0b270) at pg_repack.c:2349
#11 0x000000000040c2b4 in call_atexit_callbacks (fatal=true) at pgut/pgut.c:1240
#12 0x000000000040c318 in exit_or_abort (exitcode=-2, elevel=21) at pgut/pgut.c:1259
#13 0x000000000040ba6b in pgut_errfinish (dummy=0) at pgut/pgut.c:930
#14 0x000000000040b062 in pgut_execute_elevel (conn=0x11c0410, query=0x4188d9 "RESET lock_timeout", nParams=0, params=0x0,
    elevel=21) at pgut/pgut.c:672
#15 0x000000000040aeb1 in pgut_execute (conn=0x11c0410, query=0x4188d9 "RESET lock_timeout", nParams=0, params=0x0)
    at pgut/pgut.c:626
#16 0x000000000040b0a3 in pgut_command (conn=0x11c0410, query=0x4188d9 "RESET lock_timeout", nParams=0, params=0x0)
    at pgut/pgut.c:688
#17 0x00000000004089d5 in lock_exclusive (conn=0x11c0410, relid=0x7ffd1840f4f0 "16431",
    lock_query=0x11cb43c "LOCK TABLE public.test IN ACCESS EXCLUSIVE MODE", start_xact=false) at pg_repack.c:2322
#18 0x0000000000408ab5 in repack_cleanup_callback (fatal=true, userdata=0x7ffd18c0b270) at pg_repack.c:2352
#19 0x000000000040c2b4 in call_atexit_callbacks (fatal=true) at pgut/pgut.c:1240
#20 0x000000000040c318 in exit_or_abort (exitcode=-2, elevel=21) at pgut/pgut.c:1259
#21 0x000000000040ba6b in pgut_errfinish (dummy=0) at pgut/pgut.c:930
#22 0x000000000040b062 in pgut_execute_elevel (conn=0x11c0410, query=0x4188d9 "RESET lock_timeout", nParams=0, params=0x0,
    elevel=21) at pgut/pgut.c:672
#23 0x000000000040aeb1 in pgut_execute (conn=0x11c0410, query=0x4188d9 "RESET lock_timeout", nParams=0, params=0x0)
    at pgut/pgut.c:626
#24 0x000000000040b0a3 in pgut_command (conn=0x11c0410, query=0x4188d9 "RESET lock_timeout", nParams=0, params=0x0)
    at pgut/pgut.c:688
#25 0x00000000004089d5 in lock_exclusive (conn=0x11c0410, relid=0x7ffd1840fbc0 "16431",
    lock_query=0x11cb43c "LOCK TABLE public.test IN ACCESS EXCLUSIVE MODE", start_xact=false) at pg_repack.c:2322
#26 0x0000000000408ab5 in repack_cleanup_callback (fatal=true, userdata=0x7ffd18c0b270) at pg_repack.c:2352
#27 0x000000000040c2b4 in call_atexit_callbacks (fatal=true) at pgut/pgut.c:1240
#28 0x000000000040c318 in exit_or_abort (exitcode=-2, elevel=21) at pgut/pgut.c:1259
#29 0x000000000040ba6b in pgut_errfinish (dummy=0) at pgut/pgut.c:930
#30 0x000000000040b062 in pgut_execute_elevel (conn=0x11c0410, query=0x4188d9 "RESET lock_timeout", nParams=0, params=0x0,
    elevel=21) at pgut/pgut.c:672
#31 0x000000000040aeb1 in pgut_execute (conn=0x11c0410, query=0x4188d9 "RESET lock_timeout", nParams=0, params=0x0)
    at pgut/pgut.c:626
#32 0x000000000040b0a3 in pgut_command (conn=0x11c0410, query=0x4188d9 "RESET lock_timeout", nParams=0, params=0x0)
    at pgut/pgut.c:688
#33 0x00000000004089d5 in lock_exclusive (conn=0x11c0410, relid=0x7ffd18410290 "16431",
    lock_query=0x11cb43c "LOCK TABLE public.test IN ACCESS EXCLUSIVE MODE", start_xact=false) at pg_repack.c:2322
#34 0x0000000000408ab5 in repack_cleanup_callback (fatal=true, userdata=0x7ffd18c0b270) at pg_repack.c:2352
#35 0x000000000040c2b4 in call_atexit_callbacks (fatal=true) at pgut/pgut.c:1240
#36 0x000000000040c318 in exit_or_abort (exitcode=-2, elevel=21) at pgut/pgut.c:1259
……
……
#38402 0x0000000000408ab5 in repack_cleanup_callback (fatal=true, userdata=0x7ffd18c0b270) at pg_repack.c:2352
#38403 0x000000000040c2b4 in call_atexit_callbacks (fatal=true) at pgut/pgut.c:1240
#38404 0x000000000040c318 in exit_or_abort (exitcode=-2, elevel=21) at pgut/pgut.c:1259
#38405 0x000000000040ba6b in pgut_errfinish (dummy=0) at pgut/pgut.c:930
#38406 0x000000000040b062 in pgut_execute_elevel (conn=0x11c0410, query=0x4188d9 "RESET lock_timeout", nParams=0, params=0x0,
    elevel=21) at pgut/pgut.c:672
#38407 0x000000000040aeb1 in pgut_execute (conn=0x11c0410, query=0x4188d9 "RESET lock_timeout", nParams=0, params=0x0)
    at pgut/pgut.c:626
#38408 0x000000000040b0a3 in pgut_command (conn=0x11c0410, query=0x4188d9 "RESET lock_timeout", nParams=0, params=0x0)
    at pgut/pgut.c:688
#38409 0x00000000004089d5 in lock_exclusive (conn=0x11c0410, relid=0x7ffd18c0aa20 "16431",
    lock_query=0x11cb43c "LOCK TABLE public.test IN ACCESS EXCLUSIVE MODE", start_xact=false) at pg_repack.c:2322
#38410 0x0000000000408ab5 in repack_cleanup_callback (fatal=true, userdata=0x7ffd18c0b270) at pg_repack.c:2352
#38411 0x000000000040c2b4 in call_atexit_callbacks (fatal=true) at pgut/pgut.c:1240
#38412 0x000000000040c318 in exit_or_abort (exitcode=-2, elevel=21) at pgut/pgut.c:1259
#38413 0x000000000040ba6b in pgut_errfinish (dummy=0) at pgut/pgut.c:930
#38414 0x000000000040b062 in pgut_execute_elevel (conn=0x11b63e0, query=0x4188d9 "RESET lock_timeout", nParams=0, params=0x0,
    elevel=21) at pgut/pgut.c:672
#38415 0x000000000040aeb1 in pgut_execute (conn=0x11b63e0, query=0x4188d9 "RESET lock_timeout", nParams=0, params=0x0)
    at pgut/pgut.c:626
#38416 0x000000000040b0a3 in pgut_command (conn=0x11b63e0, query=0x4188d9 "RESET lock_timeout", nParams=0, params=0x0)
    at pgut/pgut.c:688
#38417 0x00000000004089d5 in lock_exclusive (conn=0x11b63e0, relid=0x7ffd18c0b160 "16431",
    lock_query=0x11cb43c "LOCK TABLE public.test IN ACCESS EXCLUSIVE MODE", start_xact=true) at pg_repack.c:2322
#38418 0x0000000000406e56 in repack_one_table (table=0x7ffd18c0b270, orderby=0x0, polar_gi_list=0x7ffd18c0b1e0,
    polar_num_gi=0x7ffd18c0b1cc) at pg_repack.c:1568
#38419 0x000000000040586e in repack_one_database (orderby=0x0, errbuf=0x7ffd18c0b370 "\240B\033\001", errsize=256)
    at pg_repack.c:1030
#38420 0x0000000000404018 in main (argc=6, argv=0x7ffd18c0b588) at pg_repack.c:446

In lock_exclusive() we failed to lock the table. So we printed the error and tried to RESET lock_timeout. However, we failed again here and we called the callback function repack_cleanup_callback(). In repack_cleanup_callback() the lock_exclusive() was called again and we failed again when locking the table...

static bool
lock_exclusive(PGconn *conn, const char *relid, const char *lock_query, bool start_xact)
{
        ……
    for (i = 1; ; i++)
    {
                ……
        res = pgut_execute_elevel(conn, lock_query, 0, NULL, DEBUG2);
        if (PQresultStatus(res) == PGRES_COMMAND_OK)
                        ……
        else if (sqlstate_equals(res, SQLSTATE_LOCK_NOT_AVAILABLE))
                        ……
        else
        {
            /* exit otherwise */
            printf("%s", PQerrorMessage(connection));
            CLEARPGRES(res);
            ret = false;
            break;
        }
    }
    pgut_command(conn, "RESET lock_timeout", 0, NULL);
    return ret;
}

Even if the permission denied error will never happen in pg_repack community version, I think there could be other scenarios that we fail to lock the table and reprodule the coredump. So this is not an invalid bug?

How can we avoid endless recursive call of the callback function? Maybe we can limit the stack depth to avoid coredump?

za-arthur commented 1 week ago

This should have been fixed by the PRs: https://github.com/reorg/pg_repack/pull/427, https://github.com/reorg/pg_repack/pull/431