powa-team / pg_qualstats

A PostgreSQL extension for collecting statistics about predicates, helping find what indices are missing
Other
272 stars 26 forks source link

pg_qualstats : segmentation fault #20

Closed ng-pe closed 6 years ago

ng-pe commented 6 years ago

Hello,

During an important load I observe a problem... a "segmentation fault"... here is the detail : This is an installation with Powa and pg_qualstats. Installation: OS: Centos 7 (CentOS Linux release 7.4.1708 (Core)) CPU : E5620 6core / Ram: 8 GB PostgreSQL : 9.6.8 (from PGDG) Powa : 3.1.1 pg_qualstats : 1.0.3

Postgresql.conf configuration :

work_mem = 16MB
maintenance_work_mem = 128MB shared_buffers = 2GB
shared_preload_libraries='pg_stat_statements,powa,pg_qualstats' powa.frequency='5min' powa.retention='5d' powa.database='powa' powa.coalesce='100' pg_qualstats.enabled = true pg_qualstats.max = 10000

Powa installation :

CREATE DATABASE powa ; \c powa CREATE EXTENSION powa CASCADE; CREATE EXTENSION pg_qualstats;

How to reproduce the bug: Creating the test pgbench table :

/usr/pgsql-9.6/bin/pgbench -i -s 10 test

script4.sql :

\set aaa random(1, 100000) \set bbb random(1, 50000) UPDATE pgbench_accounts SET filler = :bbb where aid = :aaa ;

pg bench :

/usr/pgsql-9.6/bin/pgbench -v -c -f script4.sql -j 1 -n -t 10000000000000000000 test

I added some NOTICE in pgqualstats to better understand the problem and I observe :

  1. First of all i observe out of memory in the postgresql logs :

    2018-04-04 09:32:41 CEST [14337]: [1-1] user=postgres,db=test,client=[local],app=pgbench NOTICE: 01000: DEBUG_NGO hash_get_num_entries(pgqs_hash) + hash_get_num_entries(pgqs_localhash) : 1 2018-04-04 09:32:41 CEST [14337]: [2-1] user=postgres,db=test,client=[local],app=pgbench LOCATION: pgqs_ExecutorEnd, pg_qualstats.c:735 2018-04-04 09:32:41 CEST [14337]: [3-1] user=postgres,db=test,client=[local],app=pgbench NOTICE: 01000: DEBUG_NGO hash_get_num_entries(pgqs_hash) (before search) : 0 2018-04-04 09:32:41 CEST [14337]: [4-1] user=postgres,db=test,client=[local],app=pgbench LOCATION: pgqs_ExecutorEnd, pg_qualstats.c:745 2018-04-04 09:32:41 CEST [14337]: [5-1] user=postgres,db=test,client=[local],app=pgbench ERROR: 53200: out of shared memory 2018-04-04 09:32:41 CEST [14337]: [6-1] user=postgres,db=test,client=[local],app=pgbench LOCATION: hash_search_with_hash_value, dynahash.c:1042 2018-04-04 09:32:41 CEST [14337]: [7-1] user=postgres,db=test,client=[local],app=pgbench STATEMENT: UPDATE pgbench_accounts SET filler = 34564 where aid = 43970 ;

  2. Then the "pgqs_hash" takes a rather high value at once (much higher than "pgqs_max").

    2018-04-04 09:32:41 CEST [14289]: [1-1] user=postgres,db=test,client=[local],app=pgbench NOTICE: 01000: DEBUG_NGO hash_get_num_entries(pgqs_hash) + hash_get_num_entries(pgqs_localhash) : 65537 2018-04-04 09:32:41 CEST [14289]: [2-1] user=postgres,db=test,client=[local],app=pgbench LOCATION: pgqs_ExecutorEnd, pg_qualstats.c:735

  3. This is repeated a few hundred times without the values changing...

    2018-04-04 09:32:41 CEST [14289]: [1143-1] user=postgres,db=test,client=[local],app=pgbench NOTICE: 01000: DEBUG_NGO =>>>>pgqs_entry_dealloc<<<<= 2018-04-04 09:32:41 CEST [14289]: [1144-1] user=postgres,db=test,client=[local],app=pgbench LOCATION: pgqs_entry_dealloc, pg_qualstats.c:813 2018-04-04 09:32:41 CEST [14289]: [1145-1] user=postgres,db=test,client=[local],app=pgbench NOTICE: 01000: DEBUG_NGO pgqs_entry_dealloc hash_get_num_entries(pgqs_hash) : 65536 2018-04-04 09:32:41 CEST [14289]: [1146-1] user=postgres,db=test,client=[local],app=pgbench LOCATION: pgqs_entry_dealloc, pg_qualstats.c:827 2018-04-04 09:32:41 CEST [14289]: [1147-1] user=postgres,db=test,client=[local],app=pgbench NOTICE: 01000: DEBUG_NGO pgqs_entry_dealloc hash_estimate_size(pgqs_hash) : 2100680 2018-04-04 09:32:41 CEST [14289]: [1148-1] user=postgres,db=test,client=[local],app=pgbench LOCATION: pgqs_entry_dealloc, pg_qualstats.c:828

  4. Then after, the memory allocation requests become inconsistent...:

    2018-04-04 09:32:41 CEST [14289]: [12465-1] user=postgres,db=test,client=[local],app=pgbench NOTICE: 01000: DEBUG_NGO =>>>>pgqs_entry_dealloc<<<<= 2018-04-04 09:32:41 CEST [14289]: [12466-1] user=postgres,db=test,client=[local],app=pgbench LOCATION: pgqs_entry_dealloc, pg_qualstats.c:813 2018-04-04 09:32:41 CEST [14289]: [12467-1] user=postgres,db=test,client=[local],app=pgbench NOTICE: 01000: DEBUG_NGO pgqs_entry_dealloc hash_get_num_entries(pgqs_hash) : 65536 2018-04-04 09:32:41 CEST [14289]: [12468-1] user=postgres,db=test,client=[local],app=pgbench LOCATION: pgqs_entry_dealloc, pg_qualstats.c:827 2018-04-04 09:32:41 CEST [14289]: [12469-1] user=postgres,db=test,client=[local],app=pgbench NOTICE: 01000: DEBUG_NGO pgqs_entry_dealloc hash_estimate_size(pgqs_hash) : 2100680 2018-04-04 09:32:41 CEST [14289]: [12470-1] user=postgres,db=test,client=[local],app=pgbench LOCATION: pgqs_entry_dealloc, pg_qualstats.c:828 2018-04-04 09:32:41 CEST [14289]: [12471-1] user=postgres,db=test,client=[local],app=pgbench ERROR: XX000: invalid memory alloc request size 8796093546496 2018-04-04 09:32:41 CEST [14289]: [12472-1] user=postgres,db=test,client=[local],app=pgbench LOCATION: palloc, mcxt.c:900 018-04-04 09:32:44 CEST [14280]: [1-1] user=postgres,db=test,client=[local],app=pgbench NOTICE: 01000: DEBUG_NGO hash_get_num_entries(pgqs_hash) + hash_get_num_entries(pgqs_localhash) : 72058693549621249 2018-04-04 09:32:44 CEST [14280]: [2-1] user=postgres,db=test,client=[local],app=pgbench LOCATION: pgqs_ExecutorEnd, pg_qualstats.c:735 2018-04-04 09:32:44 CEST [14280]: [3-1] user=postgres,db=test,client=[local],app=pgbench NOTICE: 01000: DEBUG_NGO =>>>>pgqs_entry_dealloc<<<<= 2018-04-04 09:32:44 CEST [14280]: [4-1] user=postgres,db=test,client=[local],app=pgbench LOCATION: pgqs_entry_dealloc, pg_qualstats.c:813 2018-04-04 09:32:44 CEST [14280]: [5-1] user=postgres,db=test,client=[local],app=pgbench NOTICE: 01000: DEBUG_NGO pgqs_entry_dealloc hash_get_num_entries(pgqs_hash) : 72058693549621248 2018-04-04 09:32:44 CEST [14280]: [6-1] user=postgres,db=test,client=[local],app=pgbench LOCATION: pgqs_entry_dealloc, pg_qualstats.c:827 2018-04-04 09:32:44 CEST [14280]: [7-1] user=postgres,db=test,client=[local],app=pgbench NOTICE: 01000: DEBUG_NGO pgqs_entry_dealloc hash_estimate_size(pgqs_hash) : 2886833749425128424 2018-04-04 09:32:44 CEST [14280]: [8-1] user=postgres,db=test,client=[local],app=pgbench LOCATION: pgqs_entry_dealloc, pg_qualstats.c:828 2018-04-04 09:32:44 CEST [14280]: [9-1] user=postgres,db=test,client=[local],app=pgbench ERROR: XX000: invalid memory alloc request size 576469548396969984 2018-04-04 09:32:44 CEST [14280]: [10-1] user=postgres,db=test,client=[local],app=pgbench LOCATION: palloc, mcxt.c:900 2018-04-04 09:32:55 CEST [14327]: [1-1] user=postgres,db=test,client=[local],app=pgbench NOTICE: 01000: DEBUG_NGO hash_get_num_entries(pgqs_hash) + hash_get_num_entries(pgqs_localhash) : 72340172838076674 2018-04-04 09:32:55 CEST [14327]: [2-1] user=postgres,db=test,client=[local],app=pgbench LOCATION: pgqs_ExecutorEnd, pg_qualstats.c:735 2018-04-04 09:32:55 CEST [14327]: [3-1] user=postgres,db=test,client=[local],app=pgbench NOTICE: 01000: DEBUG_NGO =>>>>pgqs_entry_dealloc<<<<= 2018-04-04 09:32:55 CEST [14327]: [4-1] user=postgres,db=test,client=[local],app=pgbench LOCATION: pgqs_entry_dealloc, pg_qualstats.c:813 2018-04-04 09:32:55 CEST [14327]: [5-1] user=postgres,db=test,client=[local],app=pgbench NOTICE: 01000: DEBUG_NGO pgqs_entry_dealloc hash_get_num_entries(pgqs_hash) : 72340172838076673 2018-04-04 09:32:55 CEST [14327]: [6-1] user=postgres,db=test,client=[local],app=pgbench LOCATION: pgqs_entry_dealloc, pg_qualstats.c:827 2018-04-04 09:32:55 CEST [14327]: [7-1] user=postgres,db=test,client=[local],app=pgbench NOTICE: 01000: DEBUG_NGO pgqs_entry_dealloc hash_estimate_size(pgqs_hash) : 2893589252348059368 2018-04-04 09:32:55 CEST [14327]: [8-1] user=postgres,db=test,client=[local],app=pgbench LOCATION: pgqs_entry_dealloc, pg_qualstats.c:828 2018-04-04 09:32:55 CEST [14327]: [9-1] user=postgres,db=test,client=[local],app=pgbench ERROR: XX000: invalid memory alloc request size 578721382704613384 2018-04-04 09:32:55 CEST [14327]: [10-1] user=postgres,db=test,client=[local],app=pgbench LOCATION: palloc, mcxt.c:900 2018-04-04 09:32:55 CEST [14327]: [11-1] user=postgres,db=test,client=[local],app=pgbench STATEMENT: UPDATE pgbench_accounts SET filler = 28559 where aid = 30566 ;

  5. Until it crashed:

    2018-04-04 09:32:56 CEST [14464]: [6-1] user=,db=,client=,app= LOG: 00000: server process (PID 14550) was terminated by signal 11: Segmentation fault 2018-04-04 09:32:56 CEST [14464]: [7-1] user=,db=,client=,app= DETAIL: Failed process was running: UPDATE pgbench_accounts SET filler = 33748 where aid = 79882 ; 2018-04-04 09:32:56 CEST [14464]: [8-1] user=,db=,client=,app= LOCATION: LogChildExit, postmaster.c:3574 2018-04-04 09:32:56 CEST [14464]: [9-1] user=,db=,client=,app= LOG: 00000: terminating any other active server processes 2018-04-04 09:32:56 CEST [14464]: [10-1] user=,db=,client=,app= LOCATION: HandleChildCrash, postmaster.c:3294

Core was generated by `postgres: postgres test [local] UPDATE '. Program terminated with signal 11, Segmentation fault.

0 0x00000000007d553d in get_hash_entry (freelist_idx=0, hashp=0x1f6d5e8) at dynahash.c:1283

1283 dynahash.c

the coredump : bt full

#0  0x00000000007d553d in get_hash_entry (freelist_idx=0, hashp=0x1df6638) at dynahash.c:1283
        hctl = 0x7f72c542f380
        newElement = 0x1000000
        borrow_from_idx = <optimized out>
#1  hash_search_with_hash_value (hashp=0x1df6638, keyPtr=keyPtr@entry=0x1ed5518, hashvalue=2293496263, 
    action=action@entry=HASH_ENTER, foundPtr=foundPtr@entry=0x7ffed465557f "") at dynahash.c:1032
        hctl = <optimized out>
        keysize = 24
        bucket = <optimized out>
        segment_num = <optimized out>
        segment_ndx = <optimized out>
        segp = <optimized out>
        currBucket = <optimized out>
        prevBucketPtr = 0x7f72c5430138
        match = <optimized out>
        freelist_idx = <optimized out>
        __func__ = "hash_search_with_hash_value"
#2  0x00000000007d55bd in hash_search (hashp=<optimized out>, keyPtr=keyPtr@entry=0x1ed5518, 
    action=action@entry=HASH_ENTER, foundPtr=foundPtr@entry=0x7ffed465557f "") at dynahash.c:889
No locals.
#3  0x00007f72c5d93f64 in pgqs_ExecutorEnd (queryDesc=0x1ec7908) at pg_qualstats.c:751
        nvictims = <optimized out>
        info = {num_partitions = 0, ssize = 0, dsize = 0, max_dsize = 0, ffactor = 0, keysize = 24, 
          entrysize = 176, hash = 0x7f72c5d91860 <pgqs_hash_fn>, match = 0x0, keycopy = 0x0, 
          alloc = 0x0, hcxt = 0x0, hctl = 0x0}
        newEntry = <optimized out>
        local_hash_seq = {hashp = 0x1ecfe18, curBucket = 8, curEntry = 0x0}
        context = <optimized out>
        localentry = 0x1ed5518
        queryKey = {queryid = 2441476858}
        queryEntry = <optimized out>
        found = 0 '\000'
        __func__ = "pgqs_ExecutorEnd"
 line 751: newEntry = (pgqsEntry *) hash_search(pgqs_hash, &localentry->key, HASH_ENTER, &found);
#4  0x00000000006d4ad0 in ProcessQuery (plan=<optimized out>, 
    sourceText=0x1e8cdb8 "UPDATE pgbench_accounts SET filler = 33748 where aid = 79882 ;", params=0x0, 
    dest=0x1ec2ab0, completionTag=0x7ffed4655a90 "UPDATE 1") at pquery.c:232
        queryDesc = 0x1ec7908
        __func__ = "ProcessQuery"
#5  0x00000000006d4c7d in PortalRunMulti (portal=portal@entry=0x1e14848, 
    isTopLevel=isTopLevel@entry=1 '\001', setHoldSnapshot=setHoldSnapshot@entry=0 '\000', 
    dest=dest@entry=0x1ec2ab0, altdest=altdest@entry=0x1ec2ab0, 
    completionTag=completionTag@entry=0x7ffed4655a90 "UPDATE 1") at pquery.c:1303
        pstmt = 0x1ec13b0
        stmt = 0x1ec13b0
        active_snapshot_set = 1 '\001'
        stmtlist_item = 0x1ec2a60
#6  0x00000000006d575f in PortalRun (portal=0x1e14848, count=9223372036854775807, 
    isTopLevel=<optimized out>, dest=0x1ec2ab0, altdest=0x1ec2ab0, 
    completionTag=0x7ffed4655a90 "UPDATE 1") at pquery.c:815
        save_exception_stack = 0x7ffed46559a0
        save_context_stack = 0x0
        local_sigjmp_buf = {{__jmpbuf = {0, 3810215649922779956, 32037632, 32254640, 2, 31418472, 
              -3810834759769650380, 3810214515159468852}, __mask_was_saved = 0, __saved_mask = {
              __val = {8215492, 1, 31951320, 9795934, 31541320, 9795934, 1, 1, 88, 31541320, 32037632, 
                9678926, 8304801, 32037552, 2, 31541320}}}}
        result = <optimized out>
        nprocessed = <optimized out>
        saveTopTransactionResourceOwner = 0x1e34798
        saveTopTransactionContext = 0x1df6868
        saveActivePortal = 0x0
        saveResourceOwner = 0x1e34798
        savePortalContext = 0x0
        saveMemoryContext = 0x1df6868
        __func__ = "PortalRun"
#7  0x00000000006d34b6 in PostgresMain (argc=<optimized out>, argv=<optimized out>, 
    dbname=<optimized out>, username=<optimized out>) at postgres.c:1086
        firstchar = 32037552
        input_message = {
          data = 0x1e8cdb8 "UPDATE pgbench_accounts SET filler = 33748 where aid = 79882 ;", len = 63, 
          maxlen = 1024, cursor = 63}
        local_sigjmp_buf = {{__jmpbuf = {31626792, 3810213719076113204, 1, 0, 0, 31588352, 
              -3810834759415231692, 3810214517272742708}, __mask_was_saved = 1, __saved_mask = {
              __val = {0, 31588352, 140131109176696, 0, 206158430256, 140732461832944, 140732461832736, 
                31548584, 16, 31626928, 48, 31548584, 31626792, 0, 31626880, 31626928}}}}
        send_ready_for_query = 0 '\000'
        disable_idle_in_transaction_timeout = <optimized out>
        __func__ = "PostgresMain"
#8  0x000000000046f934 in BackendRun (port=0x1e20340) at postmaster.c:4342
        ac = 1
        secs = 576143237
        usecs = 217392
        i = 1
        av = 0x1e296e8
        maxac = <optimized out>
#9  BackendStartup (port=0x1e20340) at postmaster.c:4016
        bn = 0x1e20000
        pid = 0
#10 ServerLoop () at postmaster.c:1721
        rmask = {fds_bits = {32, 0 <repeats 15 times>}}
        selres = <optimized out>
        now = <optimized out>
        readmask = {fds_bits = {120, 0 <repeats 15 times>}}
        last_lockfile_recheck_time = 1522827994
        last_touch_time = 1522827154
        __func__ = "ServerLoop"
#11 0x00000000006764b9 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x1df5280)
    at postmaster.c:1329
        opt = <optimized out>
        status = <optimized out>
        userDoption = <optimized out>
        listen_addr_saved = <optimized out>
        i = <optimized out>
        output_config_variable = <optimized out>
        __func__ = "PostmasterMain"
#12 0x000000000047059e in main (argc=3, argv=0x1df5280) at main.c:228
No locals.

When the "pg_qualstats" extension is disabled the problem does not occur.

Regards,

ng-pe commented 6 years ago

New test : Postgresql 10.3 : KO Posgresql with this commit : https://github.com/postgres/postgres/commit/ab2324fd468c4bc6d8b012552550ed951d97339a#diff-b69bc548b48803f3b79d575d60062556 : KO

pg_qualstats 1.0.2: no bugs

rjuju commented 6 years ago

Thanks a lot for bug report and all off-list additional information. I just pushed https://github.com/powa-team/pg_qualstats/commit/68c1dacbc57de664af4ad9b6e54a4b804f2d729e which fix this issue. Can you confirm that it also fix it for you on all your servers?

ng-pe commented 6 years ago

For me it's okay :) I run a stress test on 2 environments and I no longer observe any crashes.

I confirm it's fixed. Good job rjuju :)

rjuju commented 6 years ago

Thanks! :)