oss-tsukuba / gfarm

distributed file system for large-scale cluster computing and wide-area data sharing. provides fine-grained replica location control.
Other
32 stars 12 forks source link

invalid call of PQclear(res) in gfarm_pgsql_start_with_retry() #67

Closed gfarm-admin closed 5 years ago

gfarm-admin commented 14 years ago

The test script in #66 with "export MALLOCCHECK=3",
the following symptoms of invaild free() are detected:

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x44e2f940 (LWP 3978)]
0x0000003136a30265 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003136a30265 in raise () from /lib64/libc.so.6
#1  0x0000003136a31d10 in abort () from /lib64/libc.so.6
#2  0x0000003136a6a84b in __libc_message () from /lib64/libc.so.6
#3  0x0000003136a75d81 in free_check () from /lib64/libc.so.6
#4  0x0000003136a727c1 in free () from /lib64/libc.so.6
#5  0x0000000000430c00 in gfarm_pgsql_start_with_retry (
    diag=0x43f5e5 "pgsql_begin") at ./db_pgsql.c:445
#6  0x000000000042890b in db_thread (arg=<value optimized out>)
    at db_access.c:317
#7  0x0000003137206617 in start_thread () from /lib64/libpthread.so.0
#8  0x0000003136ad3c2d in clone () from /lib64/libc.so.6

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x4cb5a940 (LWP 24408)]
0x0000003136a30265 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003136a30265 in raise () from /lib64/libc.so.6
#1  0x0000003136a31d10 in abort () from /lib64/libc.so.6
#2  0x0000003136a6a84b in __libc_message () from /lib64/libc.so.6
#3  0x0000003136a75d81 in free_check () from /lib64/libc.so.6
#4  0x0000003136a727c1 in free () from /lib64/libc.so.6
#5  0x000000000041c0b8 in inode_close_read (fo=<value optimized out>, 
    atime=0x0) at inode.c:1961
        inode_open_state_free(inode->u.c.state);
#6  0x0000000000412bbd in process_close_file (process=0x148de3e0, 
    peer=0x2aaaaab29d30, fd=1) at process.c:774
#7  0x000000000040eb40 in peer_fdpair_clear (peer=0x2aaaaab29d30) at peer.c:853
#8  0x0000000000406f1f in protocol_service (peer=0x2aaaaab29d30) at gfmd.c:600
#9  0x0000000000406f4c in protocol_main (arg=<value optimized out>)
    at gfmd.c:647
#10 0x0000000000407653 in thrpool_worker (arg=<value optimized out>)
    at thrpool.c:142
#11 0x0000003137206617 in start_thread () from /lib64/libpthread.so.0
#12 0x0000003136ad3c2d in clone () from /lib64/libc.so.6

(gdb) frame 5
#5  0x000000000041c0b8 in inode_close_read (fo=<value optimized out>, 
    atime=0x0) at inode.c:1961
1961                    inode_open_state_free(inode->u.c.state);
(gdb) print *inode->u.c.state
$1 = {openings = {opening_prev = 0x14930fb0, opening_next = 0x14930fb0, 
    inode = 0x3f847ae147ae147b, flag = 0, opener = 0x2931cf0, u = {f = {
        spool_opener = 0xb057088, spool_host = 0x148a7250, 
        replicating = 0x4ba100ef}, d = {offset = 184905864, 
        key = 0x148a7250 "@p\212\024"}}}, u = {f = {cksum_owner = 0x0, 
      event_waiters = 0x0, event_source = 0x0, last_update = {tv_sec = 0, 
        tv_nsec = 0}, writers = 0}}}
(gdb) print &inode->u.c.state->openings
$2 = (struct file_opening *) 0x14930fb0
(gdb) print ios->openings.opening_next
$3 = (struct file_opening *) 0x14930fb0
(gdb) print ios
$4 = (struct inode_open_state *) 0x14930fb0
(gdb) print inode->u.c.state
$5 = (struct inode_open_state *) 0x14930fb0
(gdb) print *inode
$6 = {i_number = 13, i_gen = 210, i_nlink = 2, i_size = 0, 
  i_user = 0x148a8050, i_group = 0x148baaa0, i_atimespec = {
    tv_sec = 1268823661, tv_nsec = 220997000}, i_mode = 16877, i_mtimespec = {
    tv_sec = 1268842811, tv_nsec = 393669000}, i_ctimespec = {
    tv_sec = 1268378518, tv_nsec = 356965000}, i_xattrs = {head = 0x0, 
    tail = 0x0}, i_xmlattrs = {head = 0x0, tail = 0x0}, u = {l = {
      prev = 0x148c3c30, next = 0x0}, c = {s = {f = {copies = 0x148c3c30, 
          cksum = 0x0, rstate = 0x0}, d = {entries = 0x148c3c30}, l = {
          source_path = 0x148c3c30 "\220X\214\024"}}, state = 0x14930fb0}}}
(gdb) print/x inode->i_mode
$7 = 0x41ed
(gdb) print/o inode->i_mode
$8 = 040755
(gdb) up
#6  0x0000000000412bbd in process_close_file (process=0x148de3e0, 
    peer=0x2aaaaab29d30, fd=1) at process.c:774
774             inode_close(fo);
(gdb) print *process
$9 = {siblings = {next = 0x148de3e0, prev = 0x148de3e0}, children = {
    next = 0x148de3f0, prev = 0x148de3f0}, parent = 0x0, 
  sharedkey = "[snip]", pid = 2, 
  user = 0x148a8050, refcount = 8, nfiles = 16, filetab = 0x148dd7e0}
(gdb) print *process->user
$10 = {ui = {username = 0x148a7a50 "soda", 
    realname = 0x148a8480 "Gfarm administrator", homedir = 0x148a8460 "/", 
    gsi_dn = 0x148a84a0 ""}, groups = {user_prev = 0x2e636370682e696e, 
    user_next = 0x706a, group_prev = 0x148a8680, group_next = 0x148a8680, 
    u = 0x322e35736f746e65, g = 0x78756e696c2d}, q = {on_db = 0, 
    grace_period = -1, space = -1, space_exceed = -1, space_soft = -1, 
    space_hard = -1, num = 0, num_exceed = -1, 
    num_soft = 18446744073709551615, num_hard = 18446744073709551615, 
    phy_space = 0, phy_space_exceed = -1, phy_space_soft = -1, 
    phy_space_hard = -1, phy_num = 0, phy_num_exceed = -1, 
    phy_num_soft = 18446744073709551615, phy_num_hard = 18446744073709551615}, 
  invalid = 0}
(gdb) print *process->filetab
$11 = (struct file_opening *) 0x0
(gdb) print process->filetab[1]
$13 = (struct file_opening *) 0x1492efd0
(gdb) print *process->filetab[1]
$14 = {opening_prev = 0x14930fb0, opening_next = 0x14930fb0, 
  inode = 0x148c3b70, flag = 3, opener = 0x2aaaaab29d30, u = {f = {
      spool_opener = 0x0, spool_host = 0x0, replicating = 0x5a}, d = {
      offset = 0, key = 0x0}}}
(gdb) print process->filetab[1]->opener
$18 = (struct peer *) 0x2aaaaab29d30
(gdb) print *process->filetab[1]->opener
$19 = {next_close = 0x0, refcount = 0, conn = 0x148f2340, async = 0x0, 
  id_type = GFARM_AUTH_ID_TYPE_USER, username = 0x0, 
  hostname = 0x148f17d0 "les00.omni.hpcc.jp", user = 0x148a8050, host = 0x0, 
  process = 0x148de3e0, protocol_error = 0, 
  protocol_handler = 0x406f40 <protocol_main>, 
  handler_thread_pool = 0x1489a600, control = 1, pstate = {nesting_level = 1, 
    cs = {current_part = 0, cause = 0, skip = 0}}, fd_current = 1, 
  fd_saved = -1, flags = 0, findxmlattrctx = 0x0, 
  pending_new_generation = 0x0, u = {client = {jobs = 0x0}}}

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x4743e940 (LWP 11291)]
0x0000003136a6f8d8 in mem2chunk_check () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003136a6f8d8 in mem2chunk_check () from /lib64/libc.so.6
#1  0x0000003136a75c82 in free_check () from /lib64/libc.so.6
#2  0x0000003136a727c1 in free () from /lib64/libc.so.6
#3  0x0000000000422089 in gfm_server_schedule_file (peer=0x2aaaaab29790, 
    from_client=<value optimized out>, skip=<value optimized out>) at fs.c:1059
#4  0x000000000040687b in protocol_switch (peer=0x2aaaaab29790, from_client=1, 
    skip=0, level=1, requestp=0x4743e0a0, on_errorp=0x2aaaaab29800, 
    suspendedp=0x4743e09c) at gfmd.c:276
#5  0x0000000000406d99 in protocol_service (peer=0x2aaaaab29790) at gfmd.c:566
#6  0x0000000000406f8c in protocol_main (arg=<value optimized out>)
    at gfmd.c:647
#7  0x0000000000407693 in thrpool_worker (arg=<value optimized out>)
    at thrpool.c:142
#8  0x0000003137206617 in start_thread () from /lib64/libpthread.so.0
#9  0x0000003136ad3c2d in clone () from /lib64/libc.so.6
(gdb) frame 3
#3  0x0000000000422089 in gfm_server_schedule_file (peer=0x2aaaaab29790, 
    from_client=<value optimized out>, skip=<value optimized out>) at fs.c:1059
1059                    free(domain);

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x497c0940 (LWP 14933)]
0x0000003136a30265 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003136a30265 in raise () from /lib64/libc.so.6
#1  0x0000003136a31d10 in abort () from /lib64/libc.so.6
#2  0x0000003136a6a84b in __libc_message () from /lib64/libc.so.6
#3  0x0000003136a75d81 in free_check () from /lib64/libc.so.6
#4  0x0000003136a727c1 in free () from /lib64/libc.so.6
#5  0x0000000000419976 in inode_open_state_free (ios=0x17976800) at inode.c:481
#6  0x000000000041d290 in inode_close_read (fo=0x179755e0, atime=0x0)
    at inode.c:1973
#7  0x000000000041d1fd in inode_close (fo=0x179755e0) at inode.c:1958
#8  0x00000000004142cf in process_close_file (process=0x17972110, 
    peer=0x2aaaaab29a10, fd=0) at process.c:774
#9  0x000000000041111b in peer_fdpair_set_current (peer=0x2aaaaab29a10, fd=1)
    at peer.c:908
#10 0x00000000004212d4 in gfm_server_open_common (
    diag=0x4433d0 "GFM_PROTO_OPEN", peer=0x2aaaaab29a10, from_client=1, 
    name=0x179762c0 "load_test", flag=3, to_create=0, mode=0, 
    inump=0x497bffb0, genp=0x497bffa8, modep=0x497bffa4) at fs.c:294
#11 0x000000000042161c in gfm_server_open (peer=0x2aaaaab29a10, from_client=1, 
    skip=0) at fs.c:371
#12 0x0000000000405525 in protocol_switch (peer=0x2aaaaab29a10, from_client=1, 
    skip=0, level=1, requestp=0x497c0068, on_errorp=0x2aaaaab29a80, 
    suspendedp=0x497c0064) at gfmd.c:234
#13 0x0000000000405f49 in protocol_service (peer=0x2aaaaab29a10) at gfmd.c:566
#14 0x0000000000406156 in protocol_main (arg=0x2aaaaab29a10) at gfmd.c:647
#15 0x0000000000407cde in thrpool_worker (arg=0x1792c600) at thrpool.c:142
#16 0x0000003137206617 in start_thread () from /lib64/libpthread.so.0
#17 0x0000003136ad3c2d in clone () from /lib64/libc.so.6

(gdb) frame 5
#5  0x0000000000419976 in inode_open_state_free (ios=0x17976800) at inode.c:481
481             free(ios);
(gdb) print *ios
$1 = {openings = {opening_prev = 0x17976800, opening_next = 0x17976800, 
    inode = 0x0, flag = 0, opener = 0x0, u = {f = {spool_opener = 0x1ce6a24, 
        spool_host = 0xbca2354, replicating = 0x17938e40}, d = {
        offset = 30304804, 
        key = 0xbca2354 <Address 0xbca2354 out of bounds>}}}, u = {f = {
      cksum_owner = 0x0, event_waiters = 0x0, event_source = 0x0, 
      last_update = {tv_sec = 0, tv_nsec = 0}, writers = 0}}}
(gdb) frame 9
#9  0x000000000041111b in peer_fdpair_set_current (peer=0x2aaaaab29a10, fd=1)
    at peer.c:908
908                     process_close_file(peer->process, peer, peer->fd_current);

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x41a21940 (LWP 13959)]
0x0000003136a6f8d8 in mem2chunk_check () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003136a6f8d8 in mem2chunk_check () from /lib64/libc.so.6
#1  0x0000003136a75c82 in free_check () from /lib64/libc.so.6
#2  0x0000003136a727c1 in free () from /lib64/libc.so.6
#3  0x0000000000422870 in gfm_server_fstat (peer=0x2aaaaab29b70, 
    from_client=1, skip=0) at fs.c:678
#4  0x000000000040563d in protocol_switch (peer=0x2aaaaab29b70, from_client=1, 
    skip=0, level=1, requestp=0x41a21068, on_errorp=0x2aaaaab29c0c, 
    suspendedp=0x41a21064) at gfmd.c:258
#5  0x0000000000405fa9 in protocol_service (peer=0x2aaaaab29b70) at gfmd.c:566
#6  0x00000000004061b6 in protocol_main (arg=0x2aaaaab29b70) at gfmd.c:647
#7  0x0000000000407d52 in thrpool_worker (arg=0x1c2bd600) at thrpool.c:142
#8  0x0000003137206617 in start_thread () from /lib64/libpthread.so.0
#9  0x0000003136ad3c2d in clone () from /lib64/libc.so.6

(gdb) bt
#0  0x0000003136a30265 in raise () from /lib64/libc.so.6
#1  0x0000003136a31d10 in abort () from /lib64/libc.so.6
#2  0x0000003136a6a84b in __libc_message () from /lib64/libc.so.6
#3  0x0000003136a75d81 in free_check () from /lib64/libc.so.6
#4  0x0000003136a727c1 in free () from /lib64/libc.so.6
#5  0x000000000040eb67 in gfm_server_host_info_get_by_names_common (
    peer=0x2aaaaab2a7a0, from_client=0, skip=0, 
    lookup=0x40c8ef <host_namealiases_lookup>, 
    diag=0x43f7a1 "host_info_get_by_namealiases") at host.c:1385
#6  0x000000000040ebe6 in gfm_server_host_info_get_by_namealiases (
    peer=0x2aaaaab2a7a0, from_client=0, skip=0) at host.c:1402
#7  0x000000000040531f in protocol_switch (peer=0x2aaaaab2a7a0, from_client=0, 
    skip=0, level=0, requestp=0x4770c068, on_errorp=0x4770c06c, 
    suspendedp=0x4770c064) at gfmd.c:150
#8  0x0000000000405efa in protocol_service (peer=0x2aaaaab2a7a0) at gfmd.c:543
#9  0x00000000004061b6 in protocol_main (arg=0x2aaaaab2a7a0) at gfmd.c:647
#10 0x0000000000407d52 in thrpool_worker (arg=0xce96600) at thrpool.c:142
#11 0x0000003137206617 in start_thread () from /lib64/libpthread.so.0
#12 0x0000003136ad3c2d in clone () from /lib64/libc.so.6
(gdb) frame 5
#5  0x000000000040eb67 in gfm_server_host_info_get_by_names_common (
    peer=0x2aaaaab2a7a0, from_client=0, skip=0, 
    lookup=0x40c8ef <host_namealiases_lookup>, 
    diag=0x43f7a1 "host_info_get_by_namealiases") at host.c:1385
1385                    free(hosts[i]);
(gdb) print hosts[i]
$1 = 0xd29ac40 "les-serv"
(gdb) print i
$2 = 0
(gdb) print nhosts
$3 = 1

The Cause

This is a bug of glibc:
http://sourceware.org/bugzilla/show_bug.cgi?id=10282
(pointed out by tatebe@-san)

Running the test program in the bugzilla #10282 under gdb easily triggers this symptom even without using Gfarm.


The Countermeasure

This is fixed in glibc-current as follows:
http://sourceware.org/git/?p=glibc.git;a=commitdiff;h=cc49a5a8837be1f9307b167d9bf4399798a847c9

Not setting the MALLOCCHECK environment variable can workaround this problem.
Thus, ${MALLOCCHECK} shouldn't be set until glibc is fixed.

(pointed out by tatebe@-san)

Reported by: n-soda

Original Ticket: gfarm/tickets/67

gfarm-admin commented 14 years ago

this bug was introduced in r4412, since r4411 didn't have this problem,

Original comment by: n-soda

gfarm-admin commented 14 years ago

Original comment by: n-soda

gfarm-admin commented 14 years ago

add some more symptoms, which are not related to libpq, to the description

Original comment by: n-soda

gfarm-admin commented 14 years ago

typo in the summary field

Original comment by: n-soda

gfarm-admin commented 14 years ago

tatebe@-san pointed out that this is a bug of glibc.

update the description field about the glibc bug.

Original comment by: n-soda

gfarm-admin commented 14 years ago

clearify about the test program and the state of RHEL5/CentOS 5

Original comment by: n-soda

gfarm-admin commented 14 years ago

The following part was incorrect (I looked at wrong source, glibc-2.5 official one, instead of Red Hat 5 source RPM), thus removed the sentence from the description:

It seems glibc in Red Hat Enterprise 5 and CentOS 5 already includes the relevant fix,
but running the test program in the bugzilla #10282 under gdb still triggers this symptom.

Original comment by: n-soda