oss-tsukuba / gfarm

distributed file system for large-scale cluster computing and wide-area data sharing. provides fine-grained replica location control.
Other
32 stars 12 forks source link

gfmd hangs at startup, if many matadata are corrected in inode_check_and_repair() #353

Open gfarm-admin opened 12 years ago

gfarm-admin commented 12 years ago

this problem was originally found on r6030 + local modification for #341 and #349,
but have existed since gfarm-2.5.0, if metadata replication is enabled.

In this test environment,
metadb_journal_max_size is shrinked for testing as follows:

metadb_journal_max_size 5120

and gfjournal says it's empty:

records  seqnum(min/max)            record length(min/max/ave)
      0             0/           0     4294967295/      0/      0

log message at the hang is:

[1003273] start boot-apply db journal
[1003275] end boot-apply db journal
[1002835] inode 349: dot didn't exist: fixed
[1002491] inode 349 is not referenced, moving to /lost+found
[1002483] create /lost+found directory
[1002813] inode 349: dotdot didn't exist: fixed to 347
[1002835] inode 351: dot didn't exist: fixed
[1002491] inode 351 is not referenced, moving to /lost+found
[1002813] inode 351: dotdot didn't exist: fixed to 347
[1002835] inode 352: dot didn't exist: fixed
[1002491] inode 352 is not referenced, moving to /lost+found

The above log message were results of bug #358.
But as far as many metadata are corrected in inode_check_and_repair(),
gfmd hangs.

stack dump by gdb is:

Thread 12 (Thread 0x40a5d940 (LWP 9168)):
#0  0x0000003f5220aee9 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00002acba81d0659 in gfarm_cond_wait (cond=0x69164c, mutex=0x80, 
    where=0x1 <Address 0x1 out of bounds>, 
    what=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>)
    at ./thrsubr.c:65
#2  0x000000000044cf5a in dbq_delete (q=0x691620, entp=0x40a5d0d0)
    at db_access.c:316
#3  0x000000000044d04d in db_thread (arg=<value optimized out>)
    at db_access.c:423
#4  0x0000003f5220673d in start_thread () from /lib64/libpthread.so.0
#5  0x0000003f51ad44bd in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x4145e940 (LWP 9169)):
#0  0x0000003f5220e838 in do_sigwait () from /lib64/libpthread.so.0
#1  0x0000003f5220e8dd in sigwait () from /lib64/libpthread.so.0
#2  0x00000000004140bb in sigs_handler (p=<value optimized out>) at gfmd.c:1311
#3  0x0000003f5220673d in start_thread () from /lib64/libpthread.so.0
#4  0x0000003f51ad44bd in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x41e5f940 (LWP 9170)):
#0  0x0000003f51ad48a8 in epoll_wait () from /lib64/libc.so.6
#1  0x00002acba81cd771 in gfarm_eventqueue_turn (q=0xd1ee730, 
    timeo=<value optimized out>) at ./gfevent.c:546
#2  0x00002acba81cd9c9 in gfarm_eventqueue_loop (q=0xd1ee730, timeo=0x0)
    at ./gfevent.c:698
#3  0x0000000000417d02 in watcher (arg=<value optimized out>) at watcher.c:378
#4  0x0000003f5220673d in start_thread () from /lib64/libpthread.so.0
#5  0x0000003f51ad44bd in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x42860940 (LWP 9171)):
#0  0x0000003f5220b150 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x0000000000416c7f in callout_main (arg=<value optimized out>)
    at callout.c:78
#2  0x0000003f5220673d in start_thread () from /lib64/libpthread.so.0
#3  0x0000003f51ad44bd in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x43261940 (LWP 9173)):
#0  0x0000003f51ad48a8 in epoll_wait () from /lib64/libc.so.6
#1  0x00002acba81cd771 in gfarm_eventqueue_turn (q=0xd200780, 
    timeo=<value optimized out>) at ./gfevent.c:546
#2  0x00002acba81cd9c9 in gfarm_eventqueue_loop (q=0xd200780, timeo=0x0)
    at ./gfevent.c:698
#3  0x0000000000417d02 in watcher (arg=<value optimized out>) at watcher.c:378
#4  0x0000003f5220673d in start_thread () from /lib64/libpthread.so.0
#5  0x0000003f51ad44bd in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x43c62940 (LWP 9174)):
#0  0x0000003f5220aee9 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00002acba81d0659 in gfarm_cond_wait (cond=0x6908ac, mutex=0x80, 
    where=0x1 <Address 0x1 out of bounds>, 
    what=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>)
    at ./thrsubr.c:65
#2  0x0000000000421853 in dfc_workq_dequeue (q=0x690880, 
    new_state=dfcstate_in_flight, diag=0x46d950 "removal_pendingq_dequeue")
    at dead_file_copy.c:160
#3  0x0000000000441446 in remover (closure=0x6908ac) at back_channel.c:310
#4  0x0000003f5220673d in start_thread () from /lib64/libpthread.so.0
#5  0x0000003f51ad44bd in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x44663940 (LWP 9175)):
#0  0x0000003f51ad48a8 in epoll_wait () from /lib64/libc.so.6
#1  0x00002acba81cd771 in gfarm_eventqueue_turn (q=0xd200cf0, 
    timeo=<value optimized out>) at ./gfevent.c:546
#2  0x00002acba81cd9c9 in gfarm_eventqueue_loop (q=0xd200cf0, timeo=0x0)
    at ./gfevent.c:698
#3  0x0000000000417d02 in watcher (arg=<value optimized out>) at watcher.c:378
#4  0x0000003f5220673d in start_thread () from /lib64/libpthread.so.0
#5  0x0000003f51ad44bd in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x45064940 (LWP 9176)):
#0  0x0000003f5220aee9 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00002acba81d0659 in gfarm_cond_wait (cond=0x69068c, mutex=0x80, 
    where=0x1 <Address 0x1 out of bounds>, 
    what=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>)
    at ./thrsubr.c:65
#2  0x0000000000421853 in dfc_workq_dequeue (q=0x690660, 
    new_state=dfcstate_finalizing, diag=0x46d5c0 "removal_finishedq_dequeue")
    at dead_file_copy.c:160
#3  0x0000000000422699 in removal_finishedq_dequeue (arg=<value optimized out>)
    at dead_file_copy.c:276
#4  removal_finalizer (arg=<value optimized out>) at dead_file_copy.c:324
#5  0x0000003f5220673d in start_thread () from /lib64/libpthread.so.0
#6  0x0000003f51ad44bd in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x45a65940 (LWP 9177)):
#0  0x0000003f5220aee9 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00002acba81d0659 in gfarm_cond_wait (cond=0x69052c, mutex=0x80, 
    where=0x1 <Address 0x1 out of bounds>, 
    what=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>)
    at ./thrsubr.c:65
#2  0x0000000000420272 in peer_closer (arg=<value optimized out>) at peer.c:259
#3  0x0000003f5220673d in start_thread () from /lib64/libpthread.so.0
#4  0x0000003f51ad44bd in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x46466940 (LWP 9178)):
#0  0x0000003f5220aee9 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00002acba81d0659 in gfarm_cond_wait (cond=0x690d8c, mutex=0x80, 
    where=0x1 <Address 0x1 out of bounds>, 
    what=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>)
    at ./thrsubr.c:65
#2  0x0000000000413aba in resuming_dequeue (q=0x690d60, 
    diag=0x46a3cc "resumer") at gfmd.c:854
#3  0x00000000004162e0 in resumer (arg=<value optimized out>) at gfmd.c:889
#4  0x0000003f5220673d in start_thread () from /lib64/libpthread.so.0
#5  0x0000003f51ad44bd in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x46e67940 (LWP 9179)):
#0  0x0000003f5220aee9 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00002acba81d0659 in gfarm_cond_wait (cond=0xd1f7a8c, mutex=0x80, 
    where=0x1 <Address 0x1 out of bounds>, 
    what=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>)
    at ./thrsubr.c:65
#2  0x0000000000445f06 in journal_file_read (reader=0xd1fbbd0, op_arg=0x0, 
    read_op=0x45b5f0 <db_journal_read_ops>, 
    post_read_op=0x459310 <db_journal_add_rec>, free_op=0, closure=0x46e670c0, 
    eofp=0x0) at journal_file.c:1794
#3  0x000000000045a059 in db_journal_store_thread (arg=<value optimized out>)
    at db_journal.c:3611
#4  0x0000003f5220673d in start_thread () from /lib64/libpthread.so.0
#5  0x0000003f51ad44bd in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x2acba865ee30 (LWP 9166)):
#0  0x0000003f5220aee9 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00002acba81d0659 in gfarm_cond_wait (cond=0xd1f7a5c, mutex=0x80, 
    where=0x1 <Address 0x1 out of bounds>, 
    what=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>)
    at ./thrsubr.c:65
#2  0x0000000000443aca in journal_file_reader_writer_wait (reader=0xd1fbbd0, 
    jf=0xd1f79c0, rec_len=44) at journal_file.c:705
#3  0x00000000004466db in journal_file_write (jf=0xd1f79c0, seqnum=144141, 
    ope=GFM_JOURNAL_INODE_MTIME_MODIFY, arg=0xd24d150, 
    size_add_op=<value optimized out>, 
    send_op=0x4566f0 <db_journal_write_inode_timespec_modify_core>)
    at journal_file.c:1608
#4  0x0000000000455766 in db_journal_write (seqnum=144141, 
    ope=GFM_JOURNAL_INODE_MTIME_MODIFY, arg=0xd24d150, 
    size_add_op=0x457e00 <db_journal_write_inode_timespec_modify_size_add>, 
    send_op=0x4566f0 <db_journal_write_inode_timespec_modify_core>)
    at db_journal.c:535
#5  0x000000000042f5fe in inode_set_mtime (inode=0xd232b10, 
    mtime=0x7fff544f9e20) at inode.c:1424
#6  0x000000000042f664 in inode_modified (inode=0xd232b10) at inode.c:1481
#7  0x0000000000431db5 in inode_lookup_basename (n=0xd232b10, 
    name=0x691040 '0' <repeats 13 times>, "16", '0' <repeats 16 times>, "1", 
    expected_type=0, op=INODE_LINK, user=0xd201a60, new_mode=0, 
    symlink_src=0x0, inp=0x7fff544f9fb0, createdp=0x0) at inode.c:1919
#8  inode_lookup_relative (n=0xd232b10, 
    name=0x691040 '0' <repeats 13 times>, "16", '0' <repeats 16 times>, "1", 
    expected_type=0, op=INODE_LINK, user=0xd201a60, new_mode=0, 
    symlink_src=0x0, inp=0x7fff544f9fb0, createdp=0x0) at inode.c:2133
#9  0x00000000004329a6 in inode_create_link_internal (base=0xd1f7a5c, 
    name=0x80 <Address 0x80 out of bounds>, user=0xd1f7ae8, inode=0xd2328a0)
    at inode.c:2322
#10 0x000000000043306e in inode_link_to_lost_found (inode=0xd2328a0)
    at inode.c:2500
#11 inode_link_to_lost_found_and_report (inode=0xd2328a0) at inode.c:2521
#12 0x0000000000433510 in inode_check_and_repair_dir (closure=0x7fff544fa08c, 
    inode=0xd2328a0) at inode.c:4784
#13 0x000000000042a17f in inode_lookup_all (closure=0x7fff544fa08c, 
    callback=0x4332c0 <inode_check_and_repair_dir>) at inode.c:1053
#14 0x0000000000432eaa in inode_check_and_repair () at inode.c:4904
#15 0x0000000000414895 in main (argc=0, argv=0x7fff544fa290) at gfmd.c:1672

and when SIGTERM is sent to the gfmd, the gfmd aborts as follows:

[1000199] signal 15 received: terminating...
[1000201] shutting down peers
[1003250] (db_access.c:450 db_begin()) Assertion 'transaction_nesting == 0' failed.
[1003405] (./backtrace.c:26 gfarm_log_backtrace_symbols()) backtrace symbols [1/8]: /home/soda/work/gfarm-trunk/lib/libgfarm.so.1(gfarm_log_backtrace_symbols+0x1f) [0x2ae57a9a246f]
[1003405] (./backtrace.c:26 gfarm_log_backtrace_symbols()) backtrace symbols [2/8]: /home/soda/work/gfarm-trunk/lib/libgfarm.so.1(gflog_fatal_message+0x83) [0x2ae57a9a4bb3]
[1003405] (./backtrace.c:26 gfarm_log_backtrace_symbols()) backtrace symbols [3/8]: /home/soda/work/gfarm-trunk/lib/libgfarm.so.1(gfarm_assert_fail+0x53) [0x2ae57a9a2433]
[1003405] (./backtrace.c:26 gfarm_log_backtrace_symbols()) backtrace symbols [4/8]: /home/soda/work/gfarm-trunk/sbin/gfmd(db_begin+0x99) [0x44fa19]
[1003405] (./backtrace.c:26 gfarm_log_backtrace_symbols()) backtrace symbols [5/8]: /home/soda/work/gfarm-trunk/sbin/gfmd(gfmd_terminate+0x38) [0x413fd8]
[1003405] (./backtrace.c:26 gfarm_log_backtrace_symbols()) backtrace symbols [6/8]: /home/soda/work/gfarm-trunk/sbin/gfmd(sigs_handler+0x109) [0x414199]
[1003405] (./backtrace.c:26 gfarm_log_backtrace_symbols()) backtrace symbols [7/8]: /lib64/libpthread.so.0 [0x3f5220673d]
[1003405] (./backtrace.c:26 gfarm_log_backtrace_symbols()) backtrace symbols [8/8]: /lib64/libc.so.6(clone+0x6d) [0x3f51ad44bd]

Reported by: n-soda

Original Ticket: gfarm/tickets/353

gfarm-admin commented 12 years ago

Original comment by: n-soda

gfarm-admin commented 12 years ago

Original comment by: tbarbar

gfarm-admin commented 12 years ago

tbaba@-san confirmed that the 2.5 release branch has this problem too.

Original comment by: n-soda

gfarm-admin commented 12 years ago

Original comment by: n-soda

gfarm-admin commented 12 years ago

Milestone gfarm-2.5.4.2 deleted

Original comment by: anonymous at SourceForge

gfarm-admin commented 12 years ago

Original comment by: otatebe