oss-tsukuba / gfarm

distributed file system for large-scale cluster computing and wide-area data sharing. provides fine-grained replica location control.

[gfmd] assertion failure during replication #372

Closed gfarm-admin closed 4 years ago

gfarm-admin commented 12 years ago

gfmd aborts with an assertion failure during a stress test run with gfstress.rb.

Mar 19 05:05:40 garnet gfmd[9052]: <debug> [1002244] inode(2739) updated while replication: mtime 1332101141.000000000/1332101141.000000000, size: 1048576/1048576, gen:47620046167474193/178
Mar 19 05:05:40 garnet gfmd[9052]: <debug> [1002488] remove_replica(2739, 47620046167474193, garnet.tatebe.net): old, current=178
Mar 19 05:05:40 garnet gfmd[9052]: <err> [1003250] (inode.c:3280 file_replicating_free()) Assertion 'inode_is_file(inode)' failed.
Mar 19 05:05:40 garnet gfmd[9052]: <info> [1003405] (./backtrace.c:26 gfarm_log_backtrace_symbols()) backtrace symbols [1/13]: /usr/local/gfarm-2.5.3/lib/libgfarm.so.1(gfarm_log_backtrace_symbols+0x2d) [0x604eed]
Mar 19 05:05:40 garnet gfmd[9052]: <info> [1003405] (./backtrace.c:26 gfarm_log_backtrace_symbols()) backtrace symbols [2/13]: /usr/local/gfarm-2.5.3/lib/libgfarm.so.1(gflog_fatal_message+0x4c) [0x6078cc]
Mar 19 05:05:40 garnet gfmd[9052]: <info> [1003405] (./backtrace.c:26 gfarm_log_backtrace_symbols()) backtrace symbols [3/13]: /usr/local/gfarm-2.5.3/lib/libgfarm.so.1(gfarm_assert_fail+0x58) [0x604eb8]
Mar 19 05:05:40 garnet gfmd[9052]: <info> [1003405] (./backtrace.c:26 gfarm_log_backtrace_symbols()) backtrace symbols [4/13]: /usr/local/gfarm-2.5.3/sbin/gfmd(file_replicating_free+0x4b) [0x807705b]
Mar 19 05:05:40 garnet gfmd[9052]: <info> [1003405] (./backtrace.c:26 gfarm_log_backtrace_symbols()) backtrace symbols [5/13]: /usr/local/gfarm-2.5.3/sbin/gfmd(process_replica_added+0x1b9) [0x806a179]
Mar 19 05:05:40 garnet gfmd[9052]: <info> [1003405] (./backtrace.c:26 gfarm_log_backtrace_symbols()) backtrace symbols [6/13]: /usr/local/gfarm-2.5.3/sbin/gfmd(gfm_server_replica_added_common+0xca) [0x8087a6a]
Mar 19 05:05:40 garnet gfmd[9052]: <info> [1003405] (./backtrace.c:26 gfarm_log_backtrace_symbols()) backtrace symbols [7/13]: /usr/local/gfarm-2.5.3/sbin/gfmd(gfm_server_replica_added2+0xa6) [0x8087dd6]
Mar 19 05:05:40 garnet gfmd[9052]: <info> [1003405] (./backtrace.c:26 gfarm_log_backtrace_symbols()) backtrace symbols [8/13]: /usr/local/gfarm-2.5.3/sbin/gfmd(protocol_switch+0xcbc) [0x8058a2c]
Mar 19 05:05:40 garnet gfmd[9052]: <info> [1003405] (./backtrace.c:26 gfarm_log_backtrace_symbols()) backtrace symbols [9/13]: /usr/local/gfarm-2.5.3/sbin/gfmd(protocol_service+0xc0) [0x8058f30]
Mar 19 05:05:40 garnet gfmd[9052]: <info> [1003405] (./backtrace.c:26 gfarm_log_backtrace_symbols()) backtrace symbols [10/13]: /usr/local/gfarm-2.5.3/sbin/gfmd(protocol_main+0x20) [0x8059160]
Mar 19 05:05:40 garnet gfmd[9052]: <info> [1003405] (./backtrace.c:26 gfarm_log_backtrace_symbols()) backtrace symbols [11/13]: /usr/local/gfarm-2.5.3/sbin/gfmd(thrpool_worker+0x95) [0x805a8f5]
Mar 19 05:05:40 garnet gfmd[9052]: <info> [1003405] (./backtrace.c:26 gfarm_log_backtrace_symbols()) backtrace symbols [12/13]: /lib/i386-linux-gnu/libpthread.so.0(+0x5e99) [0x1efe99]
Mar 19 05:05:40 garnet gfmd[9052]: <info> [1003405] (./backtrace.c:26 gfarm_log_backtrace_symbols()) backtrace symbols [13/13]: /lib/i386-linux-gnu/libc.so.6(clone+0x5e) [0xa0773e]
Mar 19 05:05:40 garnet gfsd[4397]: <err> [1002294] (tatebe@garnet.tatebe.net) gfmd protocol: compound_begin result error on GFM_PROTO_REPLICA_ADDED2: unexpected EOF
Mar 19 05:05:40 garnet gfsd[4397]: <err> [1003359] (tatebe@garnet.tatebe.net) gfmd protocol: GFM_PROTO_REPLICA_ADDED2 error on compound_put_fd_result: unexpected EOF
Mar 19 05:05:40 garnet gfsd[4397]: <notice> [1000451] (tatebe@garnet.tatebe.net) disconnected
Mar 19 05:05:40 garnet gfsd[6438]: <err> [1002386] back channel disconnected
Mar 19 05:05:40 garnet gfsd[6438]: <err> [1000550] connecting to gfmd at garnet.tatebe.net:601 failed, sleep 10 sec: connection refused

Related tickets:

Reported by: otatebe

Original Ticket: gfarm/tickets/372

gfarm-admin commented 12 years ago

Fixed by r6207 in the 2.5 branch and r6208 in trunk.
The struct replicating referenced by a file opening may already have been freed.
Client-initiated file replication does not need to use the file opening structure.
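
The hazard described above can be illustrated with a minimal sketch. It assumes hypothetical, simplified types and names (`replication_table`, `replication_start`, `replication_finish`, `replication_count`); none of them are taken from the gfmd source or from r6207/r6208. The idea is that each in-flight replication record is owned by one table and is looked up by inode number and generation when the replica-added notification arrives, so no file opening keeps a pointer that can dangle after the record is freed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified record; NOT the actual gfmd definition. */
struct replicating {
	uint64_t inum;            /* inode number being replicated */
	uint64_t gen;             /* generation captured when replication started */
	struct replicating *next; /* singly linked list of in-flight records */
};

/* Hypothetical registry that is the sole owner of every in-flight record. */
struct replication_table {
	struct replicating *head;
};

/* Register a new in-flight replication. Returns NULL if malloc(3) fails. */
static struct replicating *
replication_start(struct replication_table *t, uint64_t inum, uint64_t gen)
{
	struct replicating *r = malloc(sizeof(*r));

	if (r == NULL)
		return NULL;
	r->inum = inum;
	r->gen = gen;
	r->next = t->head;
	t->head = r;
	return r;
}

/*
 * Complete a replication identified by (inum, gen).
 * The record is found by key, unlinked, and freed exactly once.
 * Returns 1 if a record was freed, 0 if none matched (for example,
 * because it had already been completed).
 */
static int
replication_finish(struct replication_table *t, uint64_t inum, uint64_t gen)
{
	struct replicating **pp;

	for (pp = &t->head; *pp != NULL; pp = &(*pp)->next) {
		if ((*pp)->inum == inum && (*pp)->gen == gen) {
			struct replicating *r = *pp;

			*pp = r->next; /* unlink before freeing */
			free(r);
			return 1;
		}
	}
	return 0;
}

/* Count in-flight records; only used to inspect the table. */
static int
replication_count(const struct replication_table *t)
{
	const struct replicating *r;
	int n = 0;

	for (r = t->head; r != NULL; r = r->next)
		n++;
	return n;
}
```

In this sketch a second completion for the same (inum, gen) pair returns 0 instead of dereferencing freed memory, because the only reference to the record lives in the table.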

Original comment by: otatebe

gfarm-admin commented 12 years ago

This change had a minor bug. See #500 (an incomplete replica is left behind if malloc(3) fails).

Original comment by: n-soda

gfarm-admin commented 12 years ago

Replying to n-soda:

this change had a minor bug. see #500 - incomplete replica is left, if malloc(3) fails

That was my mistake.
#500 was introduced in r6235, which implements #142 (generic send queue for backchannel for gfmd), and does not exist on the 2.5 branch.
Sorry.

Original comment by: n-soda
