An attempt to reproduce #420 made gfmd crash.
Test 1

configuration

revision: r6460 on branches/2.5
hosts:
settings:
gfmd.conf
metadb_server_thread_pool_size 1
simultaneous_replication_receivers 2000
sockopt sndbuf=2048 *
sockopt sndbuf=2048 LISTENER
sockopt rcvbuf=2048 *
sockopt rcvbuf=2048 LISTENER
gfarm2.conf
sockopt sndbuf=2048 *
sockopt sndbuf=2048 LISTENER
sockopt rcvbuf=2048 *
sockopt rcvbuf=2048 LISTENER
reproduction procedure

reproduction rate is about 50% (in 7 tries).
1. mount a Gfarm filesystem by using the gfarm2fs command
2. run 300 dd processes at once on the mount point above:
   awk BEGIN'{for (i = 0; i < 100; i++) {cp_exec = sprintf("dd if=/dev/urandom of=mnt/dir1/hoge%05d.txt bs=10 count=1 &", i); system(cp_exec);} exit;}' &
   awk BEGIN'{for (i = 0; i < 100; i++) {cp_exec = sprintf("dd if=/dev/urandom of=mnt/dir2/hoge%05d.txt bs=10 count=1 &", i); system(cp_exec);} exit;}' &
   awk BEGIN'{for (i = 0; i < 100; i++) {cp_exec = sprintf("dd if=/dev/urandom of=mnt/dir3/hoge%05d.txt bs=10 count=1 &", i); system(cp_exec);} exit;}' &
3. when about 100 dd processes have finished, send a SIGSTOP signal to the back-channel gfsd process on each host.
4. when all dd processes have finished, run step 2 again.
5. repeat steps 2 and 4 (more than 3 times).
6. kill the socket-listening gfsd with the "service gfsd stop" command.
7. send SIGCONT to the gfsd processes that were stopped by SIGSTOP.
gdb back trace

(gdb) c
Continuing.
[New Thread 0x7fc778dfa700 (LWP 17349)]
[New Thread 0x7fc767fff700 (LWP 17350)]
[New Thread 0x7fc7675fe700 (LWP 17432)]
[New Thread 0x7fc766bfd700 (LWP 17433)]
[New Thread 0x7fc7661fc700 (LWP 17434)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fc778dfa700 (LWP 17349)]
0x0000003bb2e091c0 in pthread_mutex_lock () from /lib64/libpthread.so.0
(gdb) bt
#0  0x0000003bb2e091c0 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x00007fc790911049 in gfarm_mutex_lock (mutex=<value optimized out>, where=0x465120 "peer_replicated", what=0x465110 "replication") at ./thrsubr.c:24
#2  0x000000000041f535 in peer_replicated (peer=0x696e7261773c203a, host=0x119d970, ino=1025, gen=46, handle=-1, src_errcode=0, dst_errcode=55, size=-1) at peer.c:274
#3  0x00000000004403c4 in gfs_client_replication_request_request (closure=<value optimized out>) at back_channel.c:436
#4  0x0000000000414de3 in thrpool_worker (arg=0x1190340) at thrpool.c:142
#5  0x0000003bb2e077f1 in start_thread () from /lib64/libpthread.so.0
#6  0x0000003bb2ae5ccd in clone () from /lib64/libc.so.6
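Note: the peer argument in frame #2 (0x696e7261773c203a) does not look like a valid pointer; read as little-endian bytes it is the ASCII text ": <warni", i.e. a fragment of a syslog-style message. This suggests the memory the pointer was loaded from had already been freed and reused (for example as a message buffer) by the time the replication callback ran. A small stand-alone C snippet, not part of gfarm, just to check the decoding:

#include <stdio.h>
#include <stdint.h>

/* print the bytes of the bogus "peer" value in memory (little-endian) order */
int
main(void)
{
	uint64_t peer = 0x696e7261773c203a;	/* value from frame #2 above */
	int i;

	for (i = 0; i < 8; i++)
		putchar((int)((peer >> (8 * i)) & 0xff));
	putchar('\n');	/* prints ": <warni" */
	return (0);
}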
log

$ sudo tail -n 30 /var/log/messages
Aug 6 20:18:38 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn02, 1037:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug 6 20:18:38 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn03, 1037:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug 6 20:18:38 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn02, 1049:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug 6 20:18:38 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn03-2, 1049:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug 6 20:18:38 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn01, 1037:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug 6 20:18:39 mds01 gfmd[17103]: <err> [1002257] error at 1085:45 replication to fsn01-2: src=61 dst=0
Aug 6 20:18:39 mds01 gfmd[17103]: <err> [1002257] error at 1046:46 replication to fsn01-2: src=61 dst=0
Aug 6 20:18:39 mds01 gfmd[17103]: <err> [1002257] error at 1082:45 replication to fsn01-2: src=61 dst=0
Aug 6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn01, 1049:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug 6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn02, 1043:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug 6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn03-2, 1043:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug 6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn02, 1028:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug 6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn03, 1043:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug 6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn03-2, 1028:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug 6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn03, 1028:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug 6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn02, 1040:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug 6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn03-2, 1040:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug 6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn02, 1034:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug 6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn03, 1040:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug 6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn03-2, 1034:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug 6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn01, 1040:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug 6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn03, 1034:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug 6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn02, 1022:45): s=0 d=55 size:-1 maybe the connection had a problem?
Aug 6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn03-2, 1022:45): s=0 d=55 size:-1 maybe the connection had a problem?
Aug 6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn03, 1022:45): s=0 d=55 size:-1 maybe the connection had a problem?
Aug 6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn02, 1031:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug 6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn01, 1022:45): s=0 d=55 size:-1 maybe the connection had a problem?
Aug 6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn03-2, 1031:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug 6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn03, 1031:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug 6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn02, 1025:46): s=0 d=55 size:-1 maybe the connection had a problem?
Test 2

configuration

the software revision is the same as in Test 1

gfmd.conf
metadb_server_thread_pool_size 2
simultaneous_replication_receivers 2000
sockopt sndbuf=2048 *
sockopt sndbuf=2048 LISTENER
sockopt rcvbuf=2048 *
sockopt rcvbuf=2048 LISTENER
reproduction procedure

reproduction rate is about 100% (in 3 tries).
1. mount a Gfarm filesystem on 3 mount points by running the gfarm2fs command 3 times
2. run 300 dd processes at once on the mount points above:
   awk BEGIN'{for (i = 0; i < 100; i++) {cp_exec = sprintf("dd if=/dev/urandom of=mnt/dir1/hoge%05d.txt bs=10 count=1 &", i); system(cp_exec);} exit;}' &
   awk BEGIN'{for (i = 0; i < 100; i++) {cp_exec = sprintf("dd if=/dev/urandom of=mnt2/dir2/hoge%05d.txt bs=10 count=1 &", i); system(cp_exec);} exit;}' &
   awk BEGIN'{for (i = 0; i < 100; i++) {cp_exec = sprintf("dd if=/dev/urandom of=mnt3/dir3/hoge%05d.txt bs=10 count=1 &", i); system(cp_exec);} exit;}' &
3. when about 100 dd processes have finished, send a SIGSTOP signal to the back-channel gfsd process on 2 filesystem nodes.
4. when all dd processes have finished, run step 2 again.
5. repeat steps 2 and 4 (more than 3 times).
log

on a filesystem node (SIGSTOP was not sent to this node's back-channel gfsd):
Aug 7 14:52:10 fsn01 gfsd[794]: <err> [1002184] GFS_PROTO_REPLICATION_REQUEST: connecting to fsn01:13600: connection refused
Aug 7 14:52:11 fsn01 gfsd[794]: <err> [1002184] GFS_PROTO_REPLICATION_REQUEST: connecting to fsn01:13600: connection refused
Aug 7 14:52:12 fsn01 gfsd[794]: <err> [1002184] GFS_PROTO_REPLICATION_REQUEST: connecting to fsn01:13600: connection refused
Aug 7 14:52:14 fsn01 gfsd[794]: <err> [1002184] GFS_PROTO_REPLICATION_REQUEST: connecting to fsn01:13600: connection refused
Aug 7 14:52:15 fsn01 gfsd[794]: <err> [1002184] GFS_PROTO_REPLICATION_REQUEST: connecting to fsn01:13600: connection refused
Aug 7 14:55:15 fsn01 gfsd[968]: <err> [1000000] GFS_PROTO_REPLICATION_REQUEST: replica_recv 1134:57: no such file or directory
Aug 7 14:58:14 fsn01 gfsd[786]: <err> [1002304] back channel: gfmd is down
Aug 7 14:58:59 fsn01 gfsd[794]: <err> [1002184] GFS_PROTO_REPLICATION_REQUEST: connecting to fsn01:13600: connection refused
Aug 7 14:58:59 fsn01 gfsd[794]: <err> [1002184] GFS_PROTO_REPLICATION_REQUEST: connecting to fsn01:13600: connection refused
Aug 7 14:58:59 fsn01 gfsd[978]: <err> [1000000] GFS_PROTO_REPLICATION_REQUEST: replica_recv 1125:57: no such file or directory
Aug 7 14:59:38 fsn01 gfsd[794]: <err> [1002386] back channel disconnected
Aug 7 14:59:39 fsn01 gfsd[824]: <err> [1002294] (kazuki-h@cn01) gfmd protocol: compound_begin result error on GFS_PROTO_CLOSE: unexpected EOF
Aug 7 14:59:39 fsn01 gfsd[824]: <err> [1003338] (kazuki-h@cn01) GFS_PROTO_CLOSE compound_put_fd_result: unexpected EOF
Aug 7 14:59:39 fsn01 gfsd[824]: <err> [1003348] (kazuki-h@cn01) GFS_PROTO_CLOSE: gfmd may be failed over, try to reconnecting
Aug 7 14:59:39 fsn01 gfsd[794]: <err> [1000550] connecting to gfmd at mds01:20601 failed, sleep 10 sec: connection refused
Aug 7 14:59:39 fsn01 gfsd[824]: <err> [1003330] (kazuki-h@cn01) connecting to gfmd failed: connection refused
Aug 7 14:59:39 fsn01 gfsd[824]: <err> [1003349] (kazuki-h@cn01) GFS_PROTO_CLOSE: cannot reconnect to gfm server
Aug 7 14:59:39 fsn01 gfsd[824]: <notice> [1000451] (kazuki-h@cn01) disconnected
Aug 7 14:59:39 fsn01 gfsd[786]: <err> [1000550] connecting to gfmd at mds01:20601 failed, sleep 10 sec: protocol error
Aug 7 14:59:49 fsn01 gfsd[794]: <err> [1000550] connecting to gfmd at mds01:20601 failed, sleep 20 sec: connection refused
Aug 7 14:59:49 fsn01 gfsd[786]: <err> [1000550] connecting to gfmd at mds01:20601 failed, sleep 20 sec: connection refused
on the metadata server:
Aug 7 14:59:38 mds01 gfmd[31041]: <err> [1002257] error at 1182:60 replication to fsn03-2: src=0 dst=55
Aug 7 14:59:38 mds01 gfmd[31041]: <info> [1002433] cannot remove an incomplete replica (fsn03-2, 1182:60): probably already removed
Aug 7 14:59:38 mds01 gfmd[31041]: <err> [1002257] error at 1079:34 replication to fsn03-2: src=0 dst=55
Aug 7 14:59:38 mds01 gfmd[31041]: <info> [1002433] cannot remove an incomplete replica (fsn03-2, 1079:34): probably already removed
Aug 7 14:59:38 mds01 gfmd[31041]: <notice> [1000286] (_gfarmfs@fsn03-2) disconnected
Aug 7 14:59:38 mds01 gfmd[31041]: <err> [1002257] error at 1164:59 replication to fsn01-2: src=61 dst=0
Aug 7 14:59:38 mds01 gfmd[31041]: <info> [1002433] cannot remove an incomplete replica (fsn01-2, 1164:59): probably already removed
Aug 7 14:59:38 mds01 gfmd[31041]: <warning> [1002440] back_channel(fsn01): switching to new connection
Aug 7 14:59:38 mds01 gfmd[31041]: <notice> [1000286] (kazuki-h@cn01) disconnected
Aug 7 14:59:38 mds01 gfmd[31041]: <err> [1002782] back_channel(fsn01): aborted: unexpected peer switch
Aug 7 14:59:38 mds01 gfmd[31041]: <info> [1002433] cannot remove an incomplete replica (fsn01-2, 1125:57): probably already removed
Aug 7 14:59:38 mds01 gfmd[31041]: <err> [1002410] orphan replication (fsn03-2, 1182:60): s=0 d=55 size:-1 maybe the connection had a problem?
Aug 7 14:59:38 mds01 gfmd[31041]: <err> [1002410] orphan replication (fsn03-2, 1134:57): s=0 d=55 size:-1 maybe the connection had a problem?
Aug 7 14:59:38 mds01 kernel: gfmd[31106]: segfault at 119 ip 0000003bb2e091c0 sp 00007faa76bfcc38 error 4 in libpthread-2.12.so[3bb2e00000+17000]
Test 3

gdb back trace

Program terminated with signal 11, Segmentation fault.
#0  peer_get_auth_id_type (peer=0xa00000000) at peer.c:956
956     {
(gdb) bt
#0  peer_get_auth_id_type (peer=0xa00000000) at peer.c:956
#1  0x0000000000426db0 in back_channel_type_name (peer=0xa00000000) at abstract_host.c:35
#2  0x000000000042774e in gfm_client_channel_vsend_request (host=0x90b120, peer0=0xa00000000, diag=0x4737a0 "GFS_PROTO_REPLICATION_REQUEST request", result_callback=0x441970 <gfs_client_replication_request_result>, disconnect_callback=0x4418a0 <gfs_client_replication_request_free>, closure=0x7fb378017880, host_set_callback=0x41c0d0 <host_set_callback>, command=22, format=0x47350f "sill", app=0x7fb374df9c00) at abstract_host.c:782
#3  0x000000000044078c in gfs_client_send_request (host=<value optimized out>, peer0=0xa00000000, diag=0x4737a0 "GFS_PROTO_REPLICATION_REQUEST request", result_callback=0x441970 <gfs_client_replication_request_result>, disconnect_callback=<value optimized out>, closure=<value optimized out>, command=22, format=0x47350f "sill") at back_channel.c:187
#4  0x0000000000441704 in gfs_client_replication_request_request (closure=<value optimized out>) at back_channel.c:413
#5  0x0000000000415dc3 in thrpool_worker (arg=0x8f6d20) at thrpool.c:142
#6  0x00007fb3ab9b77e1 in start_thread () from /lib64/libpthread.so.0
#7  0x00007fb3ab7128ed in clone () from /lib64/libc.so.6
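In both back traces the crash happens in a replication-request callback running on a thread-pool worker (gfs_client_replication_request_request called from thrpool_worker) while it dereferences a peer pointer that is no longer valid (0x696e7261773c203a in Test 1, 0xa00000000 here), and the gfmd log above shows the back channel being disconnected and switched to a new connection at the same time. This looks consistent with a queued work item keeping a raw pointer to a peer that another thread frees or reuses when the back channel goes away. The following is only an illustration of that lifetime problem and of one common guard (reference counting); it is not the gfmd code, and the names peer_addref/peer_release are hypothetical:

#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Minimal sketch of a reference-counted peer, showing how a work item
 * queued to a thread pool can keep the peer alive until its callback
 * has finished.  Structure and names are hypothetical.
 */
struct peer {
	pthread_mutex_t mutex;
	int refcount;
	int connected;		/* cleared when the back channel goes away */
};

static struct peer *
peer_new(void)
{
	struct peer *p = malloc(sizeof(*p));

	assert(p != NULL);
	pthread_mutex_init(&p->mutex, NULL);
	p->refcount = 1;	/* reference held by the connection itself */
	p->connected = 1;
	return (p);
}

static void
peer_addref(struct peer *p)
{
	pthread_mutex_lock(&p->mutex);
	p->refcount++;
	pthread_mutex_unlock(&p->mutex);
}

static void
peer_release(struct peer *p)
{
	int remaining;

	pthread_mutex_lock(&p->mutex);
	remaining = --p->refcount;
	pthread_mutex_unlock(&p->mutex);
	if (remaining == 0) {	/* no user is left: safe to free now */
		pthread_mutex_destroy(&p->mutex);
		free(p);
	}
}

/* work item queued to the thread pool for one replication request */
struct repl_request {
	struct peer *peer;	/* counted reference, taken at queue time */
};

static void *
replication_request_worker(void *arg)
{
	struct repl_request *rr = arg;

	pthread_mutex_lock(&rr->peer->mutex);
	if (rr->peer->connected)
		printf("sending replication request\n");
	else
		printf("peer already disconnected, dropping the request\n");
	pthread_mutex_unlock(&rr->peer->mutex);
	peer_release(rr->peer);	/* may free the peer, but only here */
	free(rr);
	return (NULL);
}

int
main(void)
{
	struct peer *p = peer_new();
	struct repl_request *rr = malloc(sizeof(*rr));
	pthread_t worker;

	assert(rr != NULL);
	rr->peer = p;
	peer_addref(p);		/* keep the peer alive for the worker */
	pthread_create(&worker, NULL, replication_request_worker, rr);

	/* meanwhile the back channel disconnects: drop the connection's reference */
	pthread_mutex_lock(&p->mutex);
	p->connected = 0;
	pthread_mutex_unlock(&p->mutex);
	peer_release(p);

	pthread_join(worker, NULL);
	return (0);
}

Whether the actual fix in r6534/r6537 takes this form is not claimed here; the sketch only shows why a raw peer pointer inside a queued closure becomes unsafe once the back channel can be torn down or replaced concurrently.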
Reported by: kazuki-h
Original Ticket: gfarm/tickets/426
fix an editing error in the description field
Original comment by: n-soda
fixed in r6534 on the 2.5 release branch, and in r6537 on the main trunk.
r6534 on the 2.5 branch introduced one of the causes of #408 (an incomplete replica is left); that regression is fixed in r6672.