oss-tsukuba / gfarm

Distributed file system for large-scale cluster computing and wide-area data sharing. Provides fine-grained replica location control.

race condition about peer_free() makes gfmd crash #426

Closed gfarm-admin closed 4 years ago

gfarm-admin commented 12 years ago

An attempt to reproduce #420 made gfmd crash.

Test 1

configuration

revision: r6460 on branches/2.5

hosts:

settings:

reproduction procedure

The reproduction rate is about 50% (7 tries).

  1. mount a Gfarm filesystem by using the gfarm2fs command

  2. run 300 dd processes at once on the mount point above:

    awk 'BEGIN {for (i = 0; i < 100; i++) {cp_exec = sprintf("dd if=/dev/urandom of=mnt/dir1/hoge%05d.txt bs=10 count=1 &", i); system(cp_exec);} exit;}' &
    awk 'BEGIN {for (i = 0; i < 100; i++) {cp_exec = sprintf("dd if=/dev/urandom of=mnt/dir2/hoge%05d.txt bs=10 count=1 &", i); system(cp_exec);} exit;}' &
    awk 'BEGIN {for (i = 0; i < 100; i++) {cp_exec = sprintf("dd if=/dev/urandom of=mnt/dir3/hoge%05d.txt bs=10 count=1 &", i); system(cp_exec);} exit;}' &
  3. when about 100 of the dd processes have finished, send a SIGSTOP signal to the back channel processing gfsd on each host.

  4. when all dd processes have finished, run step 2 again.

  5. repeat steps 2 and 4 again and again (more than 3 times)

  6. kill the socket-listening gfsd with the "service gfsd stop" command

  7. send SIGCONT to the gfsd processes that were stopped by SIGSTOP.

gdb backtrace

(gdb) c
Continuing.
[New Thread 0x7fc778dfa700 (LWP 17349)]
[New Thread 0x7fc767fff700 (LWP 17350)]
[New Thread 0x7fc7675fe700 (LWP 17432)]
[New Thread 0x7fc766bfd700 (LWP 17433)]
[New Thread 0x7fc7661fc700 (LWP 17434)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fc778dfa700 (LWP 17349)]
0x0000003bb2e091c0 in pthread_mutex_lock () from /lib64/libpthread.so.0
(gdb) bt
#0  0x0000003bb2e091c0 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x00007fc790911049 in gfarm_mutex_lock (mutex=<value optimized out>, where=0x465120 "peer_replicated", what=0x465110 "replication") at ./thrsubr.c:24
#2  0x000000000041f535 in peer_replicated (peer=0x696e7261773c203a, host=0x119d970, ino=1025, gen=46, handle=-1, src_errcode=0, dst_errcode=55, size=-1) at peer.c:274
#3  0x00000000004403c4 in gfs_client_replication_request_request (closure=<value optimized out>) at back_channel.c:436
#4  0x0000000000414de3 in thrpool_worker (arg=0x1190340) at thrpool.c:142
#5  0x0000003bb2e077f1 in start_thread () from /lib64/libpthread.so.0
#6  0x0000003bb2ae5ccd in clone () from /lib64/libc.so.6

log

$ sudo tail -n 30 /var/log/messages
Aug  6 20:18:38 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn02, 1037:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug  6 20:18:38 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn03, 1037:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug  6 20:18:38 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn02, 1049:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug  6 20:18:38 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn03-2, 1049:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug  6 20:18:38 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn01, 1037:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug  6 20:18:39 mds01 gfmd[17103]: <err> [1002257] error at 1085:45 replication to fsn01-2: src=61 dst=0
Aug  6 20:18:39 mds01 gfmd[17103]: <err> [1002257] error at 1046:46 replication to fsn01-2: src=61 dst=0
Aug  6 20:18:39 mds01 gfmd[17103]: <err> [1002257] error at 1082:45 replication to fsn01-2: src=61 dst=0
Aug  6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn01, 1049:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug  6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn02, 1043:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug  6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn03-2, 1043:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug  6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn02, 1028:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug  6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn03, 1043:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug  6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn03-2, 1028:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug  6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn03, 1028:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug  6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn02, 1040:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug  6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn03-2, 1040:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug  6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn02, 1034:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug  6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn03, 1040:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug  6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn03-2, 1034:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug  6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn01, 1040:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug  6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn03, 1034:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug  6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn02, 1022:45): s=0 d=55 size:-1 maybe the connection had a problem?
Aug  6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn03-2, 1022:45): s=0 d=55 size:-1 maybe the connection had a problem?
Aug  6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn03, 1022:45): s=0 d=55 size:-1 maybe the connection had a problem?
Aug  6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn02, 1031:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug  6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn01, 1022:45): s=0 d=55 size:-1 maybe the connection had a problem?
Aug  6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn03-2, 1031:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug  6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn03, 1031:46): s=0 d=55 size:-1 maybe the connection had a problem?
Aug  6 20:18:40 mds01 gfmd[17103]: <err> [1002410] orphan replication (fsn02, 1025:46): s=0 d=55 size:-1 maybe the connection had a problem?

Test 2

configuration

The software revision is the same as in Test 1.

hosts:

settings:

reproduction procedure

The reproduction rate is about 100% (3 tries).

  1. mount a Gfarm filesystem on 3 mount points by running the gfarm2fs command 3 times

  2. run 300 dd processes at once on the mount points above:

    awk 'BEGIN {for (i = 0; i < 100; i++) {cp_exec = sprintf("dd if=/dev/urandom of=mnt/dir1/hoge%05d.txt bs=10 count=1 &", i); system(cp_exec);} exit;}' &
    awk 'BEGIN {for (i = 0; i < 100; i++) {cp_exec = sprintf("dd if=/dev/urandom of=mnt2/dir2/hoge%05d.txt bs=10 count=1 &", i); system(cp_exec);} exit;}' &
    awk 'BEGIN {for (i = 0; i < 100; i++) {cp_exec = sprintf("dd if=/dev/urandom of=mnt3/dir3/hoge%05d.txt bs=10 count=1 &", i); system(cp_exec);} exit;}' &
  3. when about 100 of the dd processes have finished, send a SIGSTOP signal to the back channel processing gfsd on 2 filesystem nodes.

  4. when all dd processes have finished, run step 2 again.

  5. repeat steps 2 and 4 again and again (more than 3 times)

log

On a filesystem node (SIGSTOP was not sent to this node's back channel gfsd):

Aug  7 14:52:10 fsn01 gfsd[794]: <err> [1002184] GFS_PROTO_REPLICATION_REQUEST: connecting to fsn01:13600: connection refused
Aug  7 14:52:11 fsn01 gfsd[794]: <err> [1002184] GFS_PROTO_REPLICATION_REQUEST: connecting to fsn01:13600: connection refused
Aug  7 14:52:12 fsn01 gfsd[794]: <err> [1002184] GFS_PROTO_REPLICATION_REQUEST: connecting to fsn01:13600: connection refused
Aug  7 14:52:14 fsn01 gfsd[794]: <err> [1002184] GFS_PROTO_REPLICATION_REQUEST: connecting to fsn01:13600: connection refused
Aug  7 14:52:15 fsn01 gfsd[794]: <err> [1002184] GFS_PROTO_REPLICATION_REQUEST: connecting to fsn01:13600: connection refused
Aug  7 14:55:15 fsn01 gfsd[968]: <err> [1000000] GFS_PROTO_REPLICATION_REQUEST: replica_recv 1134:57: no such file or directory
Aug  7 14:58:14 fsn01 gfsd[786]: <err> [1002304] back channel: gfmd is down
Aug  7 14:58:59 fsn01 gfsd[794]: <err> [1002184] GFS_PROTO_REPLICATION_REQUEST: connecting to fsn01:13600: connection refused
Aug  7 14:58:59 fsn01 gfsd[794]: <err> [1002184] GFS_PROTO_REPLICATION_REQUEST: connecting to fsn01:13600: connection refused
Aug  7 14:58:59 fsn01 gfsd[978]: <err> [1000000] GFS_PROTO_REPLICATION_REQUEST: replica_recv 1125:57: no such file or directory
Aug  7 14:59:38 fsn01 gfsd[794]: <err> [1002386] back channel disconnected
Aug  7 14:59:39 fsn01 gfsd[824]: <err> [1002294] (kazuki-h@cn01) gfmd protocol: compound_begin result error on GFS_PROTO_CLOSE: unexpected EOF
Aug  7 14:59:39 fsn01 gfsd[824]: <err> [1003338] (kazuki-h@cn01) GFS_PROTO_CLOSE compound_put_fd_result: unexpected EOF
Aug  7 14:59:39 fsn01 gfsd[824]: <err> [1003348] (kazuki-h@cn01) GFS_PROTO_CLOSE: gfmd may be failed over, try to reconnecting
Aug  7 14:59:39 fsn01 gfsd[794]: <err> [1000550] connecting to gfmd at mds01:20601 failed, sleep 10 sec: connection refused
Aug  7 14:59:39 fsn01 gfsd[824]: <err> [1003330] (kazuki-h@cn01) connecting to gfmd failed: connection refused
Aug  7 14:59:39 fsn01 gfsd[824]: <err> [1003349] (kazuki-h@cn01) GFS_PROTO_CLOSE: cannot reconnect to gfm server
Aug  7 14:59:39 fsn01 gfsd[824]: <notice> [1000451] (kazuki-h@cn01) disconnected
Aug  7 14:59:39 fsn01 gfsd[786]: <err> [1000550] connecting to gfmd at mds01:20601 failed, sleep 10 sec: protocol error
Aug  7 14:59:49 fsn01 gfsd[794]: <err> [1000550] connecting to gfmd at mds01:20601 failed, sleep 20 sec: connection refused
Aug  7 14:59:49 fsn01 gfsd[786]: <err> [1000550] connecting to gfmd at mds01:20601 failed, sleep 20 sec: connection refused

On the metadata server:

Aug  7 14:59:38 mds01 gfmd[31041]: <err> [1002257] error at 1182:60 replication to fsn03-2: src=0 dst=55
Aug  7 14:59:38 mds01 gfmd[31041]: <info> [1002433] cannot remove an incomplete replica (fsn03-2, 1182:60): probably already removed
Aug  7 14:59:38 mds01 gfmd[31041]: <err> [1002257] error at 1079:34 replication to fsn03-2: src=0 dst=55
Aug  7 14:59:38 mds01 gfmd[31041]: <info> [1002433] cannot remove an incomplete replica (fsn03-2, 1079:34): probably already removed
Aug  7 14:59:38 mds01 gfmd[31041]: <notice> [1000286] (_gfarmfs@fsn03-2) disconnected
Aug  7 14:59:38 mds01 gfmd[31041]: <err> [1002257] error at 1164:59 replication to fsn01-2: src=61 dst=0
Aug  7 14:59:38 mds01 gfmd[31041]: <info> [1002433] cannot remove an incomplete replica (fsn01-2, 1164:59): probably already removed
Aug  7 14:59:38 mds01 gfmd[31041]: <warning> [1002440] back_channel(fsn01): switching to new connection
Aug  7 14:59:38 mds01 gfmd[31041]: <notice> [1000286] (kazuki-h@cn01) disconnected
Aug  7 14:59:38 mds01 gfmd[31041]: <err> [1002782] back_channel(fsn01): aborted: unexpected peer switch
Aug  7 14:59:38 mds01 gfmd[31041]: <info> [1002433] cannot remove an incomplete replica (fsn01-2, 1125:57): probably already removed
Aug  7 14:59:38 mds01 gfmd[31041]: <err> [1002410] orphan replication (fsn03-2, 1182:60): s=0 d=55 size:-1 maybe the connection had a problem?
Aug  7 14:59:38 mds01 gfmd[31041]: <err> [1002410] orphan replication (fsn03-2, 1134:57): s=0 d=55 size:-1 maybe the connection had a problem?
Aug  7 14:59:38 mds01 kernel: gfmd[31106]: segfault at 119 ip 0000003bb2e091c0 sp 00007faa76bfcc38 error 4 in libpthread-2.12.so[3bb2e00000+17000]

Test 3

gdb backtrace

Program terminated with signal 11, Segmentation fault.
#0  peer_get_auth_id_type (peer=0xa00000000) at peer.c:956
956     {
(gdb) bt
#0  peer_get_auth_id_type (peer=0xa00000000) at peer.c:956
#1  0x0000000000426db0 in back_channel_type_name (peer=0xa00000000)
    at abstract_host.c:35
#2  0x000000000042774e in gfm_client_channel_vsend_request (host=0x90b120, 
    peer0=0xa00000000, diag=0x4737a0 "GFS_PROTO_REPLICATION_REQUEST request", 
    result_callback=0x441970 <gfs_client_replication_request_result>, 
    disconnect_callback=0x4418a0 <gfs_client_replication_request_free>, 
    closure=0x7fb378017880, host_set_callback=0x41c0d0 <host_set_callback>, 
    command=22, format=0x47350f "sill", app=0x7fb374df9c00)
    at abstract_host.c:782
#3  0x000000000044078c in gfs_client_send_request (host=<value optimized out>, 
    peer0=0xa00000000, diag=0x4737a0 "GFS_PROTO_REPLICATION_REQUEST request", 
    result_callback=0x441970 <gfs_client_replication_request_result>, 
    disconnect_callback=<value optimized out>, closure=<value optimized out>, 
    command=22, format=0x47350f "sill") at back_channel.c:187
#4  0x0000000000441704 in gfs_client_replication_request_request (
    closure=<value optimized out>) at back_channel.c:413
#5  0x0000000000415dc3 in thrpool_worker (arg=0x8f6d20) at thrpool.c:142
#6  0x00007fb3ab9b77e1 in start_thread () from /lib64/libpthread.so.0
#7  0x00007fb3ab7128ed in clone () from /lib64/libc.so.6

Reported by: kazuki-h

Original Ticket: gfarm/tickets/426

gfarm-admin commented 12 years ago

fix an editing error in the description field

Original comment by: n-soda

gfarm-admin commented 12 years ago

Original comment by: n-soda

gfarm-admin commented 12 years ago

fixed in r6534 on the 2.5 release branch,
and in r6537 on the main trunk.

Original comment by: n-soda

gfarm-admin commented 12 years ago

r6534 on the 2.5 branch introduced one of the causes of #408 (an incomplete replica is left behind),
which was fixed in r6672.

Original comment by: n-soda