oss-tsukuba / gfarm

distributed file system for large-scale cluster computing and wide-area data sharing. provides fine-grained replica location control.

incomplete replica is left (meta ticket) #408

Closed gfarm-admin closed 4 years ago

gfarm-admin commented 12 years ago

This is a meta ticket for "incomplete replica is left". Possible causes are:


Same Ticket on Different Branches


Original Condition to Reproduce this problem

  1. set xattr "gfarm.ncopy" to 3
  2. invoke one gfarm2fs process
  3. sequentially create 1000 1GB-files under the gfarm2fs
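Steps 2 and 3 above can be approximated with a small helper that writes files one after another into a directory. This is only a sketch: `create_files()` and the `test_file.<n>` naming are assumptions modelled on the file name shown in this ticket, the target directory is assumed to be a gfarm2fs mount point, and step 1 (setting "gfarm.ncopy") is assumed to have been done beforehand.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Sequentially create `count` zero-filled files named test_file.<n>
 * under `dir`, each `size` bytes long.
 * Returns the number of files that were completely written.
 */
static int
create_files(const char *dir, int count, long long size)
{
    static char buf[1 << 20];   /* 1 MiB of zeros */
    char path[4096];
    int i, created = 0;

    assert(dir != NULL);
    for (i = 0; i < count; i++) {
        long long left = size;
        int ok = 1;
        FILE *fp;

        snprintf(path, sizeof(path), "%s/test_file.%d", dir, i);
        if ((fp = fopen(path, "wb")) == NULL)
            break;
        while (left > 0) {
            size_t n = left < (long long)sizeof(buf) ?
                (size_t)left : sizeof(buf);

            if (fwrite(buf, 1, n, fp) != n) {
                ok = 0;
                break;
            }
            left -= (long long)n;
        }
        /* fclose() flushes, so a failure here also means data loss */
        if (fclose(fp) != 0)
            ok = 0;
        if (!ok)
            break;
        created++;
    }
    return (created);
}
```

For the condition described here, the call would be along the lines of `create_files(mountpoint, 1000, 1LL << 30)`, where `mountpoint` is a placeholder for the gfarm2fs mount directory.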

Some replications do not complete, as follows:

% gfwhere -al test_file.931:
    rgfsd007            26  - - -
    univ-gfsd402            26  - i -
    ossgfsd29           26  - - -
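To check many files for this symptom, lines like the ones above can be tested mechanically. The sketch below assumes the layout seen in this ticket (host name, a numeric column, then three flag columns) and assumes that an "i" in the middle flag column marks an incomplete replica; `replica_line_is_incomplete()` is a hypothetical helper, not part of gfarm.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Return 1 if `line` looks like a long-format gfwhere replica line
 * whose middle flag column is "i", and 0 otherwise.
 * Assumed layout (taken from the output quoted in this ticket):
 *     <host> <number> <flag> <flag> <flag>
 */
static int
replica_line_is_incomplete(const char *line)
{
    char host[256], f1[8], f2[8], f3[8];
    unsigned long long num;

    assert(line != NULL);
    if (sscanf(line, " %255s %llu %7s %7s %7s",
        host, &num, f1, f2, f3) != 5)
        return (0);     /* not a replica line, e.g. "name:" */
    return (strcmp(f2, "i") == 0);
}
```

Applied to the three replica lines above, only the univ-gfsd402 line would be reported.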

configuration:

Reported by: n-soda

Original Ticket: gfarm/tickets/408

gfarm-admin commented 12 years ago

one possible cause is fixed in r6567 on the main trunk and r6568 on the 2.5 release branch,
but we are sure that it is not the real cause of this ticket,
because the following message was not logged in the affected environment:

<err> [1002260] removing old replica <inum>:<igen> host <host>: no memory

Original comment by: n-soda

gfarm-admin commented 12 years ago

Kiyoshi Imai-san found that the reference count of struct peer is not correctly maintained,
i.e. the value is too high.
As a result, the RPC-disconnect-callback is never called.
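Why a too-high reference count suppresses the callback can be illustrated with a minimal model. Everything below (`struct peer_sketch`, the `peer_sketch_*` functions, `count_disconnect()`) is hypothetical and much simpler than gfmd's real struct peer; it only assumes that the disconnect callback runs once the connection is closed and the last reference is dropped.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for gfmd's struct peer; not the real layout. */
struct peer_sketch {
    int refcount;
    int closing;    /* the connection has been closed */
    void (*on_disconnect)(struct peer_sketch *);
};

/* Example callback that only counts how often it was invoked. */
static int disconnect_calls;

static void
count_disconnect(struct peer_sketch *p)
{
    (void)p;
    disconnect_calls++;
}

static void
peer_sketch_add_ref(struct peer_sketch *p)
{
    assert(p != NULL);
    p->refcount++;
}

/* Drop one reference; the last drop on a closed peer runs the callback. */
static void
peer_sketch_del_ref(struct peer_sketch *p)
{
    assert(p != NULL && p->refcount > 0);
    if (--p->refcount == 0 && p->closing && p->on_disconnect != NULL)
        p->on_disconnect(p);
}

/* Mark the peer closed; the callback runs now only if no refs remain. */
static void
peer_sketch_close(struct peer_sketch *p)
{
    assert(p != NULL);
    p->closing = 1;
    if (p->refcount == 0 && p->on_disconnect != NULL)
        p->on_disconnect(p);
}
```

In this model, a code path that calls `peer_sketch_add_ref()` without a matching `peer_sketch_del_ref()` leaves the count above zero forever, so `on_disconnect` never fires after `peer_sketch_close()`.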

Original comment by: n-soda

gfarm-admin commented 12 years ago

Originally posted by: in reply to: #2; n-soda

Replying to n-soda:

Kiyoshi Imai-san found that reference count of struct peer is not correctly maintained.
i.e. the value is too high.

There are two places where struct peer leaks, found by Kiyoshi Imai-san.

  1. peer != peer0 case of gfm_server_channel_vput_reply() on the 2.5 branch and async_server_vput_wrapped_reply() on the main trunk,
  2. host_replicating_new() on the 2.5 branch since r6534 (gfarm-2.5.6). the main trunk doesn't have this problem.

Fixing problem 2 is somewhat hard, because it's a chicken-or-egg problem.

Original comment by: *anonymous at SourceForge

gfarm-admin commented 12 years ago

Replying to n-soda:

1. peer != peer0 case of gfm_server_channel_vput_reply() on the 2.5 branch and async_server_vput_wrapped_reply() on the main trunk,

The problem 1 is fixed in r6664 on the 2.5 branch, and r6665 on the main trunk.
Thus, the main trunk may not have this issue anymore.

2. host_replicating_new() on the 2.5 branch since r6534 (gfarm-2.5.6). the main trunk doesn't have this problem.

Fixing problem 2 is somewhat hard, because it's a chicken-or-egg problem.

This is not fixed yet, thus, 2.5 branch still has the issue.

Original comment by: n-soda

gfarm-admin commented 12 years ago

problem 2 is fixed in r6670 on the 2.5 release branch, and in r6671 on the main trunk.

Original comment by: n-soda

gfarm-admin commented 12 years ago

another missing peer_del_ref() was found in host_replicating_new().
this problem was introduced by the fix for #426 - race condition about peer_free() makes gfmd crash,
and is resolved in r6672.

Original comment by: n-soda

gfarm-admin commented 12 years ago

the fix for problem 2 (r6670 and r6671) was wrong,
because the async-RPC-cleanup routine does DB access, but the code doesn't acquire the giant lock
(the code before the change had acquired the giant lock before calling peer_free()).

problem found by Kiyoshi Imai-san

Original comment by: n-soda

gfarm-admin commented 12 years ago

Replying to n-soda:

the fix for problem 2 (r6670 and r6671) was wrong,

r6670 (2.5) and r6671 (trunk) are backed out in r6690 (2.5) and r6687 (trunk)

Original comment by: n-soda

gfarm-admin commented 12 years ago

part of the changes in r6670 (2.5) and r6671 (trunk) is re-applied,
see #454 - theoretical race condition about peer_free()

Original comment by: n-soda

gfarm-admin commented 12 years ago

the problem 2 is fixed in r6694 on the 2.5 release branch.

NOTE:
We don't have a plan to apply this change to the main trunk,
because the main trunk doesn't have the problem 2,
and this change is somewhat ugly.

Original comment by: n-soda

gfarm-admin commented 12 years ago

Replying to n-soda:

the fix for problem 2 (r6670 and r6671) was wrong,
because the async-RPC-cleanup routine does DB access, but the code doesn't acquire the giant lock

not only that, it has a race condition: peer->async may be cleared while abstract_host_sender_lock() or abstract_host_receiver_lock() is held.
the async-RPC-cleanup routine has to be called after peer->refcount becomes 0 to prevent this race condition.
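The two constraints mentioned in this thread (cleanup only after the reference count reaches 0, and cleanup only with the giant lock held) can be sketched together as follows. All names here are hypothetical stand-ins, the per-peer mutex is an assumption of the sketch, and the real gfmd code and its lock ordering differ.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for gfmd's giant lock. */
static pthread_mutex_t giant_lock_sketch = PTHREAD_MUTEX_INITIALIZER;

struct async_peer_sketch {
    pthread_mutex_t mutex;  /* protects refcount (assumption) */
    int refcount;
    void *async;            /* per-connection async RPC state */
    int cleaned_up;
};

/* Stand-in for the async-RPC cleanup routine that does DB access. */
static void
async_cleanup_sketch(struct async_peer_sketch *p)
{
    p->async = NULL;
    p->cleaned_up = 1;
}

/*
 * Drop one reference. Only the caller that brings the count to 0
 * runs the cleanup, and it does so while holding the giant lock,
 * so no other reference holder can still be using p->async.
 * Returns 1 if this call performed the cleanup, 0 otherwise.
 */
static int
async_peer_sketch_del_ref(struct async_peer_sketch *p)
{
    int last;

    assert(p != NULL);
    pthread_mutex_lock(&p->mutex);
    assert(p->refcount > 0);
    last = (--p->refcount == 0);
    pthread_mutex_unlock(&p->mutex);
    if (!last)
        return (0);

    pthread_mutex_lock(&giant_lock_sketch);
    async_cleanup_sketch(p);
    pthread_mutex_unlock(&giant_lock_sketch);
    return (1);
}
```

The point of the design is that `p->async` stays valid for as long as any holder still has a reference, which is the property the backed-out change lacked.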

Original comment by: n-soda

gfarm-admin commented 12 years ago

r6694 had a bug (missing initialization), and that's fixed in r6695

Original comment by: n-soda

gfarm-admin commented 12 years ago

the following symptom is still observed. From Kiyoshi Imai-san:

$ gfwhere -arl /|grep -- '- i'
        dhcp-167-248         284      - i -
        dhcp-167-248         224      - i -
        dhcp-167-248         294      - i -
        dhcp-167-248         255      - i -
        dhcp-167-248         185      - i -
        dhcp-167-248         187      - i -
        dhcp-167-248        1164      - i -

(gdb) p *peer_closing_queue
Structure has no component named operator*.
(gdb) p peer_closing_queue
$7 = {mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 1, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
    __size = '\000' <repeats 12 times>, "\001", '\000' <repeats 26 times>, __align = 0}, ready_to_close = {__data = {__lock = 0, __futex = 4707107,
      __total_seq = 2353554, __wakeup_seq = 2353553, __woken_seq = 2353553, __mutex = 0x693480, __nwaiters = 2, __broadcast_seq = 0},
    __size = "\000\000\000\000#\323G\000\222\351#\000\000\000\000\000\221\351#\000\000\000\000\000\221\351#\000\000\000\000\000\200\064i\000\000\000\000\000\002\000\000\000\000\000\000", __align = 20216870623772672}, head = 0x7ffead400350, tail = 0x7ffead400350}
(gdb) p *peer_closing_queue->head
$8 = {next_close = 0x0, refcount = 0, replication_refcount = 1, conn = 0x11b5e60, async = 0x0, id_type = GFARM_AUTH_ID_TYPE_SPOOL_HOST,
  username = 0x7ffea0000d00 "_gfarmfs", hostname = 0x0, user = 0x0, host = 0xfa92f0, process = 0x0, protocol_error = 1, protocol_error_mutex = {__data = {__lock = 0,
      __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0},
  watcher = 0xfa8770, readable_event = 0x11bdfb0, pstate = {nesting_level = 0, cs = {current_part = 0, cause = 0, skip = 0}}, fd_current = -1, fd_saved = -1,
  flags = 0, findxmlattrctx = 0x0, pending_new_generation = 0x0, u = {client = {jobs = 0x0}}, replication_mutex = {__data = {__lock = 0, __count = 0, __owner = 0,
      __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0},
  simultaneous_replication_receivers = 1, replicating_inodes = {prev_inode = 0x7ffe94012750, next_inode = 0x7ffe94012750, prev_host = 0x0, next_host = 0x0,
    peer = 0x0, dst = 0x0, handle = 0, inode = 0x0, igen = 0, cleanup = 0x0}, cookies = {id = 0, hcircleq = {hcqe_next = 0x7ffead400498, hcqe_prev = 0x7ffead400498}}}

Original comment by: n-soda

gfarm-admin commented 12 years ago

2.5 branch should be fixed in r6799.
Thanks Kiyoshi Imai-san for finding the cause!

Original comment by: n-soda

gfarm-admin commented 12 years ago

Replying to n-soda:

2.5 branch should be fixed in r6799.
Thanks Kiyoshi Imai-san for finding the cause!

Imai-san tested r6799, and although it improved the situation,
some incomplete replicas still exist.
so, there must be another cause. ;-/

Original comment by: n-soda

gfarm-admin commented 12 years ago

r6799 contained a minor mistake in a diag message, and that's fixed in r6800.
(no functional change)

Original comment by: n-soda

gfarm-admin commented 12 years ago

start to convert this ticket to a meta ticket: split one of the problems to #476 as the first step

Original comment by: n-soda

gfarm-admin commented 12 years ago

the last cause (missing reply to GFS_PROTO_REPLICATION_REQUEST from gfsd to gfmd) is found and fixed by Kiyoshi Imai-san (Thanks!),
and that's committed as r6807 (trunk) and r6809 (2.5)
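The ticket does not say which code path skipped the reply, so the following is only a generic illustration of one way to avoid this class of bug: route every outcome of a request handler, including early failures, through a single reply call. The structure, the error codes and the function names are all hypothetical and unrelated to the real gfsd sources.

```c
#include <assert.h>
#include <stddef.h>

enum { SKETCH_OK = 0, SKETCH_ERR_NO_MEMORY = 1, SKETCH_ERR_BUSY = 2 };

/* Hypothetical request context; the counters exist for verification. */
struct request_sketch {
    int simulate_busy;
    int simulate_no_memory;
    int replies_sent;
    int last_reply;
};

/* Stand-in for sending the RPC reply back to the requester. */
static void
send_reply_sketch(struct request_sketch *req, int result)
{
    req->last_reply = result;
    req->replies_sent++;
}

/*
 * Handle one replication request. There is no early return:
 * every path falls through to the single reply at the end,
 * so the requester is never left waiting.
 */
static int
handle_replication_request_sketch(struct request_sketch *req)
{
    int result = SKETCH_OK;

    assert(req != NULL);
    if (req->simulate_busy)
        result = SKETCH_ERR_BUSY;
    else if (req->simulate_no_memory)
        result = SKETCH_ERR_NO_MEMORY;
    /* else: the replication itself would be started here */

    send_reply_sketch(req, result);
    return (result);
}
```

With this shape, a requester that tracks outstanding replications always learns the outcome and can clear or retry the pending replica instead of leaving it marked incomplete.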

Original comment by: n-soda

gfarm-admin commented 12 years ago

Replying to n-soda:

and that's committed as r6807 (trunk) and r6809 (2.5)

this had an error which was made by me,
and that's fixed in r6810 (trunk) and r6811 (2.5).
the error was found by Kiyoshi Imai-san.

Original comment by: n-soda

gfarm-admin commented 12 years ago

split into #478, #479, #480, #481 and #482 as well.

Original comment by: n-soda

gfarm-admin commented 12 years ago

another symptom is reported by tatebe@-san as #490 - incomplete replica is left, if resolving hostname of the source filesystem node fails at a client-initiated replication

Original comment by: n-soda

gfarm-admin commented 12 years ago

add #500 - incomplete replica is left, if malloc(3) fails

Original comment by: n-soda

gfarm-admin commented 12 years ago

Originally posted by: in reply to: #24; n-soda

Replying to n-soda:

add #500 - incomplete replica is left, if malloc(3) fails

fixed in r6898 (only exists on the main trunk)

Original comment by: *anonymous at SourceForge

gfarm-admin commented 12 years ago

Replying to n-soda:

another symptom is reported by tatebe@-san as #490 - incomplete replica is left, if resolving hostname of the source filesystem node fails at a client-initiated replication

This was not a bug, but starvation due to memory shortage.
the summary field is changed to:
incomplete replica is left, when virtual memory space for mmap(2) is exhausted

Original comment by: n-soda

gfarm-admin commented 12 years ago

Replying to n-soda:

Replying to n-soda:

add #500 - incomplete replica is left, if malloc(3) fails

fixed in r6898 (only exists on the main trunk)

another problem (which exists only on the 2.5 and 2.5.7 branches) was fixed in r6909 (2.5 branch)

Original comment by: n-soda

gfarm-admin commented 6 years ago

all of the subtickets have been closed as of gfarm-2.5.8-rc1

Original comment by: n-soda