One possible cause is fixed in r6567 on the main trunk and r6568 on the 2.5 release branch,
but we are sure that this is not the real cause of this ticket,
because the following message was not logged in the affected environment:
<err> [1002260] removing old replica <inum>:<igen> host <host>: no memory
Original comment by: n-soda
Kiyoshi Imai-san found that the reference count of struct peer is not maintained correctly,
i.e., the value becomes too high.
The RPC disconnect callback is not called because of this problem.
Original comment by: n-soda
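As an illustration of the pattern described above (all names and structure here are hypothetical, not the actual gfmd source): if some path takes a reference on struct peer but never releases it, the count never reaches zero, so the disconnect callback that should run at zero is never invoked.

#include <stddef.h>

/* hypothetical sketch of a refcount leak; not the actual gfmd source */
struct peer {
	int refcount;
	void (*disconnect_cb)(struct peer *);
};

static int do_rpc(struct peer *p) { (void)p; return -1; }   /* stub: a failing RPC */
static void peer_add_ref(struct peer *p) { ++p->refcount; }

static void
peer_del_ref(struct peer *p)
{
	if (--p->refcount == 0 && p->disconnect_cb != NULL)
		p->disconnect_cb(p);   /* cleanup runs only when the count hits zero */
}

static int
handle_request(struct peer *p)
{
	peer_add_ref(p);
	if (do_rpc(p) != 0)
		return -1;   /* BUG: early return skips peer_del_ref(), so the
		              * count stays too high and the disconnect
		              * callback never runs */
	peer_del_ref(p);
	return 0;
}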
Originally posted by n-soda, in reply to #2:
Replying to n-soda:
Kiyoshi Imai-san found that the reference count of struct peer is not maintained correctly,
i.e., the value becomes too high.
There are two places where struct peer leaks, both found by Kiyoshi Imai-san.
Fixing problem 2 is somewhat hard, because it's a chicken-or-egg problem.
Original comment by: *anonymous at SourceForge
Replying to n-soda:
1. the peer != peer0 case of gfm_server_channel_vput_reply() on the 2.5 branch and async_server_vput_wrapped_reply() on the main trunk.
Problem 1 is fixed in r6664 on the 2.5 branch and r6665 on the main trunk.
Thus, the main trunk may not have this issue anymore.
2. host_replicating_new() on the 2.5 branch since r6534 (gfarm-2.5.6); the main trunk doesn't have this problem.
Fixing problem 2 is somewhat hard, because it's a chicken-or-egg problem.
It is not fixed yet; thus, the 2.5 branch still has the issue.
Original comment by: n-soda
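A hedged illustration of the problem-1 pattern (the helper names below are assumed; the real gfm_server_channel_vput_reply() and async_server_vput_wrapped_reply() differ in detail): a reply routine obtains a referenced peer, and the branch taken only when that peer differs from the caller's peer0 returns without releasing it.

/* hypothetical illustration of the problem-1 leak, not the real routines */
struct peer;
struct peer *peer_for_reply(struct peer *peer0);   /* assumed: returns a referenced peer */
void peer_del_ref(struct peer *peer);
int send_reply(struct peer *peer);

static int
put_reply(struct peer *peer0)
{
	struct peer *peer = peer_for_reply(peer0);
	int e = send_reply(peer);

	if (peer != peer0)
		return e;   /* BUG: the reference taken by peer_for_reply()
		             * is never dropped on this branch */
	peer_del_ref(peer);
	return e;
}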
Problem 2 is fixed in r6694 on the 2.5 release branch.
NOTE:
We don't plan to apply this change to the main trunk,
because the main trunk doesn't have problem 2
and the change is somewhat ugly.
Original comment by: n-soda
Replying to n-soda:
The fix for problem 2 (r6670 and r6671) was wrong,
because the async-RPC-cleanup routine accesses the DB, but the code doesn't acquire the giant lock.
Not only that, it has a race condition: peer->async may be cleared while abstract_host_sender_lock() or abstract_host_receiver_lock() is being acquired.
The async-RPC-cleanup routine has to be called after peer->refcount becomes 0 to prevent such a race condition.
Original comment by: n-soda
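A sketch of the ordering constraint stated above; giant_lock(), giant_unlock() and async_rpc_cleanup() are assumed names here, not necessarily the real ones. The point is to drop the reference first, and only run the cleanup once the count has reached zero, holding the giant lock because the cleanup accesses the DB.

#include <pthread.h>

/* sketch of the required ordering; helper names are assumptions */
struct peer {
	pthread_mutex_t mutex;
	int refcount;
	void *async;
};

void giant_lock(void);
void giant_unlock(void);
void async_rpc_cleanup(struct peer *p);

static void
peer_del_ref(struct peer *p)
{
	int last;

	pthread_mutex_lock(&p->mutex);
	last = (--p->refcount == 0);
	pthread_mutex_unlock(&p->mutex);

	if (last) {
		giant_lock();           /* the cleanup accesses the DB */
		async_rpc_cleanup(p);   /* safe: with refcount == 0, nothing
		                         * can clear p->async concurrently */
		giant_unlock();
	}
}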
The following symptom is still observed. From Kiyoshi Imai-san:
$ gfwhere -arl /|grep -- '- i'
dhcp-167-248 284 - i -
dhcp-167-248 224 - i -
dhcp-167-248 294 - i -
dhcp-167-248 255 - i -
dhcp-167-248 185 - i -
dhcp-167-248 187 - i -
dhcp-167-248 1164 - i -
(gdb) p *peer_closing_queue
Structure has no component named operator*.
(gdb) p peer_closing_queue
$7 = {mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 1, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
__size = '\000' <repeats 12 times>, "\001", '\000' <repeats 26 times>, __align = 0}, ready_to_close = {__data = {__lock = 0, __futex = 4707107,
__total_seq = 2353554, __wakeup_seq = 2353553, __woken_seq = 2353553, __mutex = 0x693480, __nwaiters = 2, __broadcast_seq = 0},
__size = "\000\000\000\000#\323G\000\222\351#\000\000\000\000\000\221\351#\000\000\000\000\000\221\351#\000\000\000\000\000\200\064i\000\000\000\000\000\002\000\000\000\000\000\000", __align = 20216870623772672}, head = 0x7ffead400350, tail = 0x7ffead400350}
(gdb) p *peer_closing_queue->head
$8 = {next_close = 0x0, refcount = 0, replication_refcount = 1, conn = 0x11b5e60, async = 0x0, id_type = GFARM_AUTH_ID_TYPE_SPOOL_HOST,
username = 0x7ffea0000d00 "_gfarmfs", hostname = 0x0, user = 0x0, host = 0xfa92f0, process = 0x0, protocol_error = 1, protocol_error_mutex = {__data = {__lock = 0,
__count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0},
watcher = 0xfa8770, readable_event = 0x11bdfb0, pstate = {nesting_level = 0, cs = {current_part = 0, cause = 0, skip = 0}}, fd_current = -1, fd_saved = -1,
flags = 0, findxmlattrctx = 0x0, pending_new_generation = 0x0, u = {client = {jobs = 0x0}}, replication_mutex = {__data = {__lock = 0, __count = 0, __owner = 0,
__nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0},
simultaneous_replication_receivers = 1, replicating_inodes = {prev_inode = 0x7ffe94012750, next_inode = 0x7ffe94012750, prev_host = 0x0, next_host = 0x0,
peer = 0x0, dst = 0x0, handle = 0, inode = 0x0, igen = 0, cleanup = 0x0}, cookies = {id = 0, hcircleq = {hcqe_next = 0x7ffead400498, hcqe_prev = 0x7ffead400498}}}
Original comment by: n-soda
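Note that in the dump above, the queued peer has refcount = 0 but replication_refcount = 1. A minimal sketch (hypothetical code, not the real gfmd closer) of why such a peer can sit on peer_closing_queue indefinitely: the closer frees a peer only when both counts are zero, so a replication reference that is never released blocks it forever.

#include <stddef.h>

/* hypothetical closer loop, illustrating a stuck peer; not gfmd code */
struct peer {
	struct peer *next_close;
	int refcount;
	int replication_refcount;
};

static struct peer *closing_head;

static void peer_free(struct peer *p) { (void)p; /* stub */ }

static void
close_ready_peers(void)
{
	struct peer **pp = &closing_head, *p;

	while ((p = *pp) != NULL) {
		if (p->refcount == 0 && p->replication_refcount == 0) {
			*pp = p->next_close;   /* unlink and free */
			peer_free(p);
		} else {
			/* a peer whose replication_refcount is never
			 * decremented stays here indefinitely */
			pp = &p->next_close;
		}
	}
}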
The 2.5 branch should be fixed in r6799.
Thanks to Kiyoshi Imai-san for finding the cause!
Original comment by: n-soda
Replying to n-soda:
Another symptom was reported by tatebe-san as #490 - an incomplete replica is left if resolving the hostname of the source filesystem node fails during a client-initiated replication.
That was not a bug, but starvation due to memory shortage.
The summary field has been changed to:
incomplete replica is left, when virtual memory space for mmap(2) is exhausted
Original comment by: n-soda
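For illustration only (an assumed sketch, not Gfarm code): the new summary refers to the case where mmap(2) fails because the virtual address space is exhausted; a reader that maps file data should treat MAP_FAILED (e.g. ENOMEM) as a hard error, so the replication fails cleanly instead of leaving an incomplete replica behind.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>

/* assumed sketch, not Gfarm code: map a chunk of the source file and
 * fail cleanly when the virtual address space is exhausted (ENOMEM) */
static void *
map_chunk(int fd, size_t len, off_t off)
{
	void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, off);

	if (p == MAP_FAILED) {
		fprintf(stderr, "mmap: %s\n", strerror(errno));
		return NULL;   /* caller should abort and report the error,
		                * rather than leave an incomplete replica */
	}
	return p;
}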
All of the subtickets have been closed as of gfarm-2.5.8-rc1.
Original comment by: n-soda
This is a meta ticket for "incomplete replica is left"; possible causes are:
Same Ticket on Different Branches
Original Condition to Reproduce This Problem
Some replication doesn't complete, as follows:
Configuration:
Reported by: n-soda
Original Ticket: gfarm/tickets/408