incomplete replica is left (meta ticket)

gfarm-admin commented 12 years ago

This is a meta ticket for "incomplete replica is left". Possible causes are:

#476 - incomplete replica is left, if memory shortage happens in remove_replica_entity() / comment:1 of this ticket (r6567 for trunk, r6568 for 2.5)
#478 - incomplete replica is left, if a back channel is disconnected just before a reply from gfmd to gfsd / problem 1 of comment:3 (r6664 for 2.5, r6665 for trunk)
#479 - incomplete replica is left due to a chicken-or-egg problem to maintain a reference count of struct peer / comment:10 and comment:12 - (r6694 and r6695 - both 2.5 only) / NOTE: comment:5 (r6670 for 2.5, r6671 for trunk) is backed out in comment:8 (r6690 for 2.5, r6687 for trunk), and part of comment:5 was re-applied as #454 (theoretical race condition about peer_free())
#480 - incomplete replica is left due to missing decrement of a reference count of struct peer / comment:6 (r6672 for 2.5 only)
#481 - incomplete replica is left at disconnection of a back channel, if the disconnection happens after successful GFS_PROTO_REPLICATION_REQUEST but before GFM_PROTO_REPLICATION_RESULT / comment:13 (r6799 for 2.5, not fixed on the trunk)
#482 - incomplete replica is left due to missing reply about GFS_PROTO_REPLICATION_REQUEST from gfsd to gfmd / comment:18 (r6807 for trunk, r6809 for 2.5)
#490 - incomplete replica is left, when virtual memory space for mmap(2) is exhausted / comment:26 (not a bug, just memory shortage)
#500 - incomplete replica is left, if malloc(3) fails / comment:27 (r6898 for trunk, r6909 for 2.5)

Same Ticket on Different Branches

2.5.7: #485

Original Condition to Reproduce this problem

set xattr "gfarm.ncopy" to 3
invoke one gfarm2fs process
sequentially create 1000 1GB-files under the gfarm2fs

some replications do not complete, as follows:

% gfwhere -al test_file.931:
    rgfsd007            26  - - -
    univ-gfsd402            26  - i -
    ossgfsd29           26  - - -

configuration:

gfarm version 2.5.5
metadata journaling enabled
number of filesystem nodes: 89

Reported by: n-soda

Original Ticket: gfarm/tickets/408

gfarm-admin commented 12 years ago

one possible cause is fixed in r6567 on the main trunk, and r6568 on the 2.5 release branch,
but we are sure that this is not really the cause of this ticket,
because the following message was not logged in the environment:

<err> [1002260] removing old replica <inum>:<igen> host <host>: no memory

Original comment by: n-soda

gfarm-admin commented 12 years ago

Kiyoshi Imai-san found that reference count of struct peer is not correctly maintained.
i.e. the value is too high.
RPC-disconnect-callback is not called due to this problem.
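
A minimal sketch of why an over-counted reference keeps the disconnect callback from running. This is not gfmd code: `struct peer_sketch` and the `peer_sketch_*` functions are hypothetical stand-ins, and the sketch assumes the callback runs only when the last reference is dropped.

```c
#include <assert.h>

/* Hypothetical stand-in for gfmd's struct peer; only what the sketch needs. */
struct peer_sketch {
    int refcount;                 /* number of outstanding references */
    int disconnect_callbacks_run; /* how often the callback has fired */
};

static void peer_sketch_add_ref(struct peer_sketch *p)
{
    p->refcount++;
}

/* Drop one reference; the callback fires only when the count reaches 0. */
static void peer_sketch_del_ref(struct peer_sketch *p)
{
    assert(p->refcount > 0);
    if (--p->refcount == 0)
        p->disconnect_callbacks_run++;
}

/* Take n_add references, drop n_del of them, report callback runs. */
static int peer_sketch_simulate(int n_add, int n_del)
{
    struct peer_sketch p = { 0, 0 };
    int i;

    for (i = 0; i < n_add; i++)
        peer_sketch_add_ref(&p);
    for (i = 0; i < n_del; i++)
        peer_sketch_del_ref(&p);
    return p.disconnect_callbacks_run;
}
```

With balanced calls, `peer_sketch_simulate(2, 2)` reports one callback run. With one missing `del_ref`, `peer_sketch_simulate(2, 1)` reports none, which mirrors the symptom described here.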

Original comment by: n-soda

gfarm-admin commented 12 years ago

Originally posted by: in reply to: #2; n-soda

version set to 2.4.0

Replying to n-soda:

Kiyoshi Imai-san found that reference count of struct peer is not correctly maintained.
i.e. the value is too high.

There are two places where struct peer leaks, found by Kiyoshi Imai-san.

1. peer != peer0 case of gfm_server_channel_vput_reply() on the 2.5 branch and async_server_vput_wrapped_reply() on the main trunk,
2. host_replicating_new() on the 2.5 branch since r6534 (gfarm-2.5.6). the main trunk doesn't have this problem.

Fixing problem 2 is somewhat hard, because it's a chicken-or-egg problem.

Original comment by: *anonymous at SourceForge

gfarm-admin commented 12 years ago

Replying to n-soda:

1. peer != peer0 case of gfm_server_channel_vput_reply() on the 2.5 branch and async_server_vput_wrapped_reply() on the main trunk,

The problem 1 is fixed in r6664 on the 2.5 branch, and r6665 on the main trunk.
Thus, the main trunk may not have this issue anymore.

2. host_replicating_new() on the 2.5 branch since r6534 (gfarm-2.5.6). the main trunk doesn't have this problem.

Fixing problem 2 is somewhat hard, because it's a chicken-or-egg problem.

This is not fixed yet, thus, 2.5 branch still has the issue.

Original comment by: n-soda

gfarm-admin commented 12 years ago

problem 2 is fixed in r6670 on the 2.5 release branch, and in r6671 on the main trunk.

Original comment by: n-soda

gfarm-admin commented 12 years ago

owner set to n-soda
status changed from new to accepted

another missing peer_del_ref() was found in host_replicating_new().
this problem was introduced as a fix for #426 - race condition about peer_free() makes gfmd crash,
and resolved in r6672.

Original comment by: n-soda

gfarm-admin commented 12 years ago

the fix for problem 2 (r6670 and r6671) was wrong,
because the async-RPC-cleanup routine does DB access, but the code doesn't acquire the giant lock
(the code before the change had acquired the giant lock before calling peer_free()).
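
The locking rule can be sketched roughly as follows. This is an illustration only, not gfmd code: `giant_lock_sketch`, `async_cleanup_sketch()` and both `peer_free_*_sketch()` functions are hypothetical names, and the "held" flag is a single-threaded simplification used to detect a cleanup that runs without the lock.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for gfmd's giant lock. */
static pthread_mutex_t giant_lock_sketch = PTHREAD_MUTEX_INITIALIZER;
static int giant_lock_held; /* single-threaded bookkeeping for the sketch */
static int db_updates;      /* counts simulated DB accesses */

static void giant_lock_acquire_sketch(void)
{
    pthread_mutex_lock(&giant_lock_sketch);
    giant_lock_held = 1;
}

static void giant_lock_release_sketch(void)
{
    giant_lock_held = 0;
    pthread_mutex_unlock(&giant_lock_sketch);
}

/* Cleanup that touches the DB: refuses to run unless the lock is held. */
static int async_cleanup_sketch(void)
{
    if (!giant_lock_held)
        return -1; /* the situation this comment describes */
    db_updates++;
    return 0;
}

/* Broken variant: cleanup reached without taking the giant lock. */
static int peer_free_unlocked_sketch(void)
{
    return async_cleanup_sketch();
}

/* Intended variant: take the giant lock around the cleanup. */
static int peer_free_locked_sketch(void)
{
    int rv;

    giant_lock_acquire_sketch();
    rv = async_cleanup_sketch();
    giant_lock_release_sketch();
    return rv;
}
```

In this sketch `peer_free_unlocked_sketch()` returns -1 and `peer_free_locked_sketch()` returns 0. The real code has no such check, so the unlocked path would silently race with other DB users.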

problem found by Kiyoshi Imai-san

Original comment by: n-soda

gfarm-admin commented 12 years ago

Replying to n-soda:

the fix for problem 2 (r6670 and r6671) was wrong,

r6670 (2.5) and r6671 (trunk) are backed out in r6690 (2.5) and r6687 (trunk)

Original comment by: n-soda

gfarm-admin commented 12 years ago

part of the changes in r6670 (2.5) and r6671 (trunk) is re-applied,
see #454 - theoretical race condition about peer_free()

Original comment by: n-soda

gfarm-admin commented 12 years ago

the problem 2 is fixed in r6694 on the 2.5 release branch.

NOTE:
We don't have a plan to apply this change to the main trunk,
because the main trunk doesn't have the problem 2,
and this change is somewhat ugly.

Original comment by: n-soda

gfarm-admin commented 12 years ago

Replying to n-soda:

the fix for problem 2 (r6670 and r6671) was wrong,
because the async-RPC-cleanup routine does DB access, but the code doesn't acquire the giant lock

not only that, it has a race condition: peer->async may be cleared while abstract_host_sender_lock() or abstract_host_receiver_lock() is held.
the async-RPC-cleanup routine has to be called after peer->refcount becomes 0 to prevent such a race condition.
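
The ordering requirement can be illustrated with a small single-threaded sketch. All names here (`struct peer_order_sketch`, `peer_order_disconnect()`, `peer_order_del_ref()`) are hypothetical, and real code would also need proper locking. The point shown is only that the cleanup is deferred until the last reference is gone, so a holder of a reference never sees `async` cleared underneath it.

```c
#include <assert.h>
#include <stddef.h>

struct async_state_sketch { int pending_requests; };

/* Hypothetical stand-in for struct peer. */
struct peer_order_sketch {
    int refcount;
    int closing;                      /* disconnect seen, cleanup pending */
    struct async_state_sketch *async; /* cleared by the cleanup routine */
};

/* The cleanup may run only when nobody holds a reference. */
static void async_cleanup_order_sketch(struct peer_order_sketch *p)
{
    assert(p->refcount == 0);
    p->async = NULL;
}

/* On disconnect: clean up now if unreferenced, otherwise defer. */
static void peer_order_disconnect(struct peer_order_sketch *p)
{
    p->closing = 1;
    if (p->refcount == 0)
        async_cleanup_order_sketch(p);
}

/* The last reference holder performs the deferred cleanup. */
static void peer_order_del_ref(struct peer_order_sketch *p)
{
    assert(p->refcount > 0);
    if (--p->refcount == 0 && p->closing)
        async_cleanup_order_sketch(p);
}

/* Returns 0 if async stayed valid while referenced and was cleared after. */
static int peer_order_demo(void)
{
    struct async_state_sketch a = { 1 };
    struct peer_order_sketch p = { 0, 0, &a };

    p.refcount++;                  /* e.g. a sender takes a reference */
    peer_order_disconnect(&p);     /* disconnect arrives meanwhile */
    if (p.async == NULL)
        return -1;                 /* this would be the race */
    p.async->pending_requests = 0; /* still safe to use */
    peer_order_del_ref(&p);        /* last reference dropped */
    return p.async == NULL ? 0 : -2;
}
```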

Original comment by: n-soda

gfarm-admin commented 12 years ago

r6694 had a bug (missing initialization), and that's fixed in r6695

Original comment by: n-soda

gfarm-admin commented 12 years ago

the following symptom is still observed. From Kiyoshi Imai-san:

$ gfwhere -arl /|grep -- '- i'
        dhcp-167-248         284      - i -
        dhcp-167-248         224      - i -
        dhcp-167-248         294      - i -
        dhcp-167-248         255      - i -
        dhcp-167-248         185      - i -
        dhcp-167-248         187      - i -
        dhcp-167-248        1164      - i -

(gdb) p *peer_closing_queue
Structure has no component named operator*.
(gdb) p peer_closing_queue
$7 = {mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 1, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
    __size = '\000' <repeats 12 times>, "\001", '\000' <repeats 26 times>, __align = 0}, ready_to_close = {__data = {__lock = 0, __futex = 4707107,
      __total_seq = 2353554, __wakeup_seq = 2353553, __woken_seq = 2353553, __mutex = 0x693480, __nwaiters = 2, __broadcast_seq = 0},
    __size = "\000\000\000\000#\323G\000\222\351#\000\000\000\000\000\221\351#\000\000\000\000\000\221\351#\000\000\000\000\000\200\064i\000\000\000\000\000\002\000\000\000\000\000\000", __align = 20216870623772672}, head = 0x7ffead400350, tail = 0x7ffead400350}
(gdb) p *peer_closing_queue->head
$8 = {next_close = 0x0, refcount = 0, replication_refcount = 1, conn = 0x11b5e60, async = 0x0, id_type = GFARM_AUTH_ID_TYPE_SPOOL_HOST,
  username = 0x7ffea0000d00 "_gfarmfs", hostname = 0x0, user = 0x0, host = 0xfa92f0, process = 0x0, protocol_error = 1, protocol_error_mutex = {__data = {__lock = 0,
      __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0},
  watcher = 0xfa8770, readable_event = 0x11bdfb0, pstate = {nesting_level = 0, cs = {current_part = 0, cause = 0, skip = 0}}, fd_current = -1, fd_saved = -1,
  flags = 0, findxmlattrctx = 0x0, pending_new_generation = 0x0, u = {client = {jobs = 0x0}}, replication_mutex = {__data = {__lock = 0, __count = 0, __owner = 0,
      __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0},
  simultaneous_replication_receivers = 1, replicating_inodes = {prev_inode = 0x7ffe94012750, next_inode = 0x7ffe94012750, prev_host = 0x0, next_host = 0x0,
    peer = 0x0, dst = 0x0, handle = 0, inode = 0x0, igen = 0, cleanup = 0x0}, cookies = {id = 0, hcircleq = {hcqe_next = 0x7ffead400498, hcqe_prev = 0x7ffead400498}}}

Original comment by: n-soda

gfarm-admin commented 12 years ago

2.5 branch should be fixed in r6799.
Thanks Kiyoshi Imai-san for finding the cause!

Original comment by: n-soda

gfarm-admin commented 12 years ago

Replying to n-soda:

2.5 branch should be fixed in r6799.
Thanks Kiyoshi Imai-san for finding the cause!

Imai-san tested r6799, and although it improved the situation,
some incomplete replicas still exist.
so, there must be another cause. ;-/

Original comment by: n-soda

gfarm-admin commented 12 years ago

r6799 contained a minor mistake in a diag message, and that's fixed in r6800.
(no functional change)

Original comment by: n-soda

gfarm-admin commented 12 years ago

description modified (diff)

start to convert this ticket to a meta ticket: split one of the problems to #476 as the first step

Original comment by: n-soda

gfarm-admin commented 12 years ago

the last cause (missing reply about GFS_PROTO_REPLICATION_REQUEST from gfsd to gfmd) is found and fixed by Kiyoshi Imai-san (Thanks!),
and that's committed as r6807 (trunk) and r6809 (2.5)

Original comment by: n-soda

gfarm-admin commented 12 years ago

Replying to n-soda:

and that's committed as r6807 (trunk) and r6809 (2.5)

this had an error which was made by me,
and that's fixed in r6810 (trunk) and r6811 (2.5).
the error was found by Kiyoshi Imai-san.

Original comment by: n-soda

gfarm-admin commented 12 years ago

description modified (diff)
summary changed from incomplete replica is left to incomplete replica is left (meta ticket)

split into #478, #479, #480, #481 and #482 as well.

Original comment by: n-soda

gfarm-admin commented 12 years ago

description modified (diff)

Original comment by: n-soda

gfarm-admin commented 12 years ago

description modified (diff)

another symptom is reported by tatebe@-san as #490 - incomplete replica is left, if resolving hostname of the source filesystem node fails at a client-initiated replication

Original comment by: n-soda

gfarm-admin commented 12 years ago

description modified (diff)

Original comment by: n-soda

gfarm-admin commented 12 years ago

description modified (diff)

add #500 - incomplete replica is left, if malloc(3) fails

Original comment by: n-soda

gfarm-admin commented 12 years ago

Originally posted by: in reply to: #24; n-soda

description modified (diff)

Replying to n-soda:

add #500 - incomplete replica is left, if malloc(3) fails

fixed in r6898 (only exists on the main trunk)

Original comment by: *anonymous at SourceForge

gfarm-admin commented 12 years ago

description modified (diff)

Replying to n-soda:

another symptom is reported by tatebe@-san as #490 - incomplete replica is left, if resolving hostname of the source filesystem node fails at a client-initiated replication

This was not a bug, but starvation due to memory shortage.
the summary field is changed to:
incomplete replica is left, when virtual memory space for mmap(2) is exhausted

Original comment by: n-soda

gfarm-admin commented 12 years ago

description modified (diff)

Replying to n-soda:

Replying to n-soda:

add #500 - incomplete replica is left, if malloc(3) fails

fixed in r6898 (only exists on the main trunk)

another problem (which exists only on 2.5 and 2.5.7 branch) was fixed in r6909 (2.5 branch)

Original comment by: n-soda

gfarm-admin commented 10 years ago

Milestone: gfarm-2.5.8.4 --> gfarm-2.5.8.5

Original comment by: otatebe

gfarm-admin commented 10 years ago

Milestone: gfarm-2.5.8.5 --> gfarm-2.5.8.6

Original comment by: otatebe

gfarm-admin commented 10 years ago

Milestone: gfarm-2.5.8.6 --> gfarm-2.5.8.7

Original comment by: otatebe

gfarm-admin commented 10 years ago

Milestone: gfarm-2.5.8.7 --> gfarm-2.5.8.8

Original comment by: otatebe

gfarm-admin commented 10 years ago

Milestone: gfarm-2.5.8.8 --> gfarm-2.5.8.9

Original comment by: otatebe

gfarm-admin commented 10 years ago

Milestone: gfarm-2.5.8.9 --> gfarm-2.5.8.10

Original comment by: otatebe

gfarm-admin commented 10 years ago

Milestone: gfarm-2.5.8.10 --> gfarm-2.5.8.11

Original comment by: otatebe

gfarm-admin commented 10 years ago

Milestone: gfarm-2.5.8.11 --> gfarm-2.5.8.12

Original comment by: otatebe

gfarm-admin commented 9 years ago

Milestone: gfarm-2.5.8.12 --> gfarm-2.5.8.13

Original comment by: otatebe

gfarm-admin commented 9 years ago

Milestone: gfarm-2.5.8.13 --> gfarm-2.5.8.14

Original comment by: otatebe

gfarm-admin commented 6 years ago

status: accepted --> closed
Milestone: gfarm-2.5.8.14 --> gfarm-2.5.8-rc1
Resolution: --> fixed

Original comment by: n-soda

gfarm-admin commented 6 years ago

all of the subtickets have been closed as of gfarm-2.5.8-rc1

Original comment by: n-soda
