Open gfarm-admin opened 12 years ago
Original comment by: n-soda
A workaround was added in r6469 and r6470 on the main trunk,
and in r6471 on the 2.5 release branch, as follows:
gfsd and gfmd now log this condition (although gfmd's log entry is not guaranteed),
so a system administrator can rescue the replica
(because the replica should remain with the old generation number).
It is recommended that a monitoring system watch for these error logs.
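As a rough illustration of the monitoring recommendation above, the following sketch filters log lines for the message text quoted in this ticket ("error occurred during close operation for writing"). The function name `find_close_errors` and the idea of feeding it plain text lines are assumptions made for this example; the actual log destination and line layout depend on the site's syslog configuration and are not specified here.

```python
from typing import Iterable, Iterator

# Message text as quoted in this ticket. Everything else on the log line
# (timestamp, host, message number) is site-dependent and not assumed here.
CLOSE_ERROR_TEXT = "error occurred during close operation for writing"


def find_close_errors(lines: Iterable[str]) -> Iterator[str]:
    """Yield every log line that contains the close-for-write error text."""
    for line in lines:
        if CLOSE_ERROR_TEXT in line:
            # Strip only the trailing newline so the line can be re-reported as-is.
            yield line.rstrip("\n")
```

A monitoring job could pass an open log file (or any iterable of lines) to `find_close_errors` and raise an alert when it yields anything, so that an administrator can rescue the replica with the old generation number.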
Original comment by: otatebe
This happened when the load average of gfmd was extremely high,
a gfsd closed its network connection to gfmd
due to the network_receive_timeout condition
(i.e. gfmd nearly hung due to the high load average),
and #420 (a site-wide network failure) happened at the same time.
If this problem happens, one of the following message numbers will be logged with "error occurred during close operation for writing" on the filesystem node:
In that case, the following operations are strongly recommended:
For all problems which may cause "lost all replicas", see meta ticket #474.
Reported by: n-soda
Original Ticket: gfarm/tickets/419