oss-tsukuba / gfarm

distributed file system for large-scale cluster computing and wide-area data sharing. provides fine-grained replica location control.

[trunk] gfmd crashes with: <err> [1000425] pgsql_deadfilecopy_add: INSERT INTO DeadFileCopy ...[snip]...: ERROR: duplicate key value ...[snip]... #407

Open gfarm-admin opened 12 years ago

gfarm-admin commented 12 years ago
  1. set xattr "gfarm.ncopy" to 3
  2. invoke 100 gfarm2fs processes each on 2 hosts.
  3. sequentially create 1000 files in every gfarm2fs process (i.e. 100 * 2 = 200 processes). since both hosts use the same file naming scheme, each file will be written 2 times.

(i.e. GFM_PROTO_CLOSE_WRITE will be issued 200,000 times)
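
The per-process part of this workload can be approximated with a small script. This is only a sketch: the file-name pattern, the payload and the function name are assumptions, not taken from the original test. To match the report it would be run once per gfarm2fs mount, with process indexes 0..99 on each of the 2 hosts, giving 100 * 1000 * 2 = 200,000 close-after-write operations.

```python
import os


def create_files(mount_point, proc_index, count=1000, payload=b"x"):
    """Sequentially create `count` files under one gfarm2fs mount point.

    The names depend only on the process index (hypothetical pattern),
    not on the host, so the process with the same index on the other
    host writes the same files a second time.
    """
    created = 0
    for i in range(count):
        name = "p%03d-f%04d" % (proc_index, i)
        # Create (or truncate) the file and write to it; the close that
        # follows the write is what gfmd sees as a close-write request.
        with open(os.path.join(mount_point, name), "wb") as f:
            f.write(payload)
        created += 1
    return created
```

For example, `create_files("/mnt/gfarm2fs-017", 17)` would be the call for process 17, where the mount path is again a made-up name.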

this makes gfmd crash as follows:

<warning> [1002479] file_replicating_new: host rgfsd107: resource temporarily unavailable
<warning> [1002479] file_replicating_new: host ossgfsd26: resource temporarily unavailable
<warning> [1002479] file_replicating_new: host univ-gfsd303: resource temporarily unavailable
<warning> [1002479] file_replicating_new: host univ-gfsd306: resource temporarily unavailable
<notice> [1000721] (_gfarmfs@ossgfsd37) authenticated: auth=gsi_auth local_user=@host@ DN="/C=JP/O=foo/OU=bar/CN=gfsd/ossgfsd37"
<notice> [1000721] (_gfarmfs@ossgfsd24) authenticated: auth=gsi_auth local_user=@host@ DN="/C=JP/O=foo/OU=bar/CN=gfsd/ossgfsd24"
<notice> [1000721] (_gfarmfs@ossgfsd18) authenticated: auth=gsi_auth local_user=@host@ DN="/C=JP/O=foo/OU=bar/CN=gfsd/ossgfsd18"
<notice> [1000286] (_gfarmfs@ossgfsd24) disconnected
<notice> [1000044] (_gfarmfs@univ-gfsd201) authenticated: auth=sharedsecret local_user=_gfarmfs
<err> [1002410] orphan replication (univ-gfsd201, 486977:25): s=2 d=0 size:0 maybe the connection had a problem?
<err> [1002410] orphan replication (univ-gfsd201, 513447:22): s=2 d=0 size:0 maybe the connection had a problem?
<err> [1002410] orphan replication (univ-gfsd201, 458760:31): s=2 d=0 size:0 maybe the connection had a problem?
<notice> [1000286] (_gfarmfs@ossgfsd42) disconnected
<err> [1002410] orphan replication (univ-gfsd201, 529975:21): s=2 d=0 size:0 maybe the connection had a problem?
<err> [1002410] orphan replication (univ-gfsd201, 530685:23): s=2 d=0 size:0 maybe the connection had a problem?
<notice> [1000286] (_gfarmfs@univ-gfsd301) disconnected
<notice> [1000286] (_gfarmfs@ossgfsd21) disconnected
<notice> [1000721] (_gfarmfs@ossgfsd27) authenticated: auth=gsi_auth local_user=@host@ DN="/C=JP/O=foo/OU=bar/CN=gfsd/ossgfsd27"
<err> [1002410] orphan replication (ossgfsd27, 455603:20): s=2 d=0 size:0 maybe the connection had a problem?
<err> [1002410] orphan replication (ossgfsd27, 457644:21): s=2 d=0 size:0 maybe the connection had a problem?
<err> [1002410] orphan replication (ossgfsd27, 139257:30): s=2 d=0 size:0 maybe the connection had a problem?
<err> [1002410] orphan replication (ossgfsd27, 458011:57): s=2 d=0 size:0 maybe the connection had a problem?
<notice> [1000721] (_gfarmfs@ossgfsd28) authenticated: auth=gsi_auth local_user=@host@ DN="/C=JP/O=foo/OU=bar/CN=gfsd/ossgfsd28"
<err> [1002410] orphan replication (ossgfsd27, 531192:25): s=2 d=0 size:0 maybe the connection had a problem?
<err> [1002410] orphan replication (ossgfsd27, 459289:19): s=2 d=0 size:0 maybe the connection had a problem?
<err> [1002410] orphan replication (ossgfsd27, 487431:19): s=2 d=0 size:0 maybe the connection had a problem?
<err> [1002410] orphan replication (ossgfsd27, 513294:31): s=2 d=0 size:0 maybe the connection had a problem?
<err> [1002410] orphan replication (ossgfsd27, 489114:19): s=2 d=0 size:0 maybe the connection had a problem?
<notice> [1000286] (_gfarmfs@ossgfsd30) disconnected
<notice> [1000286] (_gfarmfs@ossgfsd19) disconnected
<notice> [1000286] (_gfarmfs@ossgfsd18) disconnected
<notice> [1000286] (_gfarmfs@ossgfsd42) disconnected
<notice> [1000286] (_gfarmfs@ossgfsd38) disconnected
<notice> [1000286] (_gfarmfs@ossgfsd22) disconnected
<err> [1002410] orphan replication (univ-gfsd201, 532187:29): s=2 d=0 size:0 maybe the connection had a problem?
<notice> [1000286] (_gfarmfs@ossgfsd31) disconnected
<err> [1002410] orphan replication (univ-gfsd201, 487317:21): s=2 d=0 size:0 maybe the connection had a problem?
<notice> [1000286] (_gfarmfs@ossgfsd25) disconnected
<err> [1002410] orphan replication (ossgfsd27, 456041:20): s=2 d=0 size:0 maybe the connection had a problem?
<err> [1002410] orphan replication (ossgfsd27, 531028:21): s=2 d=0 size:0 maybe the connection had a problem?
<notice> [1000286] (_gfarmfs@univ-gfsd107) disconnected
<notice> [1000286] (_gfarmfs@ossgfsd21) disconnected
<notice> [1000286] (_gfarmfs@ossgfsd25) disconnected
<notice> [1000286] (_gfarmfs@ossgfsd38) disconnected
<err> [1002410] orphan replication (univ-gfsd201, 528663:23): s=2 d=0 size:0 maybe the connection had a problem?
<notice> [1000286] (_gfarmfs@ossgfsd23) disconnected
<notice> [1000286] (_gfarmfs@ossgfsd19) disconnected
<err> [1002257] error at 529032:31 replication to rgfsd014: src=71 dst=0
<info> [1002433] cannot remove an incomplete replica (rgfsd014, 529032:31): probably already removed
<err> [1002257] error at 530674:35 replication to univ-gfsd206: src=71 dst=0
<info> [1002433] cannot remove an incomplete replica (univ-gfsd206, 530674:35): probably already removed
<notice> [1000286] (_gfarmfs@ossgfsd44) disconnected
<warning> [1002479] file_replicating_new: host univ-gfsd404: resource temporarily unavailable
<warning> [1002479] file_replicating_new: host univ-gfsd307: resource temporarily unavailable
<warning> [1002479] file_replicating_new: host rgfsd108: resource temporarily unavailable
<warning> [1002479] file_replicating_new: host rgfsd009: resource temporarily unavailable
<warning> [1002479] file_replicating_new: host rgfsd008: resource temporarily unavailable
<warning> [1002479] file_replicating_new: host rgfsd012: resource temporarily unavailable
<warning> [1002479] file_replicating_new: host ossgfsd41: resource temporarily unavailable
<warning> [1002479] file_replicating_new: host rgfsd009: resource temporarily unavailable
<err> [1000425] pgsql_deadfilecopy_add: INSERT INTO DeadFileCopy (inumber, igen, hostname) VALUES ($1, $2, $3): ERROR:  duplicate key value violates unique constraint "deadfilecopy_pkey"#012DETAIL:  Key (inumber, igen, hostname)=(487610, 91, univ-gfsd203) already exists.
<err> [1003180] db_journal_store_thread : seqnum=34402547 ope=DEADFILECOPY_ADD : already exists
<err> [1003188] failed to store to db : already exists
<err> [1003397] gfmd is shutting down for unrecoverable error
<info> [1003405] backtrace symbols [1/5]: /opt/gfarm-2_5_5/lib/libgfarm.so.1(gfarm_log_backtrace_symbols+0x1f) [0x7f63f5c1f63f]
<info> [1003405] backtrace symbols [2/5]: /opt/gfarm-2_5_5/lib/libgfarm.so.1(gflog_fatal_message+0x81) [0x7f63f5c21ed1]
<info> [1003405] backtrace symbols [3/5]: /opt/gfarm-2_5_5/sbin/gfmd(db_journal_store_thread+0x25d) [0x45714d]
<info> [1003405] backtrace symbols [4/5]: /lib64/libpthread.so.0() [0x3ef4e077f1]
<info> [1003405] backtrace symbols [5/5]: /lib64/libc.so.6(clone+0x6d) [0x3ef46e570d]

configuration:


from reading the source code, 4 problematic cases are found as follows:

Reported by: n-soda

Original Ticket: gfarm/tickets/407

gfarm-admin commented 12 years ago

Original comment by: n-soda

gfarm-admin commented 12 years ago

on the 2.5 release branch,
PROBLEM-1, PROBLEM-2 and PROBLEM-3 are fixed in r6497,
and a workaround for PROBLEM-4 was added in r6498

Original comment by: n-soda

gfarm-admin commented 12 years ago

r6497 (for PROBLEM-1, PROBLEM-2 and PROBLEM-3) was merged into the main trunk in r6557,
but r6498 (for PROBLEM-4) wasn't, in order to look for a better fix.

Original comment by: n-soda

gfarm-admin commented 12 years ago

Originally posted by: in reply to: #2; n-soda

Replying to n-soda:

on the 2.5 release branch,
and a workaround for PROBLEM-4 was added in r6498

r6498 is not enough, because if the failing INSERT is part of a transaction, the other operations in that transaction fail as follows:

<warning> [1003506] pgsql_deadfilecopy_add: INSERT INTO DeadFileCopy (inumber, igen, hostname) VALUES ($1, $2, $3): ERROR:  duplicate key value violates unique constraint "deadfilecopy_pkey"#012DETAIL:  Key (inumber, igen, hostname)=(283010, 1731, rgfsd003) already exists.
<err> [1000426] pgsql_filecopy_remove: DELETE FROM FileCopy WHERE inumber = $1 AND hostname = $2: ERROR:  current transaction is aborted, commands ignored until end of transaction block
<err> [1003180] db_journal_store_thread : seqnum=261574806 ope=FILECOPY_REMOVE : unknown error
<err> [1003188] failed to store to db : unknown error
<err> [1003397] gfmd is shutting down for unrecoverable error
<info> [1003405] backtrace symbols [1/6]: /opt/gfarm-2.5.6/lib/libgfarm.so.1(gfarm_log_backtrace_symbols+0x1f) [0x7f78c682ce0f]
<info> [1003405] backtrace symbols [2/6]: /opt/gfarm-2.5.6/lib/libgfarm.so.1(gfarm_log_fatal_action+0x35) [0x7f78c682f925]
<info> [1003405] backtrace symbols [3/6]: /opt/gfarm-2.5.6/lib/libgfarm.so.1(gflog_fatal_message+0x83) [0x7f78c682fbd3]
<info> [1003405] backtrace symbols [4/6]: /opt/gfarm-2.5.6/sbin/gfmd(db_journal_store_thread+0x25d) [0x4599ed]
<info> [1003405] backtrace symbols [5/6]: /lib64/libpthread.so.0() [0x3ef4e077f1]
<info> [1003405] backtrace symbols [6/6]: /lib64/libc.so.6(clone+0x6d) [0x3ef46e570d]
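
The second error in the log above is standard PostgreSQL behaviour: once one statement in a transaction block fails, every later statement is rejected until the transaction ends. A workaround that merely tolerates the duplicate-key error (here it is only logged as a warning) therefore cannot save the remaining operations. The toy model below imitates that behaviour in plain Python. It is neither PostgreSQL nor gfmd code, and all class and function names are made up for illustration.

```python
class ToyTransaction:
    """Toy stand-in for a PostgreSQL transaction block (hypothetical)."""

    def __init__(self, existing_keys):
        self.keys = set(existing_keys)  # rows already in DeadFileCopy
        self.aborted = False

    def insert_deadfilecopy(self, key):
        if self.aborted:
            raise RuntimeError("current transaction is aborted")
        if key in self.keys:
            self.aborted = True  # the error poisons the whole transaction
            raise KeyError("duplicate key value")
        self.keys.add(key)

    def remove_filecopy(self, inumber, hostname):
        if self.aborted:
            raise RuntimeError("current transaction is aborted")
        return "removed"


def run_with_ignored_duplicate(txn, key):
    """Swallow the duplicate-key error on the client side, then continue."""
    try:
        txn.insert_deadfilecopy(key)
    except KeyError:
        pass  # the client ignores the error...
    try:
        return txn.remove_filecopy(key[0], key[2])
    except RuntimeError as e:
        return str(e)  # ...but the next statement still fails
```

With the key from the log, `run_with_ignored_duplicate(ToyTransaction({(283010, 1731, "rgfsd003")}), (283010, 1731, "rgfsd003"))` ends in the "aborted" branch, while the same call on an empty table reaches the remove step.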

Original comment by: *anonymous at SourceForge

gfarm-admin commented 12 years ago

Originally posted by: in reply to: #4; n-soda

Replying to n-soda:

r6498 is not enough, because if the failing INSERT is part of a transaction, the other operations in that transaction fail as follows:

r6498 was backed out in r6598,
and an alternative workaround written in SQL was added in r6599.
the SQL workaround should be far safer, because it makes the transaction succeed.
one caveat is that this workaround completely hides the problem.
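
The SQL actually added in r6599 is not shown in this ticket. As a rough illustration of the general idea only, an insert can be written so that it does nothing when the row already exists; the statement then never raises a duplicate-key error and the surrounding transaction keeps going. The sketch below uses Python's sqlite3 module as a stand-in for PostgreSQL. The column types and the helper name `deadfilecopy_add` are assumptions.

```python
import sqlite3


def deadfilecopy_add(conn, inumber, igen, hostname):
    """Insert the row only if it is absent (hypothetical helper)."""
    conn.execute(
        "INSERT INTO DeadFileCopy (inumber, igen, hostname) "
        "SELECT ?, ?, ? WHERE NOT EXISTS ("
        "  SELECT 1 FROM DeadFileCopy"
        "  WHERE inumber = ? AND igen = ? AND hostname = ?)",
        (inumber, igen, hostname, inumber, igen, hostname),
    )


conn = sqlite3.connect(":memory:")
# Assumed schema: only the primary key columns named in the error message.
conn.execute(
    "CREATE TABLE DeadFileCopy ("
    " inumber INTEGER, igen INTEGER, hostname TEXT,"
    " PRIMARY KEY (inumber, igen, hostname))"
)
deadfilecopy_add(conn, 487610, 91, "univ-gfsd203")
deadfilecopy_add(conn, 487610, 91, "univ-gfsd203")  # duplicate: silently skipped
conn.commit()
```

The second call leaves the table unchanged and reports nothing, which is exactly the caveat: the duplicate request, and whatever caused it, becomes invisible.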

Original comment by: *anonymous at SourceForge

gfarm-admin commented 11 years ago

Original comment by: otatebe

gfarm-admin commented 11 years ago

Replying to n-soda:

and an alternative workaround written in SQL was added in r6599.
the SQL workaround should be far safer, because it makes the transaction succeed.
one caveat is that this workaround completely hides the problem.

r6599 was released as gfarm-2.5.7

Original comment by: n-soda

gfarm-admin commented 6 years ago

Diff:

Original comment by: n-soda

gfarm-admin commented 6 years ago

add "[trunk]" to the summary line, because the other branches have already been fixed. the reason why it's "[trunk]" rather than "[pullup-trunk]" is to investigate a better fix for PROBLEM-4.

Original comment by: n-soda