Open MichaelBrim opened 1 year ago
@MichaelBrim Thanks for the catch. I now use different filenames, as shown below, and added prints to confirm.
[lassen30:49842] *** Process received signal ***
[lassen30:49842] Signal: Aborted (6)
[lassen30:49842] Signal code: (-6)
creating /unifyfs/test.dat_1_of_2 rank 1
flushing to /dev/shm/test.dat_1_of_2 rank 1
Running transfer
2023-04-02T12:07:19 tid=22855 @ forward_to_server() [margo_client.c:234] margo_forward_timed() failed - HG_TIMEOUT
2023-04-02T12:07:19 tid=22855 @ invoke_client_transfer_rpc() [margo_client.c:615] forward of transfer rpc to server failed
unifyfs-bug: /g/g92/haridev/project/unifyfs-bug/bug.cpp:139: int main(int, char**): Assertion `rc == UNIFYFS_SUCCESS' failed.
creating /unifyfs/test.dat_0_of_2 rank 0
flushing to /dev/shm/test.dat_0_of_2 rank 0
Running transfer
[lassen29:22855] *** Process received signal ***
[lassen29:22855] Signal: Aborted (6)
[lassen29:22855] Signal code: (-6)
<< Rank 1: Generating lwcore_cpu.4688505_20.1 on lassen30 Sun Apr 2 12:07:19 PDT 2023 (LLNL_COREDUMP_FORMAT_CPU=lwcore) >>
<< Rank 0: Generating lwcore_cpu.4688505_20.0 on lassen29 Sun Apr 2 12:07:19 PDT 2023 (LLNL_COREDUMP_FORMAT_CPU=lwcore) >>
<< Rank 0: Generated lwcore_cpu.4688505_20.0 on lassen29 Sun Apr 2 12:07:20 PDT 2023 in 1 secs >>
<< Rank 1: Generated lwcore_cpu.4688505_20.1 on lassen30 Sun Apr 2 12:07:20 PDT 2023 in 1 secs >>
<< Rank 0: Waiting 60 secs before aborting task on lassen29 Sun Apr 2 12:07:20 PDT 2023 (LLNL_COREDUMP_WAIT_FOR_OTHERS=60) >>
<< Rank 1: Waiting 60 secs before aborting task on lassen30 Sun Apr 2 12:07:20 PDT 2023 (LLNL_COREDUMP_WAIT_FOR_OTHERS=60) >>
It still has issues.
https://github.com/hariharan-devarajan/unifyfs-bug/blob/a8f6ac153baa3871bcc7cb3a16ccfd44b1d3b7f5/bug.cpp#L19
@hariharan-devarajan What I see in the server logs is that the gfid for the transfer requests is the same for both transfers. I tracked this back to using the same filename (/unifyfs/test.dat) in both application ranks, which ends up causing two parallel transfers on the same file, and things go badly. I see both transfer threads progressing, but eventually the lassen22 log output just stops once each transfer thread has read about 1/4 of the file, which I suspect is due to a server process crash.