hariharan-devarajan / unifyfs-bug

Per-process files use same file name #3

Open MichaelBrim opened 1 year ago

MichaelBrim commented 1 year ago

https://github.com/hariharan-devarajan/unifyfs-bug/blob/a8f6ac153baa3871bcc7cb3a16ccfd44b1d3b7f5/bug.cpp#L19

@hariharan-devarajan The server logs show the same gfid for both transfer requests. I tracked this back to both application ranks using the same filename (/unifyfs/test.dat), which results in two parallel transfers of the same file, and things go badly. Both transfer threads make progress, but the lassen22 log output eventually just stops after each thread has read about 1/4 of the file, which I suspect is due to a server process crash.
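
One way to avoid the collision is to derive the file name from the rank so that each process transfers its own file. The following is a minimal sketch, not the code from bug.cpp: `make_rank_filename` is a hypothetical helper, and the `<base>_<rank>_of_<nranks>` pattern is an assumption modeled on the names that appear in the output later in this thread.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of bug.cpp or the UnifyFS API):
// builds a distinct path per rank so no two ranks share a file.
std::string make_rank_filename(const std::string& base, int rank, int nranks) {
    // Precondition: rank must be a valid index into the job's ranks.
    assert(nranks > 0 && rank >= 0 && rank < nranks);
    return base + "_" + std::to_string(rank) + "_of_" + std::to_string(nranks);
}
```

In an MPI program, `rank` and `nranks` would typically come from `MPI_Comm_rank` and `MPI_Comm_size`, and the result would replace the shared `/unifyfs/test.dat` path.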

hariharan-devarajan commented 1 year ago

@MichaelBrim Thanks for the catch. I now use a different filename for each rank and added prints to confirm it; see the output below.

[lassen30:49842] *** Process received signal ***
[lassen30:49842] Signal: Aborted (6)
[lassen30:49842] Signal code:  (-6)
creating /unifyfs/test.dat_1_of_2 rank 1
flushing to /dev/shm/test.dat_1_of_2 rank 1
Running transfer
2023-04-02T12:07:19 tid=22855 @ forward_to_server() [margo_client.c:234] margo_forward_timed() failed - HG_TIMEOUT
2023-04-02T12:07:19 tid=22855 @ invoke_client_transfer_rpc() [margo_client.c:615] forward of transfer rpc to server failed
unifyfs-bug: /g/g92/haridev/project/unifyfs-bug/bug.cpp:139: int main(int, char**): Assertion `rc == UNIFYFS_SUCCESS' failed.
creating /unifyfs/test.dat_0_of_2 rank 0
flushing to /dev/shm/test.dat_0_of_2 rank 0
Running transfer
[lassen29:22855] *** Process received signal ***
[lassen29:22855] Signal: Aborted (6)
[lassen29:22855] Signal code:  (-6)
<< Rank 1: Generating lwcore_cpu.4688505_20.1 on lassen30 Sun Apr  2 12:07:19 PDT 2023 (LLNL_COREDUMP_FORMAT_CPU=lwcore) >>
<< Rank 0: Generating lwcore_cpu.4688505_20.0 on lassen29 Sun Apr  2 12:07:19 PDT 2023 (LLNL_COREDUMP_FORMAT_CPU=lwcore) >>
<< Rank 0:  Generated lwcore_cpu.4688505_20.0 on lassen29 Sun Apr  2 12:07:20 PDT 2023 in 1 secs >>
<< Rank 1:  Generated lwcore_cpu.4688505_20.1 on lassen30 Sun Apr  2 12:07:20 PDT 2023 in 1 secs >>
<< Rank 0: Waiting 60 secs before aborting task on lassen29 Sun Apr  2 12:07:20 PDT 2023 (LLNL_COREDUMP_WAIT_FOR_OTHERS=60) >>
<< Rank 1: Waiting 60 secs before aborting task on lassen30 Sun Apr  2 12:07:20 PDT 2023 (LLNL_COREDUMP_WAIT_FOR_OTHERS=60) >>

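It still fails: the transfer RPC forward times out (HG_TIMEOUT), the `rc == UNIFYFS_SUCCESS` assertion fails, and both ranks abort.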