hpc / mpifileutils

File utilities designed for scalability and performance.
https://hpc.github.io/mpifileutils
BSD 3-Clause "New" or "Revised" License

drm stops reporting output #530

Open cmharr opened 2 years ago

cmharr commented 2 years ago

When running drm on a large directory, I see it behave normally for some time and then cease to report, even though it continues to work. This happened two runs in a row: at the start of the first run it was updating output every 12-14 seconds, then it stopped reporting for over an hour and I manually killed the job. I re-ran the job and, as shown below, it ran for a little while before its output hung again; the timestamps show it never reported again in the roughly 48 minutes before I cancelled the job. Note that the job does continue to run even when it is not reporting, as evidenced by watching unlink() activity in ltop.

[root@slic:alustre2]# srun -p util -n 216 drm /p/alustre2/buser.old
srun: job 112172 queued and waiting for resources
srun: job 112172 has been allocated resources
[2022-05-24T09:17:20] Walking /p/alustre2/busr.old
[2022-05-24T09:17:38] Walked 21877426 items in 18.086 secs (1209623.536 items/sec) ...
[2022-05-24T09:17:39] Walked 40376165 items in 18.448 secs (2188636.847 items/sec) ...
[2022-05-24T09:17:39] Walked 40376165 items in 18.570 seconds (2174271.662 items/sec)
[2022-05-24T09:17:39] Removing 40376165 items
[2022-05-24T09:18:25] Removed 182649 items (0.45%) in 45.393 secs (4023.714 items/sec) 9989 secs remaining ...
[2022-05-24T09:19:06] Removed 689305 items (1.71%) in 86.576 secs (7961.866 items/sec) 4984 secs remaining ...
[2022-05-24T09:19:47] Removed 1225742 items (3.04%) in 127.618 secs (9604.773 items/sec) 4076 secs remaining ...
[2022-05-24T09:20:28] Removed 1704303 items (4.22%) in 168.277 secs (10127.956 items/sec) 3818 secs remaining ...
[2022-05-24T09:21:05] Removed 2118937 items (5.25%) in 205.911 secs (10290.566 items/sec) 3717 secs remaining ...
[2022-05-24T09:21:51] Removed 2444324 items (6.05%) in 251.843 secs (9705.760 items/sec) 3908 secs remaining ...
[2022-05-24T09:22:30] Removed 2797891 items (6.93%) in 291.149 secs (9609.825 items/sec) 3910 secs remaining ...
[2022-05-24T09:23:12] Removed 3110872 items (7.70%) in 332.336 secs (9360.616 items/sec) 3981 secs remaining ...
^Csrun: interrupt (one more within 1 sec to abort)
srun: StepId=112172.0 tasks 0-215: running
^Csrun: sending Ctrl-C to StepId=112172.0
srun: forcing job termination
srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
slurmstepd: error: *** STEP 112172.0 ON slic13 CANCELLED AT 2022-05-24T10:11:50 ***

mpifileutils-0.11.1-0.11.1-2.tce.ch6.x86_64

jbd commented 2 years ago

Hello,

It looks like the same thing is happening with dsync on my side with release 0.11.1:

[2022-05-24T21:56:55] Copied 7.226 TiB (46%) in 1825.398 secs (4.053 GiB/s) 2169 secs left ...
[2022-05-24T21:57:16] Copied 7.305 TiB (46%) in 1845.892 secs (4.052 GiB/s) 2150 secs left ...
[2022-05-24T21:57:34] Copied 7.388 TiB (47%) in 1863.928 secs (4.059 GiB/s) 2125 secs left ...
[2022-05-24T21:57:53] Copied 7.448 TiB (47%) in 1883.537 secs (4.049 GiB/s) 2115 secs left ...
[2022-05-24T21:58:10] Copied 7.534 TiB (48%) in 1900.580 secs (4.059 GiB/s) 2088 secs left ...
[2022-05-24T21:58:26] Copied 7.598 TiB (48%) in 1916.158 secs (4.060 GiB/s) 2071 secs left ...
# output stops, dsync continues

dsync is still running and doing its work on the MPI nodes.

This is the strace output on the node launching the dsync command through mpirun:

strace: Process 1429782 attached with 4 threads
[pid 1429784] select(16, [14<socket:[76548666]> 15<pipe:[76548667]>], NULL, NULL, {tv_sec=1, tv_usec=175359} <unfinished ...>
[pid 1429785] select(19, [17<socket:[76548668]> 18<pipe:[76548669]>], NULL, NULL, {tv_sec=3578, tv_usec=804083} <unfinished ...>
[pid 1429783] epoll_wait(10<anon_inode:[eventpoll]>,  <unfinished ...>
[pid 1429782] restart_syscall(<... resuming interrupted restart_syscall ...> <unfinished ...>
[pid 1429784] <... select resumed>)     = 0 (Timeout)
[pid 1429784] select(16, [14<socket:[76548666]> 15<pipe:[76548667]>], NULL, NULL, {tv_sec=2, tv_usec=0}) = 0 (Timeout)
[pid 1429784] select(16, [14<socket:[76548666]> 15<pipe:[76548667]>], NULL, NULL, {tv_sec=2, tv_usec=0}) = 0 (Timeout)
[pid 1429784] select(16, [14<socket:[76548666]> 15<pipe:[76548667]>], NULL, NULL, {tv_sec=2, tv_usec=0}) = 0 (Timeout)
[pid 1429784] select(16, [14<socket:[76548666]> 15<pipe:[76548667]>], NULL, NULL, {tv_sec=2, tv_usec=0}) = 0 (Timeout)
[pid 1429784] select(16, [14<socket:[76548666]> 15<pipe:[76548667]>], NULL, NULL, {tv_sec=2, tv_usec=0}) = 0 (Timeout)
[pid 1429784] select(16, [14<socket:[76548666]> 15<pipe:[76548667]>], NULL, NULL, {tv_sec=2, tv_usec=0}^Cstrace: Process 1429782 detached
strace: Process 1429783 detached
strace: Process 1429784 detached
 <detached ...>
strace: Process 1429785 detached
jbd commented 2 years ago

After some time, all data has been copied. The dsync processes on the MPI nodes are then not doing anything useful, spinning at 100% CPU usage:

[pid 715200] epoll_wait(26<anon_inode:[eventpoll]>, [], 16, 0) = 0
[pid 715200] epoll_wait(31<anon_inode:[eventpoll]>, [], 16, 0) = 0
[pid 715200] epoll_wait(26<anon_inode:[eventpoll]>, [], 16, 0) = 0
[pid 715200] epoll_wait(31<anon_inode:[eventpoll]>, [], 16, 0) = 0
[pid 715200] epoll_wait(26<anon_inode:[eventpoll]>, [], 16, 0) = 0
[pid 715200] epoll_wait(31<anon_inode:[eventpoll]>, [], 16, 0) = 0
[pid 715200] epoll_wait(26<anon_inode:[eventpoll]>, [], 16, 0) = 0
[pid 715200] epoll_wait(31<anon_inode:[eventpoll]>, [], 16, 0) = 0
[pid 715200] epoll_wait(26<anon_inode:[eventpoll]>, [], 16, 0) = 0

Timestamps are not applied on the destination, so files already present on the destination are removed on a subsequent dsync pass.

adammoody commented 9 months ago

Migrating conversation about missing progress messages from: https://github.com/hpc/mpifileutils/issues/560

The missing progress messages are a known problem.

These progress messages are implemented with a tree-based communication pattern across ranks, built on point-to-point messages. Every 10 seconds or so, rank 0 initiates a message down the tree to request progress. Each intermediate rank receives that request message from its parent and forwards it down to its children. When the request message reaches the leaves, those ranks respond and send their current progress status back up the tree. Intermediate ranks receive that data, merge in their own progress status, and respond back up the tree. Rank 0 receives the messages from its children, merges its own data, and prints the final result. These reductions are asynchronous: ranks periodically check for and respond to these messages as they come in.
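
To make that pattern concrete, here is a minimal, self-contained sketch of one request/response round over a binary tree using plain point-to-point messages. It is only an illustration of the scheme described above, not the mpiFileUtils code; the real implementation is asynchronous (ranks poll for these messages in between work items), whereas this sketch uses blocking sends and receives to stay short.

/* One progress round over a binary tree: rank 0 pushes a request down,
 * each rank adds its own counter to its children's replies and sends the
 * partial sum back up.  Build with: mpicc progress_tree.c -o progress_tree */
#include <mpi.h>
#include <stdio.h>
#include <stdint.h>

#define TAG_REQUEST 1
#define TAG_REPLY   2

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank, ranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ranks);

    /* binary tree layout: one parent, up to two children */
    int parent   = (rank - 1) / 2;
    int child[2] = { 2 * rank + 1, 2 * rank + 2 };

    /* stand-in for "items this rank has processed so far" */
    uint64_t local_done = (uint64_t)(rank + 1);

    /* request travels down the tree ... */
    char req = 1;
    if (rank != 0) {
        MPI_Recv(&req, 1, MPI_CHAR, parent, TAG_REQUEST,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    for (int i = 0; i < 2; i++) {
        if (child[i] < ranks) {
            MPI_Send(&req, 1, MPI_CHAR, child[i], TAG_REQUEST, MPI_COMM_WORLD);
        }
    }

    /* ... and the reduced progress travels back up */
    uint64_t sum = local_done;
    for (int i = 0; i < 2; i++) {
        if (child[i] < ranks) {
            uint64_t part;
            MPI_Recv(&part, 1, MPI_UINT64_T, child[i], TAG_REPLY,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += part;
        }
    }
    if (rank == 0) {
        printf("progress: %llu items done\n", (unsigned long long)sum);
    } else {
        MPI_Send(&sum, 1, MPI_UINT64_T, parent, TAG_REPLY, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

Even this toy version shows the failure mode reported above: if any rank never reaches its receive because it is stuck in a long I/O call, the partial sums never make it back to rank 0 and nothing is printed.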

I think what probably happens when these progress messages stall is that some rank (any rank) gets stuck in an I/O call. Because it is blocked in the I/O call, it doesn't respond to MPI messages, which then blocks the progress tree. If that's what's happening, one solution would be to move the I/O calls into a pthread, or maybe use async I/O routines, so that the process can still make MPI calls while waiting on its current I/O operation to complete.
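
As a rough sketch of the pthread idea (an assumption about how it could be structured, not existing mfu code), the blocking call, here unlink(), could run on a worker thread while the main thread keeps servicing MPI. The helper service_progress_messages() is a hypothetical stand-in for whatever routine answers the progress tree; MPI would need to be initialized with at least MPI_THREAD_FUNNELED, since only the main thread makes MPI calls here.

/* Run the blocking I/O on a worker thread so the main thread can keep
 * answering progress messages.  Build as part of an MPI program with
 * mpicc -pthread (MPI initialized with MPI_THREAD_FUNNELED or higher). */
#include <mpi.h>
#include <pthread.h>
#include <stdatomic.h>
#include <unistd.h>

struct unlink_job {
    const char* path;
    atomic_int  done;
    int         rc;
};

static void* unlink_worker(void* arg)
{
    struct unlink_job* job = arg;
    job->rc = unlink(job->path);      /* may block for a long time */
    atomic_store(&job->done, 1);
    return NULL;
}

/* hypothetical placeholder for the real progress handler: check for a
 * pending request from the parent rank and reply with the local counter */
static void service_progress_messages(void)
{
    int flag = 0;
    MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
               &flag, MPI_STATUS_IGNORE);
    (void)flag;   /* a real handler would receive and respond here */
}

int remove_one(const char* path)
{
    struct unlink_job job = { .path = path, .rc = -1 };
    atomic_init(&job.done, 0);

    pthread_t tid;
    pthread_create(&tid, NULL, unlink_worker, &job);

    /* keep MPI responsive while the I/O call is in flight */
    while (!atomic_load(&job.done)) {
        service_progress_messages();
        usleep(1000);
    }

    pthread_join(tid, NULL);
    return job.rc;
}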

adammoody commented 9 months ago

From @adilger :

Is it possible to set a timeout on the MPI broadcast that doesn't also result in a total MPI failure? Since the status updates are somewhat advisory, unlike an HPC calculation, getting a partial status is probably fine.

adammoody commented 9 months ago

Yes, I think timeouts on the reduction operation could also work. We'd need to pair that with sequence numbers in the messages, or internal counters, so that a parent can distinguish late-arriving responses from its children.

It might also get strange, since the progress "sum" would be missing any results from the subtree rooted at the blocked rank. If such stalls are frequent, the reported progress would appear to bounce up and down during the run as random ranks come and go. That might be countered by caching the last reported value from each child. I'd have to check whether that makes sense for the various reduction ops...
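
Below is a sketch of what one round of such a timed reduction could look like on a parent rank, assuming each reply carries a sequence number and the parent caches the last value seen from each child. The function name reduce_with_timeout, the tag, and the struct layout are made up for the example; this is one possible shape for the fix, not existing code in either library.

/* One timed reduction round on a parent rank: wait up to TIMEOUT_SEC for
 * each child's reply; fall back to the cached value for a child that is
 * late or whose reply belongs to an older round. */
#include <mpi.h>
#include <assert.h>
#include <stdint.h>

#define TAG_REPLY   2
#define MAX_CHILD   8
#define TIMEOUT_SEC 5.0

struct reply {
    uint64_t seq;    /* which reduction round this reply belongs to */
    uint64_t items;  /* progress counter summed over that subtree */
};

uint64_t reduce_with_timeout(uint64_t seq, uint64_t local_items,
                             const int* children, int nchildren,
                             uint64_t* cached, MPI_Comm comm)
{
    assert(nchildren <= MAX_CHILD);

    struct reply replies[MAX_CHILD];
    MPI_Request  reqs[MAX_CHILD];
    for (int i = 0; i < nchildren; i++) {
        MPI_Irecv(&replies[i], sizeof(struct reply), MPI_BYTE,
                  children[i], TAG_REPLY, comm, &reqs[i]);
    }

    uint64_t sum = local_items;
    double deadline = MPI_Wtime() + TIMEOUT_SEC;

    for (int i = 0; i < nchildren; i++) {
        int done = 0;
        while (!done && MPI_Wtime() < deadline) {
            MPI_Test(&reqs[i], &done, MPI_STATUS_IGNORE);
        }
        if (done && replies[i].seq == seq) {
            cached[i] = replies[i].items;   /* fresh answer for this round */
        } else if (!done) {
            /* child is stalled: give up for this round; a real version
             * would likely keep the receive posted and match the late
             * reply by its sequence number in a later round */
            MPI_Cancel(&reqs[i]);
            MPI_Wait(&reqs[i], MPI_STATUS_IGNORE);
        }
        /* a completed but stale reply (older seq) is simply ignored and
         * the cached value is reused */
        sum += cached[i];
    }
    return sum;
}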

There may be another advantage to moving the I/O to a thread. For algorithms that use libcircle for work stealing, work on blocked ranks could be moved to other ranks faster. Having said that, changing all of the tools to move I/O to a separate thread is a bigger overhaul.

adammoody commented 9 months ago

Oh, wait. I'm describing the point-to-point reduction in libcircle:

https://github.com/hpc/libcircle/blob/406a0464795ef5db783971e578a798707ee79dc8/libcircle/token.c#L97

There is a separate progress reduction algorithm in mpiFileUtils, and I forgot that it uses non-blocking MPI collectives for that:

https://github.com/hpc/mpifileutils/blob/main/src/common/mfu_progress.c

That'd probably first need to be switched over to a point-to-point method (like in libcircle) to add timeouts.
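
For reference, the general shape of a progress loop built on a non-blocking collective is roughly the sketch below (an illustration of the pattern, not a copy of mfu_progress.c). It also shows why one stalled rank silences the output and why a timeout is awkward to bolt on: until every rank has entered the MPI_Ireduce for the current round, MPI_Test on rank 0 never reports completion, so no line is printed, and there is no per-child message to give up on.

/* Illustration only: a periodic non-blocking reduction of a progress
 * counter, polled between work items so it never blocks. */
#include <mpi.h>
#include <stdio.h>
#include <stdint.h>

static MPI_Request progress_req = MPI_REQUEST_NULL;
static uint64_t local_count;    /* items this rank has processed */
static uint64_t global_count;   /* reduced total, only valid on rank 0 */

/* Call between work items.  Alternates between starting a reduction
 * round and testing the one in flight.  A rank stuck in an I/O call
 * simply never joins the round, and the round cannot complete. */
void progress_poll(int rank, MPI_Comm comm)
{
    if (progress_req == MPI_REQUEST_NULL) {
        MPI_Ireduce(&local_count, &global_count, 1, MPI_UINT64_T,
                    MPI_SUM, 0, comm, &progress_req);
        return;
    }

    int done = 0;
    MPI_Test(&progress_req, &done, MPI_STATUS_IGNORE);   /* never blocks */
    if (done && rank == 0) {
        printf("Removed %llu items ...\n", (unsigned long long)global_count);
    }
    /* on completion MPI_Test resets progress_req to MPI_REQUEST_NULL,
     * so the next call starts the next round */
}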