cmharr opened 2 years ago
Hello,
it looks like the same is happening with dsync on my side with release 0.11.1:
[2022-05-24T21:56:55] Copied 7.226 TiB (46%) in 1825.398 secs (4.053 GiB/s) 2169 secs left ...
[2022-05-24T21:57:16] Copied 7.305 TiB (46%) in 1845.892 secs (4.052 GiB/s) 2150 secs left ...
[2022-05-24T21:57:34] Copied 7.388 TiB (47%) in 1863.928 secs (4.059 GiB/s) 2125 secs left ...
[2022-05-24T21:57:53] Copied 7.448 TiB (47%) in 1883.537 secs (4.049 GiB/s) 2115 secs left ...
[2022-05-24T21:58:10] Copied 7.534 TiB (48%) in 1900.580 secs (4.059 GiB/s) 2088 secs left ...
[2022-05-24T21:58:26] Copied 7.598 TiB (48%) in 1916.158 secs (4.060 GiB/s) 2071 secs left ...
# output stops, dsync continues
dsync is still running and doing its thing on mpi nodes.
This is the strace output on the node launching the dsync command through mpirun:
strace: Process 1429782 attached with 4 threads
[pid 1429784] select(16, [14<socket:[76548666]> 15<pipe:[76548667]>], NULL, NULL, {tv_sec=1, tv_usec=175359} <unfinished ...>
[pid 1429785] select(19, [17<socket:[76548668]> 18<pipe:[76548669]>], NULL, NULL, {tv_sec=3578, tv_usec=804083} <unfinished ...>
[pid 1429783] epoll_wait(10<anon_inode:[eventpoll]>, <unfinished ...>
[pid 1429782] restart_syscall(<... resuming interrupted restart_syscall ...> <unfinished ...>
[pid 1429784] <... select resumed>) = 0 (Timeout)
[pid 1429784] select(16, [14<socket:[76548666]> 15<pipe:[76548667]>], NULL, NULL, {tv_sec=2, tv_usec=0}) = 0 (Timeout)
[pid 1429784] select(16, [14<socket:[76548666]> 15<pipe:[76548667]>], NULL, NULL, {tv_sec=2, tv_usec=0}) = 0 (Timeout)
[pid 1429784] select(16, [14<socket:[76548666]> 15<pipe:[76548667]>], NULL, NULL, {tv_sec=2, tv_usec=0}) = 0 (Timeout)
[pid 1429784] select(16, [14<socket:[76548666]> 15<pipe:[76548667]>], NULL, NULL, {tv_sec=2, tv_usec=0}) = 0 (Timeout)
[pid 1429784] select(16, [14<socket:[76548666]> 15<pipe:[76548667]>], NULL, NULL, {tv_sec=2, tv_usec=0}) = 0 (Timeout)
[pid 1429784] select(16, [14<socket:[76548666]> 15<pipe:[76548667]>], NULL, NULL, {tv_sec=2, tv_usec=0}^Cstrace: Process 1429782 detached
strace: Process 1429783 detached
strace: Process 1429784 detached
<detached ...>
strace: Process 1429785 detached
After some time, all data has been copied. The dsync processes on the MPI nodes are then not doing anything, busy-looping at 100% CPU usage:
[pid 715200] epoll_wait(26<anon_inode:[eventpoll]>, [], 16, 0) = 0
[pid 715200] epoll_wait(31<anon_inode:[eventpoll]>, [], 16, 0) = 0
[pid 715200] epoll_wait(26<anon_inode:[eventpoll]>, [], 16, 0) = 0
[pid 715200] epoll_wait(31<anon_inode:[eventpoll]>, [], 16, 0) = 0
[pid 715200] epoll_wait(26<anon_inode:[eventpoll]>, [], 16, 0) = 0
[pid 715200] epoll_wait(31<anon_inode:[eventpoll]>, [], 16, 0) = 0
[pid 715200] epoll_wait(26<anon_inode:[eventpoll]>, [], 16, 0) = 0
[pid 715200] epoll_wait(31<anon_inode:[eventpoll]>, [], 16, 0) = 0
[pid 715200] epoll_wait(26<anon_inode:[eventpoll]>, [], 16, 0) = 0
Timestamps are not applied on the destination, so files already present on the destination are removed on a subsequent dsync pass.
Migrating conversation about missing progress messages from: https://github.com/hpc/mpifileutils/issues/560
The missing progress messages are a known problem.
These progress messages are implemented using a tree-based communication pattern across ranks using point-to-point messages. Every 10 seconds or so, rank 0 initiates a message down the tree to request progress. Each intermediate rank receives that request message from its parent and forwards it down to its children. When the request message reaches the leaves, ranks respond and send their current progress status back up the tree. Intermediate ranks receive that data, merge in their own progress status, and respond back up the tree. Rank 0 receives the messages from its children, merges its data, and prints the final result. These reductions are asynchronous. Ranks periodically check for and respond to these messages as they come in.
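The request/response flow described above can be sketched with a small, self-contained simulation. This is illustrative only, not mpifileutils code: the binary tree shape, the parent/children helpers, and the per-rank progress values are all assumptions made for the sketch.

```python
# Hypothetical sketch of a k-ary tree used by a point-to-point progress
# reduction: rank 0 is the root; each rank's parent and children are
# derived from its rank number.

def parent(rank, k=2):
    """Parent of `rank` in a k-ary tree rooted at rank 0 (None for the root)."""
    return None if rank == 0 else (rank - 1) // k

def children(rank, nranks, k=2):
    """Children of `rank`, clipped to the number of ranks in the job."""
    return [c for c in range(rank * k + 1, rank * k + 1 + k) if c < nranks]

def reduce_progress(rank, nranks, local_progress, k=2):
    """Simulate the reduction: merge this rank's progress with its subtree's.

    In the real pattern each child's value arrives as a point-to-point
    message sent up the tree; here we simply recurse to show the data flow.
    """
    total = local_progress[rank]
    for c in children(rank, nranks, k):
        total += reduce_progress(c, nranks, local_progress, k)
    return total

if __name__ == "__main__":
    nranks = 7
    bytes_copied = {r: 100 * (r + 1) for r in range(nranks)}  # fake per-rank progress
    # Rank 0 initiates the request; leaves reply; parents merge and forward up.
    print(reduce_progress(0, nranks, bytes_copied))  # prints 2800, the sum over all ranks
```

The key property is that every intermediate rank must actively receive and forward messages, which is why a single blocked rank can stall the whole reduction.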
I think what probably causes these progress messages to stall is that some rank (any rank) gets stuck in an I/O call. Because it is blocked in the I/O call, it doesn't respond to MPI messages, which then blocks the progress tree. If that's what's happening, one solution would be to move the I/O calls into a pthread, or maybe use async I/O routines, so that the process can still make MPI calls while waiting on its current I/O operation to complete.
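A minimal sketch of the thread idea, using Python's threading module rather than pthreads purely for illustration: the blocking "I/O" runs in a worker thread while the main loop keeps polling (standing in for servicing MPI progress messages). The sleep durations and byte count are made up.

```python
# Illustrative sketch, not mpifileutils code: run the blocking I/O call in
# a worker thread so the main loop can keep servicing progress messages
# (simulated here by incrementing a poll counter) while the I/O is stuck.
import threading
import time

def blocking_io(result):
    time.sleep(0.2)          # stands in for a slow read()/write()
    result["bytes"] = 4096   # pretend 4 KiB were copied

result = {}
t = threading.Thread(target=blocking_io, args=(result,))
t.start()

polls = 0
while t.is_alive():
    polls += 1               # in the real tool: probe/respond to MPI messages
    time.sleep(0.01)
t.join()

ok = polls > 0 and result["bytes"] == 4096
print(ok)  # True: we kept polling while the I/O call was in flight
```

The same structure would apply in C with a pthread doing the I/O and the main thread calling MPI_Test/MPI_Iprobe in a loop.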
From @adilger :
Is it possible to set a timeout on the MPI broadcast that doesn't also result in a total MPI failure? Since the status updates are somewhat advisory, unlike an HPC calculation, getting a partial status is probably fine.
Yes, I think timeouts on the reduction operation could also work. I think we'd need to pair that with adding sequence numbers to the messages or internal counters so that a parent can distinguish late-arriving responses from its children.
It might also get strange, since the progress "sum" would be missing any results from the subtree rooted at the blocked rank. The reported progress would appear to bounce up and down during the run if such stalls are frequent as random ranks come and go. That might be countered by caching the last reported value from each child. I'd have to check whether that makes sense for the various reduction ops...
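The sequence-number and caching ideas from the last two comments can be combined in one sketch. This is hypothetical, not mpifileutils code: a parent merging child responses for a given round, ignoring late replies tagged with an old sequence number, and falling back to each child's last reported value so the printed total doesn't bounce down when one subtree is stalled.

```python
# Hypothetical sketch: merge child progress replies with a per-round
# sequence number and a cache of each child's last reported value.

last_seen = {}  # child rank -> last progress value it reported

def merge_with_cache(seq, replies, child_ranks):
    """Sum replies for round `seq`; stale or missing children use the cache.

    `replies` maps child rank -> (seq, value). A reply carrying an old
    sequence number is a late arrival from a previous round; it is ignored
    in favor of the cached value.
    """
    total = 0
    for c in child_ranks:
        reply = replies.get(c)
        if reply is not None and reply[0] == seq:
            last_seen[c] = reply[1]       # fresh response: update the cache
        total += last_seen.get(c, 0)      # otherwise reuse the cached value
    return total

# Round 1: both children respond.
assert merge_with_cache(1, {1: (1, 100), 2: (1, 200)}, [1, 2]) == 300
# Round 2: child 2 is blocked in I/O; its round-1 value is reused,
# so the reported total doesn't drop to 150.
assert merge_with_cache(2, {1: (2, 150)}, [1, 2]) == 350
print("ok")
```

Whether cached values make sense would still need to be checked per reduction op, as noted above; caching is natural for monotonic sums like bytes copied, less obvious for rates.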
There may be another advantage to moving the I/O to a thread. For algorithms that use libcircle for work stealing, work on blocked ranks could be moved to other ranks faster. Having said that, changing all of the tools to move I/O to a separate thread is a bigger overhaul.
Oh, wait. I'm describing the point-to-point reduction in libcircle:
https://github.com/hpc/libcircle/blob/406a0464795ef5db783971e578a798707ee79dc8/libcircle/token.c#L97
There is a separate progress reduction algorithm in mpiFileUtils, and I forgot that it uses non-blocking MPI collectives for that:
https://github.com/hpc/mpifileutils/blob/main/src/common/mfu_progress.c
That'd probably first need to be switched over to a point-to-point method (like in libcircle) to add timeouts.
When running drm on a large directory, I see it behave normally for some time and then cease to report, even though it continues to work. This happened two runs in a row: at the start of the first run it was updating output every 12-14 seconds, then it stopped reporting for over an hour and I manually killed the job. I re-ran the job and, as shown below, it ran for a little while before its output also hung for at least 50 seconds. Note that the job does continue to run even when it's not reporting, as evidenced by monitoring unlink() in ltop.
mpifileutils-0.11.1-0.11.1-2.tce.ch6.x86_64