hpc / mpifileutils

File utilities designed for scalability and performance.
https://hpc.github.io/mpifileutils
BSD 3-Clause "New" or "Revised" License
162 stars 64 forks source link

dsync not updating times correctly #560

Open dvickernasa opened 9 months ago

dvickernasa commented 9 months ago

Hello,

I recently discovered mpifileutils and they are awesome. I moved about 200 TB in 4 million files between two different lustre filesystems over the weekend. Thank you for your work on these tools. After doing an initial data move, my intent was to keep the data in sync for a period of time using rsync. With the bulk of the data transferred, this should go fast but it would also catch any file types that `dsync doesn't handle (hard links, etc.). Here is the script I'm using.

source=/nobackup/PROJECTS/ccdev/
dest=/nobackup2/PROJECTS/ccdev/

mpiexec --allow-run-as-root dsync -X non-lustre $source $dest > dsync-ccdev.out.$( date +%F )

rsync -aPHASx $source $dest > rsync-ccdev.out.$( date +%F )

The problem I'm running into is that rsync is copying files that dync already copied. Here is one example of a file that rsync is trying to transfer again.

Source:

238412051 -rw-r----- 1 dvicker eg3 244210724864 Aug 23 16:38 /nobackup/PROJECTS/ccdev/boeing/aero/adb/ar/ar-204.squashfs

Destination:

238487228 -rw-r----- 1 dvicker eg3 244210724864 Sep 29 17:15 /nobackup2/PROJECTS/ccdev/boeing/aero/adb/ar/ar-204.squashfs

The file size it correct but it looks like the date on the destination is wrong - this is the date that that the file got created. After poking around in the destination more, it looks like this is prevalent - dates were not preserved. This will trip up rsync's default algorithm of checking by date and size. I could switch to rsync's checksum mode to get around this but that will be quite a bit more time consuming for this dataset.

Is this a bug with dsync or am I possibly doing something wrong? I'm using version 0.11.1 of mpifileutils.

dvickernasa commented 9 months ago

Changed the subject since the root problem here is the time stamp info not being preserved.

Here is a more comprehensive example. There is an odd mix of correct and incorrect time stamps in the destination.

[dvicker@monte dsync]$ ls -l /nobackup/PROJECTS/ccdev/spacex/aero/adb/ar/
total 24040207564
  781477845 -rw-r----- 1 dvicker  ccp_spacex   791038201856 Oct 27  2022 ar-108.squashfs
        187 -rw-rw---- 1 dvicker  ccp_spacex         742598 Aug 15 11:12 ar-108.squashfs.toc
 5835780999 -rw-rw---- 1 ema      ccp_spacex  5977433890816 Aug 24 03:22 ar-109.squashfs
       1721 -rw-rw---- 1 ema      ccp_spacex        9413597 Aug 24 03:22 ar-109.squashfs.toc
   30794854 -rw-r----- 1 dvicker  ccp_spacex    25150177280 Oct 29  2022 ar-126.squashfs
         47 -rw-rw---- 1 dvicker  ccp_spacex          89461 Aug 15 11:12 ar-126.squashfs.toc
  561289162 -rw-r----- 1 dvicker  ccp_spacex   568532955136 Oct 29  2022 ar-127.squashfs
        233 -rw-rw---- 1 dvicker  ccp_spacex        1003455 Aug 15 11:12 ar-127.squashfs.toc
 1042993450 -rw-r----- 1 dvicker  ccp_spacex  1056351469568 Oct 29  2022 ar-131.squashfs
       2651 -rw-rw---- 1 dvicker  ccp_spacex       14523295 Aug 15 11:13 ar-131.squashfs.toc
  882730771 -rw-rw---- 1 ema      ccp_spacex   899033808896 Aug 24 11:42 ar-132.squashfs
       4046 -rw-rw---- 1 ema      ccp_spacex       22383607 Aug 24 11:42 ar-132.squashfs.toc
 1777595522 -rw-r----- 1 dvicker  ccp_spacex  1815639597056 Oct 28  2022 ar-133.squashfs
       1721 -rw-rw---- 1 dvicker  ccp_spacex        9352445 Aug 15 11:13 ar-133.squashfs.toc
  101115096 -rw-r----- 1 dvicker  ccp_spacex   102275829760 Oct 28  2022 ar-134.squashfs
         47 -rw-rw---- 1 dvicker  ccp_spacex          93327 Aug 15 11:13 ar-134.squashfs.toc
  753055384 -rw-r----- 1 dvicker  ccp_spacex   771118510080 Oct 30  2022 ar-135.squashfs
        466 -rw-rw---- 1 dvicker  ccp_spacex        2126037 Aug 15 11:13 ar-135.squashfs.toc
  158022760 -rw-r----- 1 dvicker  ccp_spacex   161615790080 Oct 29  2022 ar-136.squashfs
         94 -rw-rw---- 1 dvicker  ccp_spacex         225492 Aug 15 11:13 ar-136.squashfs.toc
11896898702 -rw-r----- 1 dvicker  ccp_spacex 12047965007872 Aug 20 22:27 ar-138.squashfs
      32762 -rw-rw---- 1 dvicker  ccp_spacex       64656431 Aug 20 22:27 ar-138.squashfs.toc
   46078532 -rw-r----- 1 dvicker  ccp_spacex    38567428096 Oct 27  2022 ar-140.squashfs
         47 -rw-rw---- 1 dvicker  ccp_spacex          99596 Aug 15 11:13 ar-140.squashfs.toc
   69136835 -rw-r----- 1 dvicker  ccp_spacex    64205635584 Oct 28  2022 ar-144.squashfs
         47 -rw-rw---- 1 dvicker  ccp_spacex          61816 Aug 15 11:13 ar-144.squashfs.toc
  103192925 -rw-r--r-- 1 dvicker  eg3          105696878592 Aug 22 20:03 Dragon2_RCS_JI_Joe_MSFC.squashfs
        629 -rw-r--r-- 1 dvicker  eg3               3418271 Aug 22 20:03 Dragon2_RCS_JI_Joe_MSFC.squashfs.toc
         36 -rwxr-xr-x 1 aschwing ccp_spacex            203 Feb 27  2019 update_perms.csh*
[dvicker@monte dsync]$ ls -l /nobackup2/PROJECTS/ccdev/spacex/aero/adb/ar/
total 23852316152
  772498852 -rw-r----- 1 dvicker  ccp_spacex   791038201856 Oct 27  2022 ar-108.squashfs
        728 -rw-rw---- 1 dvicker  ccp_spacex         742598 Sep 29 22:48 ar-108.squashfs.toc
 5837342120 -rw-rw---- 1 ema      ccp_spacex  5977433890816 Aug 24 03:22 ar-109.squashfs
       9196 -rw-rw---- 1 ema      ccp_spacex        9413597 Sep 29 22:03 ar-109.squashfs.toc
   24560752 -rw-r----- 1 dvicker  ccp_spacex    25150177280 Sep 29 20:41 ar-126.squashfs
         88 -rw-rw---- 1 dvicker  ccp_spacex          89461 Aug 15 11:12 ar-126.squashfs.toc
  555208412 -rw-r----- 1 dvicker  ccp_spacex   568532955136 Sep 29 23:53 ar-127.squashfs
        980 -rw-rw---- 1 dvicker  ccp_spacex        1003455 Sep 30 00:05 ar-127.squashfs.toc
 1031594016 -rw-r----- 1 dvicker  ccp_spacex  1056351469568 Oct 29  2022 ar-131.squashfs
      14184 -rw-rw---- 1 dvicker  ccp_spacex       14523295 Sep 29 17:44 ar-131.squashfs.toc
  877963384 -rw-rw---- 1 ema      ccp_spacex   899033808896 Sep 29 19:22 ar-132.squashfs
      21860 -rw-rw---- 1 ema      ccp_spacex       22383607 Sep 30 00:14 ar-132.squashfs.toc
 1773086904 -rw-r----- 1 dvicker  ccp_spacex  1815639597056 Sep 29 20:43 ar-133.squashfs
       9136 -rw-rw---- 1 dvicker  ccp_spacex        9352445 Aug 15 11:13 ar-133.squashfs.toc
   99878836 -rw-r----- 1 dvicker  ccp_spacex   102275829760 Sep 30 00:08 ar-134.squashfs
         92 -rw-rw---- 1 dvicker  ccp_spacex          93327 Aug 15 11:13 ar-134.squashfs.toc
  753046000 -rw-r----- 1 dvicker  ccp_spacex   771118510080 Oct 30  2022 ar-135.squashfs
       2080 -rw-rw---- 1 dvicker  ccp_spacex        2126037 Sep 29 22:02 ar-135.squashfs.toc
  157828056 -rw-r----- 1 dvicker  ccp_spacex   161615790080 Sep 29 23:49 ar-136.squashfs
        224 -rw-rw---- 1 dvicker  ccp_spacex         225492 Sep 29 17:19 ar-136.squashfs.toc
11765599492 -rw-r----- 1 dvicker  ccp_spacex 12047965007872 Aug 20 22:27 ar-138.squashfs
      63144 -rw-rw---- 1 dvicker  ccp_spacex       64656431 Sep 29 23:49 ar-138.squashfs.toc
   37663536 -rw-r----- 1 dvicker  ccp_spacex    38567428096 Sep 29 16:54 ar-140.squashfs
        100 -rw-rw---- 1 dvicker  ccp_spacex          99596 Sep 29 22:03 ar-140.squashfs.toc
   62700864 -rw-r----- 1 dvicker  ccp_spacex    64205635584 Oct 28  2022 ar-144.squashfs
         64 -rw-rw---- 1 dvicker  ccp_spacex          61816 Sep 30 02:00 ar-144.squashfs.toc
  103219708 -rw-r--r-- 1 dvicker  eg3          105696878592 Sep 30 02:26 Dragon2_RCS_JI_Joe_MSFC.squashfs
       3340 -rw-r--r-- 1 dvicker  eg3               3418271 Sep 29 16:46 Dragon2_RCS_JI_Joe_MSFC.squashfs.toc
          4 -rwxr-xr-x 1 aschwing ccp_spacex            203 Sep 29 16:46 update_perms.csh*
[dvicker@monte dsync]$ 
adammoody commented 9 months ago

Hi, @dvickernasa . Great to hear mpiFileUtils has been helpful. Thanks.

dsync should be setting the atime and mtime on the destination files to match the source file. However, it does that as the very last step, after all files have been copied. I wonder if the job might have been interrupted or hit a fatal error before it could complete that step.

Do you see any such errors or early termination from your job output log?

Tip: When copying a large number of files, I like to recommend using the --batch-files option, as in:

dsync --batch-files 100000 ...

This changes dsync to copy files in batches rather than all files at once. It sets the timestamps at the completion of each batch. If the job is interrupted, one can then execute the same dsync command again. dsync skips copying any files that were completed in earlier runs. By default, it avoids copying any file whose destination exists and has a matching file type, size, and mtime.

dvickernasa commented 9 months ago

That is possible. Here is the end of the output from the dsync run.

[2023-09-30T08:19:28] Syncing directory updates to disk.
[2023-09-30T08:19:33] Sync completed in 4.942 seconds. 
[2023-09-30T08:19:33] Started: Sep-29-2023,16:34:18
[2023-09-30T08:19:33] Completed: Sep-30-2023,08:19:33
[2023-09-30T08:19:33] Seconds: 56715.151
[2023-09-30T08:19:33] Items: 1076815
[2023-09-30T08:19:33]   Directories: 35829
[2023-09-30T08:19:33]   Files: 1037753
[2023-09-30T08:19:33]   Links: 3233
[2023-09-30T08:19:33] Data: 48.152 TiB (52944066031967 bytes)
[2023-09-30T08:19:33] Rate: 890.263 MiB/s (52944066031967 bytes in 56715.151 seconds)
[2023-09-30T08:19:33] Updating timestamps on newly copied files
[2023-09-30T08:19:33] Completed updating timestamps
[2023-09-30T08:19:33] Completed sync
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[49249,1],5]
  Exit code:    1
--------------------------------------------------------------------------   

I'll start using the --batch-files option, thanks for the tip.

I'm attempting to recover from this using an rsync command to update the time stamps.

rsync -rv --size-only --times $source $dest >> rsync-ccdev.out.times.$( date +%F )

But its going fairly slow. Woud just repeating the dsync update the timestamps?

dvickernasa commented 9 months ago

Actually, the data I copied was done with two separate dsync runs since they were in two different areas. I've attached the output from both runs. The ccdev run showed the error, but the cev-repo run did not. But the destinations for both runs have files with incorrect time stamps. I will attached the full output of both runs.

Any recommendations for fixing the time stamps would be appreciated. The rsync command I posted above should do it, but its slow. I'm thinking dsync won't work since the output contains Updating timestamps on newly copied files. That makes it sound like dsync won't update the timestamps on the files that were previously copied.

dsync-ccdev.out.2023-09-29.txt dsync-cev-repo.out.2023-09-29.txt

dvickernasa commented 9 months ago

I've been doing some tests this afternoon using a smaller dataset. Basically did this:

Even though no actual data was transferred for the 2nd dsync (I think, the 2nd transfer went fast), the time stamps in the destination got updated to their source time stamps. I used rsync after the 2nd dsync to see if it found any files it wanted to transfer again and it did not, indicating that the size and times of all files did indeed get updated.

I've restarted the dsync's of my original, large datasets so we'll see what happens.

adammoody commented 9 months ago

Ok, thanks for the update. I'm still puzzled as to why the timestamps aren't being set correctly the first time.

How many nodes and processes are you using?

adammoody commented 9 months ago

Does it do a better job at setting timestamps the first time if you add the --direct option?

adilger commented 9 months ago

@adammoody is there a reason not to set the timestamps on a file immediately after it is finished processing? That would avoid the risk of missing them being set, and also reduces overhead on the filesystem because it doesn't need to do another set of filename lookups on all of the files.

It is understandable that the timestamps on the directories should be deferred until all of the entries in the directory are updated (and also to avoid setting the directory timestamp millions of times for a large directory), but that is not the case for regular files.

At worst, if a large file is being written by multiple MPI ranks concurrently, the timestamp might be set multiple times, but that is fine given it only happens for very large files, and one of the threads will be the last one writing and will reset the timestamp when it is finished.

Separately, is the "--batch-files" option mainly to work around the lack of timestamp updates on files, or are there other reasons to use it? Should this be set to eg. 10000, or 100 x mpi_ranks, or similar by default? That would increase the directory timestamp updates marginally, but is unlikely to hurt overall transfer performance significantly (I guess some amount of straggler/idle thread overhead for each batch).

dvickernasa commented 9 months ago

How many nodes and processes are you using?

I'm using 16 cores on a single workstation for these transfers.

Does it do a better job at setting timestamps the first time if you add the --direct option?

In my small tests, I wasn't using --direct and the times were getting set properly. I verified by following up with an rsync, which didn't copy anything new. I can try a big transfer with --direct next time and see if that works better.

The 2nd dsync for the large dataset just went OK. The intent here was just to correct the times. For some reason, this happened:

[2023-10-03T18:42:03] Comparing file sizes and modification times of 3093774 items
[2023-10-03T18:42:03] Started   : Oct-03-2023, 18:42:01
[2023-10-03T18:42:03] Completed : Oct-03-2023, 18:42:03
[2023-10-03T18:42:03] Seconds   : 2.043
[2023-10-03T18:42:03] Items     : 3093774
[2023-10-03T18:42:03] Item Rate : 3093774 items in 2.042868 seconds (1514426.851390 items/sec)
[2023-10-03T18:42:03] Deleting items from destination
[2023-10-03T18:42:03] Removing 1740489 items ...

<cut>

[2023-10-03T18:46:44] Removed 1676960 items (96.35%) in 280.076 secs (5987.513 items/sec) 10 secs remaining ...
[2023-10-03T18:46:54] Removed 1739691 items (99.95%) in 290.075 secs (5997.386 items/sec) 0 secs remaining ...
[2023-10-03T18:46:54] Removed 1740489 items (100.00%) in 290.399 secs (5993.446 items/sec) done
[2023-10-03T18:46:54] Removed 1740489 items in 290.457 seconds (5992.237 items/sec)
[2023-10-03T18:46:54] Copying items to destination
[2023-10-03T18:46:54] Copying to /nobackup2/PROJECTS/orion/cev-repo
[2023-10-03T18:46:54] Items: 1740490
[2023-10-03T18:46:54]   Directories: 0
[2023-10-03T18:46:54]   Files: 1740490
[2023-10-03T18:46:54]   Links: 0
[2023-10-03T18:46:54] Data: 94.113 TiB (56.699 MiB per file)

The source data is almost completely static so I'm not sure why dsync decided it needed to delete and recopy about half of the data. Perhaps the incorrect dates from the first run threw it off? Its hard to tell for sure but it looked like there were far fewer than 50% of the files with wrong dates in the destination from the 1st dsync. Any thoughts on this mass deletion would be welcome.

adammoody commented 9 months ago

Regarding all of those deletes, I think that second dsync detected the mismatch in mtime between the source and destination files. It then assumes the files are different, it deletes the destination files, and it re-copies those files from the source. It would obviously have been smoother if the first dsync properly set the mtimes.

From your output logs, I don't see the types of errors that I'd expect to see if the job had been interrupted. It looks to print the message that it has successfully updated all timestamps. It looks buggy to me that it somehow failed.

One guess is that perhaps a trailing write came in to the file after the timestamp update had been processed on the metadata server. That shouldn't be happening, but maybe it is. I don't think that would happen either for a single node run or when using --direct which uses O_DIRECT, which is what looking for with those questions.

I'll also run some tests to see if I can reproduce that locally.

adammoody commented 9 months ago

Thank you for following along and for your suggestions, @adilger .

Yes, we currently set the timestamps once at the end to handle files that are split among multiple ranks. We don't have communication to know when all ranks are done with a given file, so instead we wait until all ranks are done with all files.

The --batch-files option is there to add intermediate global sync points for completing files early. It may provide some secondary benefit to sync procs up more frequently and perhaps improve work balance, but it's mainly there to set timestamps before waiting until the very end.

Picking the best batch size is not well defined. It effectively acts like a checkpoint operation, so it makes sense to pick a batch size based on the expected runtime and failure rate. It also depends on amount of data and the file sizes. I suggested 100k here since there were 4m files in total, so 100k just seemed like a decent checkpoint frequency. Pretty arbitrary...

I like the idea of setting the timestamp multiple times from multiple ranks. That does add extra load on the metadata server, but that's probably not a huge hit.

However, a remaining challenge is to figure out how to handle setting other properties like ownership, permission bits, and extended attributes. We set those at the end, and then we set the timestamps as the very last step. This ensures that the mtime on the destination only matches the source after the file has been fully copied (all data and attributes).

I know some properties can't be set too early, since for example, some of those can disable write access to the destination file.

And I think we'd need to consider the case where dsync might be interrupted and leaves a destination file with the correct size and mtime before all ranks had actually completed their writes. (The file size can be correct since the last bytes in the file might actually be written first.) We could perhaps write to a temporary file and rename to handle that, but then I think we run into a similar problem of knowing when it's time to do the rename.

Have more insight into those?

dvickernasa commented 9 months ago

Regarding all of those deletes, I think that second dsync detected the mismatch in mtime between the source and destination files. It then assumes the files are different, it deletes the destination files, and it re-copies those files from the source. It would obviously have been smoother if the first dsync properly set the mtimes.

Given this, I think @adilger's suggestion of writting the times right after the file write makes a lot of sense if its not too much of a performance hit. If dsync's behavior is to delete the file and start over again if the times don't match, it would be very beneficial to get the times updated ASAP so that if the transfer is interrupted, the next dsync can avoid as much re-transfer as possible.

One other possibility is implementing rsync's --partial behavior. If that is turned on, the part of the file that is transferred is kept and subsequent rsync runs pick up where the the previous left off. Of course, this would also require implementing the delta-transfer algorithm and I have no idea how much effort that would entail.

adilger commented 9 months ago

One option for detecting partial file transfer would be to check the blocks allocated to the file against the source file. If they are grossly smaller on the target, then it is clearly not finished copying the interior.

Alternately (and this might be better for performance reasons as well) you could make the work distribution function smarter and have only a single rank per OST object and copy the objects linearly to avoid holes. With modern Lustre, large O_DIRECT writes are much faster than buffered writes 20GB/s vs. 2GB/s).

Also, I imagine you only split a file into multiple parts to copy in parallel once it is over 1GiB in allocated space, so the cost of doing an MPI comm to the other ranks is pretty tiny compared to the file data moved, so making the client do multiple locks/opens for these files is probably more work in total.

dvickernasa commented 9 months ago

Sorry, I know this isn't directly related to this issue, but I have a question about dsync's behavior. Here is an example of the end of a transfer.

[2023-10-03T19:32:44] Copied 2.673 TiB (66%) in 2738.947 secs (0.999 GiB/s) 1438 secs left ...
[2023-10-03T19:32:51] Copied 2.684 TiB (66%) in 2746.006 secs (1.001 GiB/s) 1424 secs left ...
[2023-10-03T19:33:04] Copied 2.695 TiB (66%) in 2759.073 secs (1.000 GiB/s) 1414 secs left ...
[2023-10-03T19:33:14] Copied 2.707 TiB (66%) in 2769.405 secs (1.001 GiB/s) 1401 secs left ...
[2023-10-03T19:33:24] Copied 2.719 TiB (67%) in 2779.124 secs (1.002 GiB/s) 1388 secs left ...
[2023-10-03T19:33:34] Copied 2.730 TiB (67%) in 2789.049 secs (1.002 GiB/s) 1376 secs left ...
[2023-10-03T19:33:49] Copied 2.738 TiB (67%) in 2804.159 secs (1.000 GiB/s) 1370 secs left ...
[2023-10-03T19:34:04] Copied 2.755 TiB (68%) in 2818.953 secs (1.001 GiB/s) 1353 secs left ...
[2023-10-03T19:34:15] Copied 2.766 TiB (68%) in 2829.957 secs (1.001 GiB/s) 1341 secs left ...
[2023-10-03T19:34:21] Copied 2.777 TiB (68%) in 2836.497 secs (1.002 GiB/s) 1328 secs left ...
[2023-10-03T19:34:39] Copied 2.788 TiB (68%) in 2854.734 secs (1.000 GiB/s) 1320 secs left ...
[2023-10-03T19:34:48] Copied 2.802 TiB (69%) in 2863.235 secs (1.002 GiB/s) 1303 secs left ...
[2023-10-03T19:35:08] Copied 2.813 TiB (69%) in 2883.456 secs (0.999 GiB/s) 1295 secs left ...
[2023-10-03T20:09:39] Copied 2.831 TiB (69%) in 4954.226 secs (599.206 MiB/s) 2180 secs left ...
[2023-10-03T20:09:39] Copied 4.077 TiB (100%) in 4954.227 secs (862.846 MiB/s) done
[2023-10-03T20:09:39] Copy data: 4.077 TiB (4482383360736 bytes)
[2023-10-03T20:09:39] Copy rate: 862.846 MiB/s (4482383360736 bytes in 4954.227 seconds)
[2023-10-03T20:09:39] Syncing data to disk.
[2023-10-03T20:09:45] Sync completed in 6.131 seconds.

Note that the percentage is progressing along fairly linearly, then suddenly jumps a lot and finishes much earlier than predicted. I see this happen a lot. Do you know what's going on there?

My 2nd dsync of the large dataset finished and I have an rsync running to verify it doesn't find anything else to transfer. Its going to take awhile for that to finish but its been running most of the day and it hasn't found anything yet, which is a good sign. My original rsync verification starting finding files it wanted to transfer almost right away.

adammoody commented 9 months ago

The missing progress messages is a known problem. This is also noted in https://github.com/hpc/mpifileutils/issues/530

These progress messages are implemented using a tree-based communication pattern across ranks using point-to-point messages. Every 10 seconds or so, rank 0 initiates a message down the tree to request progress. Each intermediate rank receives that request message from its parent and forwards it down to its children. When the request message reaches the leaves, ranks respond and send their current progress status back up the tree. Intermediate ranks receive that data, merge in their own progress status, and respond back up the tree. Rank 0 receives the messages from its children, merges its data, and prints the final result. These reductions are asynchronous. Ranks periodically check for and respond to these messages as they come in.

I think what probably happens for these progress messages to get stalled is that some rank (any rank) gets stuck in an I/O call. Because it is blocked in the I/O call, it doesn't respond to MPI messages, which then blocks the progress tree. If that's what's happening, one solution would be to move the I/O calls into a pthread or maybe use async I/O routines so that the process can still make MPI calls while waiting on its current I/O operation to complete.

adilger commented 9 months ago

Is it possible to set a timeout on the MPI broadcast that doesn't also result in a total MPI failure? Since the status updates are somewhat advisory, unlike an HPC calculation, getting a partial status is probably fine.

adammoody commented 9 months ago

Yes, I think timeouts on the reduction operation could also work. I think we'd need to pair that with adding sequence numbers to the messages or internal counters so that a parent can distinguish late-arriving responses from its children.

It might also get strange, since the progress "sum" would be missing any results from the subtree rooted at the blocked rank. The reported progress would appear to bounce up and down during the run if such stalls are frequent as random ranks come and go. That might be countered by caching the last reported value from each child. I'd have to check whether that makes sense for the various reduction ops...

There may be another advantage to moving the I/O to a thread. For algorithms that use libcircle for work stealing, work on blocked ranks could be moved to other ranks faster. Having said that, changing all of the tools to move I/O to a separate thread is a bigger overhaul.

adammoody commented 9 months ago

Oh, wait. I'm describing the point-to-point reduction in libcircle:

https://github.com/hpc/libcircle/blob/406a0464795ef5db783971e578a798707ee79dc8/libcircle/token.c#L97

There is a separate progress reduction algorithm in mpiFileUtils, and I forgot that it uses non-blocking MPI collectives for that:

https://github.com/hpc/mpifileutils/blob/main/src/common/mfu_progress.c

That'd probably first need to be switched over to a point-to-point method (like in libcircle) to add timeouts.

adammoody commented 9 months ago

Since this is getting a bit involved, let's migrate discussion about missing progress messages to issue https://github.com/hpc/mpifileutils/issues/530.

dvickernasa commented 9 months ago

I've got a few more datapoints on setting the times.

First of all, the rsync verification of our 141 TB, 2.7M files dataset is still going. Its been running since 10/5. We closed off the permissions of the top level directory to ensure people can't access the data. So far, rsync has found about a couple dozen files that it wanted to re-transfer.

Over the weekend, we had another few sets of data we needed to transfer, each set containing about 250 GB and 1.7M files. Using a single node we can transfer one set in less than an hour using dsync. However, we found more indications of timestamps getting set incorrectly. This time we kept running dsync. After 2 or 3 times of dsync finding new files to transfer, we could get to the point where dsync itself no longer found new files to transfer. This was still not using --direct. The next time I use dsync, I'll try using --direct.

adammoody commented 9 months ago

To be sure, your latest transfer of 1.7M files saw mismatched timestamps, and this only ran dsync using a single node, right? That info helps to eliminate one theory I had.

And this is a Lustre-to-Lustre transfer, right?

Did you see any errors reported in the output of that first transfer?

dvickernasa commented 9 months ago

Correct - single node and lustre to lustre.

dvickernasa commented 9 months ago

Sorry, to answer your questions about errors, Yes, there were some errors. They did a couple intermediate transfers while the datasets were in use and being modified in an effort to reduce the time it took to transfer the data when the runs were completed.

ERROR: Source file `XXX' shorter than expected 4007 (errno=0 Success)

Although, I found this odd (the source data was smaller not larger), we didn't spend any time digging into this.

One the job were done and the data was static, the transfer did show just a couple of errors.

ERROR: Failed to open output file `XXX' (errno=13 Permission denied)
ERROR: Failed to change ownership on `XXX' lchown() (errno=13 Permission denied)

Which is kind of strange since the source and destination directories were both owned by the person doing the transfer. One more dsync run got all files and finished without errors.

dvickernasa commented 9 months ago

FYI, I started another transfer on Thursday. This one was quite large WRT the number of files (44.7M)

[2023-10-12T14:42:45] Copying to /nobackup2/PROJECTS/orion/gnc/em1data-copy
[2023-10-12T14:42:46] Items: 44656137
[2023-10-12T14:42:46]   Directories: 3593494
[2023-10-12T14:42:46]   Files: 39432055
[2023-10-12T14:42:46]   Links: 1630588
[2023-10-12T14:42:46] Data: 49.134 TiB (1.307 MiB per file)

I used --direct from the beginning on this one.

source=/.scratch00/agdl/em1data/
dest=/nobackup2/PROJECTS/orion/gnc/em1data-copy/
dsync_opts='-X non-lustre --batch-files 100000 --direct'

mpiexec --allow-run-as-root dsync $dsync_opts $source $dest >> dsync-em1.out.$( date +%F-%H-%M )

The scratch filesystem is NFS so this is a an NFS to lustre transfer. Unfortunately, it looks like this one died sometime on Saturday.

[2023-10-14T11:13:45] Copied 262.948 GiB (80%) in 851.497 secs (316.218 MiB/s) 219 secs left ...
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 4679 on node monte exited on signal 9 (Killed).
--------------------------------------------------------------------------

I'm not sure why it died yet - I'll be digging into that some more. However, after poking around in the partially transferred directory structure, it looks like all directories are owned by root. Or at least most of them. Here is an example of the top level directory.

[root@dvicker dsync]# ls -l /.scratch00/agdl/em1data/
total 20381086
drwxrws---  41 root     mpcvdata          57 Mar  6  2023 AR1_MER_GEM
drwxrwxrwx  11 mgualdon cev               11 Mar 21  2023 AR1_MER_POLARIS
-rw-r--r--   1 lhmolina mpcvdata 18450288381 Dec 10  2018 Data_20180219_0300.tgz
drwxr-sr-x   2 root     mpcvdata     2206262 Dec  7  2020 dumap_html
-rw-r--r--   1 root     mpcvdata        2782 Dec  7  2020 dumap.html
-rw-r--r--   1 root     mpcvdata  6744041294 Dec  4  2020 du.txt
drwxr-sr-x   8 choltz   mpcvdata           8 Jun 26  2018 EM1_DATA
drwxr-sr-x   5 choltz   mpcvdata           5 Aug 24  2016 EM1_FSW
drwxrws---   5 lmcev    mpcvdata           5 Jul 13  2020 EM1_iVAC
drwxrws---+ 13 choltz   mpcvdata          13 Jan 25  2023 EM1_RUNS
drwxr-sr-x  20 mfrostad mpcvdata          20 Jan 14  2021 EM1_SLS_Traj
drwxrws---   2 idux     mpcvdata           4 Apr 21  2021 EM1_TMT_Database
drwxr-sr-x   6 dheacock mpcvdata           6 Jul 13  2020 EM1_VAC
drwxr-s---   3 jarnold1 em1tppi            3 Aug  8  2019 EM_Flight_Traj_Limited_Rights
drwxr-sr-x  10 choltz   mpcvdata          16 May  3  2016 jfreechart-1.0.19
drwxrwsrwx   4 jmarcy   mpcvdata           4 Aug 10  2022 jmarcy
-rwxrwxrwx   1 jmarcy   mpcvdata      659220 Jan 19  2021 MEP_28ep9_20201215.xml
drwxrwsrwx   2 mstoft   mpcvdata           2 Jan 11  2022 mstoft
drwxrwsr-x+ 10 bawood   mpcvdata          14 May  8 13:58 orion_veras_archive
drwxr-sr-x   3 lgefert  mpcvdata           3 Feb  7  2019 prop_data
drwxrwsrwx  16 jmarcy   mpcvdata          16 Sep 21  2020 ramtares
drwxrwsrwx   6 mstoft   mpcvdata           7 Feb 10  2021 Rel28eP9Eng_Debug
drwxr-sr-x   5 choltz   mpcvdata           6 Nov 27  2017 socgnc_base
drwxr-s---  36 choltz   em1tppi           39 Oct 17  2022 socgnc_releases
drwxr-sr-x   5 choltz   mpcvdata           5 Mar 22  2016 SSW
[root@dvicker dsync]# ls -l /nobackup2/PROJECTS/orion/gnc/em1data-copy/
total 36212336
drwx------  41 root     root            4096 Oct 14 10:16 AR1_MER_GEM
drwx------  11 root     root            4096 Oct 12 14:42 AR1_MER_POLARIS
-rw-r--r--   1 lhmolina mpcvdata 18450288381 Dec 10  2018 Data_20180219_0300.tgz
drwx------   2 root     root       180535296 Oct 14 10:08 dumap_html
drwx------   8 root     root            4096 Oct 12 14:42 EM1_DATA
drwx------   5 root     root            4096 Oct 12 14:42 EM1_FSW
drwx------   5 root     root            4096 Oct 12 14:42 EM1_iVAC
drwxrwx---+ 13 root     root            4096 Oct 12 14:42 EM1_RUNS
drwx------  20 root     root            4096 Oct 12 14:42 EM1_SLS_Traj
drwx------   2 root     root            4096 Oct 12 14:42 EM1_TMT_Database
drwx------   6 root     root            4096 Oct 12 14:42 EM1_VAC
drwx------   3 root     root            4096 Oct 12 14:42 EM_Flight_Traj_Limited_Rights
drwx------  10 root     root            4096 Oct 14 10:05 jfreechart-1.0.19
drwx------   4 root     root            4096 Oct 12 14:42 jmarcy
drwx------   2 root     root            4096 Oct 12 14:42 mstoft
drwx------+ 10 root     root            4096 Oct 14 10:05 orion_veras_archive
drwx------   3 root     root            4096 Oct 12 14:42 prop_data
drwx------  16 root     root            4096 Oct 12 14:42 ramtares
drwx------   6 root     root            4096 Oct 12 14:42 Rel28eP9Eng_Debug
drwx------   5 root     root            4096 Oct 12 14:42 socgnc_base
drwx------  36 root     root            4096 Oct 12 18:55 socgnc_releases
drwx------   5 root     root            4096 Oct 12 14:42 SSW
[root@dvicker dsync]# 

Is this expected?

dvickernasa commented 9 months ago

A few updates.

The first dsync run of this latest transfer died due to an OOM event on the machine I was running on. I think it was dsync itself that caused the OOM, since this was a pretty isolated machine and nothing else was running there. But I'm not 100% on that. I didn't mention it before, but this is also a single node run.

The sync is restarted and, unfortunately, I'm having similar issues. dsync decided to delete a lot of what it already transferred.

[2023-10-16T11:22:01] Comparing file sizes and modification times of 17155249 items
[2023-10-16T11:22:19] Started   : Oct-16-2023, 11:21:26
[2023-10-16T11:22:19] Completed : Oct-16-2023, 11:22:19
[2023-10-16T11:22:19] Seconds   : 53.312
[2023-10-16T11:22:19] Items     : 17155249
[2023-10-16T11:22:19] Item Rate : 17155249 items in 53.312461 seconds (321786.852653 items/sec)
[2023-10-16T11:22:19] Deleting items from destination
[2023-10-16T11:22:19] Removing 10915151 items

I'm attaching the full output for both the original and the restart for this latest run.

dsync-em1.out.2023-10-16-09-50.txt dsync-em1.out.2023-10-12-12-52.txt

adammoody commented 9 months ago

Ugh, that's frustrating. I suspect that the timestamps for the files on that first run are still not being set correctly.

It looks like the first run successfully completed a number of 100k file-transfer batches before it died (193 batches to be precise).

[2023-10-14T10:56:04] Copied 19300000 of 44656137 items (43.219%)

So at most it should have had to delete about 100k files, which would be the files just from the batch being worked when the run died. Instead, it thought 11m million files needed to be replaced.

It seems like it's either missing setting the timestamps on some files, or perhaps it's hitting an error and not reporting it. I'll review the code again with an eye for those kinds of problem.

You might as well drop --direct for future runs. We now know that doesn't seem to help, and it may slow down your transfers. Thanks for checking that.

Having root ownership left on the directories is expected, because the directory permissions are not updated until all files have been transferred. Since the first run didn't complete, it didn't get a chance to change the ownership on the directories. However, it should have changed ownership and set timestamps on all of those previous files.

It'd be helpful if we could find a reproducer where you don't have to transfer a ton of data. You mentioned some smaller transfers worked on the first attempt.

Do you have any idea of how large you have to get before you see things break?

dvickernasa commented 9 months ago

You might as well drop --direct for future runs. We now know that doesn't seem to help, and it may slow down your transfers. Thanks for checking that.

No problem and I will drop this for future runs.

It'd be helpful if we could find a reproducer where you don't have to transfer a ton of data. You mentioned some smaller transfers worked on the first attempt.

Do you have any idea of how large you have to get before you see things break?

Yes, I agree. I will see if I can find a small-ish dataset to reproduce this.

adammoody commented 9 months ago

This will produce some long output, but let's add some debug statements to print info when dsync sets the timestamp on a file.

Can you modify src/common/mfu_io.c to add the printf and preceding formatting lines as shown below and rebuild?

int mfu_utimensat(int dirfd, const char* pathname, const struct timespec times[2], int flags)
{   
    int rc;
    int tries = MFU_IO_TRIES;
retry:
    errno = 0;
    rc = utimensat(dirfd, pathname, times, flags);
    if (rc != 0) {
        if (errno == EINTR || errno == EIO) {
            tries--;
            if (tries > 0) {
                /* sleep a bit before consecutive tries */
                usleep(MFU_IO_USLEEP);
                goto retry;
            }
        }
    }

    //------------------------------
    char access_s[30];
    char modify_s[30]; 
    size_t access_rc = strftime(access_s, sizeof(access_s) - 1, "%FT%T", localtime(&times[0].tv_sec));
    size_t modify_rc = strftime(modify_s, sizeof(modify_s) - 1, "%FT%T", localtime(&times[1].tv_sec));
    printf("%s rc=%d errno=%d atime_sec=%llu atime_ns=%llu mtime_sec=%llu mtime_ns=%llu atime=%s mtime=%s\n",
        pathname, rc, errno, times[0].tv_sec, times[0].tv_nsec, times[1].tv_sec, times[1].tv_nsec, access_s, modify_s);
    //------------------------------

    return rc;
}

Also add an #include <time.h> statement at the top of the file.

Execute a dsync and if you find any files have mismatching timestamps, look for those same paths in the output. Let's verify that a line is actually printed, that the input timestamps are correct, and that the utimensat call returns with rc=0.

BTW, it seems like it's not too hard to find files with mismatched timestamps, but if you need to scan a lot of files manually, we can probably use dcmp to help with that.

No need to copy the actual output here, but let me know what you find.

dvickernasa commented 9 months ago

Lots of updates....

Code changes

Sure, I can do that. If we keep going down this road, it might be nice to make a branch I can check out and compile against so I'm not making the changes by hand.

Smaller test case

I tried to find a smaller test case. Picked a directory tree with 400k files in 4.8 TB. Did the initial transfer last night in about 2 hours. Repeated the transfer this morning and it only found 1 file to re-transfer. So perhaps this is a better test case to use for testing.

Update on latest large transfer

This is the 40M file case. I had restarted the transfer yesterday but it crashed again just now. If you want me to create a different issue for this, I can. Console output for this crash:

[monte:24626] *** Process received signal ***
[monte:24626] Signal: Segmentation fault (11)
[monte:24626] Signal code: Address not mapped (1)
[monte:24626] Failing at address: 0x80
[monte:24626] [ 0] /lib64/libpthread.so.0(+0xf630)[0x7fc689c8c630]
[monte:24626] [ 1] /software/x86_64/mpi/openmpi-1.8.6-gcc-4.8/lib/openmpi/mca_oob_tcp.so(+0xb1b7)[0x7fc687c5b1b7]
[monte:24626] [ 2] /software/x86_64/mpi/openmpi-1.8.6-gcc-4.8/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_peer_recv_connect_ack+0xbc2)[0x7fc687c5c222]
[monte:24626] [ 3] /software/x86_64/mpi/openmpi-1.8.6-gcc-4.8/lib/openmpi/mca_oob_tcp.so(+0x69f1)[0x7fc687c569f1]
[monte:24626] [ 4] /software/x86_64/mpi/openmpi-1.8.6-gcc-4.8/lib/libopen-pal.so.6(opal_libevent2021_event_base_loop+0x7a4)[0x7fc68aa16924]
[monte:24626] [ 5] mpiexec[0x404d52]
[monte:24626] [ 6] mpiexec[0x403610]
[monte:24626] [ 7] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fc6898d1555]
[monte:24626] [ 8] mpiexec[0x403529]
[monte:24626] *** End of error message ***
./dsync-em1.sh: line 22: 24626 Segmentation fault      mpiexec --allow-run-as-root dsync $dsync_opts $source $dest >> dsync-em1.out.$( date +%F-%H-%M )

Here are the /var/log/messages for this:

Oct 17 10:58:37 monte kernel: dsync[24639]: segfault at 80 ip 00007f6cd1ec01b7 sp 00007f6cd1cafce0 error 4 in mca_oob_tcp.so[7f6cd1eb5000+13000]
Oct 17 10:58:37 monte kernel: dsync[24637]: segfault at 80 ip 00007ff86669d1b7 sp 00007ff86648cce0 error 4 in mca_oob_tcp.so[7ff866692000+13000]
Oct 17 10:58:37 monte kernel: dsync[24683]: segfault at 80 ip 00007f410599e1b7 sp 00007f410578dce0 error 4
Oct 17 10:58:37 monte kernel: dsync[24680]: segfault at 80 ip 00007ff888bf51b7 sp 00007ff8889e4ce0 error 4
Oct 17 10:58:37 monte kernel: dsync[24684]: segfault at 80 ip 00007f3725a891b7 sp 00007f3725878ce0 error 4
Oct 17 10:58:37 monte kernel: dsync[24682]: segfault at 80 ip 00007fd6cae651b7 sp 00007fd6cac54ce0 error 4
Oct 17 10:58:37 monte kernel: in mca_oob_tcp.so[7f3725a7e000+13000]
Oct 17 10:58:37 monte kernel: in mca_oob_tcp.so[7ff888bea000+13000]
Oct 17 10:58:37 monte kernel: 
Oct 17 10:58:37 monte kernel: in mca_oob_tcp.so[7fd6cae5a000+13000]
Oct 17 10:58:37 monte kernel: 
Oct 17 10:58:37 monte kernel: in mca_oob_tcp.so[7f4105993000+13000]
Oct 17 10:58:37 monte kernel: Pid 24670(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24674(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: dsync[24665]: segfault at 80 ip 00007f5faa6201b7 sp 00007f5faa40fce0 error 4 in mca_oob_tcp.so[7f5faa615000+13000]
Oct 17 10:58:37 monte kernel: dsync[24673]: segfault at 80 ip 00007fd64f05b1b7 sp 00007fd64ee4ace0 error 4
Oct 17 10:58:37 monte kernel: dsync[24669]: segfault at 80 ip 00007fd2215721b7 sp 00007fd221361ce0 error 4
Oct 17 10:58:37 monte kernel: dsync[24647]: segfault at 80 ip 00007fdb7e2401b7 sp 00007fdb7e02fce0 error 4
Oct 17 10:58:37 monte kernel: in mca_oob_tcp.so[7fd221567000+13000]
Oct 17 10:58:37 monte kernel: 
Oct 17 10:58:37 monte kernel: in mca_oob_tcp.so[7fdb7e235000+13000]
Oct 17 10:58:37 monte kernel: in mca_oob_tcp.so[7fd64f050000+13000]
Oct 17 10:58:37 monte kernel: Pid 24634(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24636(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24640(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24651(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Pid 24632(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24644(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24664(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24668(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24660(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24658(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24646(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24666(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24635(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24656(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24654(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24633(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24662(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24631(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24638(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24648(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24652(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24642(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24626(mpiexec) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte abrt-hook-ccpp: Process 24629 (dsync) of user 0 killed by SIGSEGV - ignoring (repeated crash)
Oct 17 10:58:37 monte abrt-hook-ccpp: Process 24677 (dsync) of user 0 killed by SIGSEGV - dumping core
Oct 17 10:58:37 monte abrt-hook-ccpp: Process 24672 (dsync) of user 0 killed by SIGSEGV - ignoring (repeated crash)
Oct 17 10:58:37 monte abrt-hook-ccpp: Process 24630 (dsync) of user 0 killed by SIGSEGV - ignoring (repeated crash)
Oct 17 10:58:38 monte kernel: RPC: fragment too large: 612067950
Oct 17 10:59:05 monte abrt-hook-ccpp: Saved core dump of pid 24672 to core.24672 at /home/dvicker/dsync (3906400256 bytes)
Oct 17 10:59:05 monte kernel: RPC: fragment too large: 612067950
Oct 17 10:59:08 monte abrt-hook-ccpp: Saved core dump of pid 24630 to core.24630 at /home/dvicker/dsync (5385498624 bytes)
Oct 17 10:59:11 monte kernel: LNetError: 3293:0:(acceptor.c:431:lnet_acceptor()) Refusing connection from 128.157.254.142: insecure port 40552
Oct 17 10:59:20 monte abrt-hook-ccpp: Saved core dump of pid 24629 to core.24629 at /home/dvicker/dsync (5427765248 bytes)
Oct 17 10:59:21 monte kernel: LNetError: 3293:0:(acceptor.c:431:lnet_acceptor()) Refusing connection from 128.157.254.142: insecure port 34726
Oct 17 10:59:21 monte kernel: LNetError: 3293:0:(acceptor.c:431:lnet_acceptor()) Skipped 9 previous similar messages
Oct 17 10:59:38 monte abrtd: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'ccpp-2023-09-21-11:01:07-18361'
Oct 17 10:59:38 monte abrtd: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'vmcore-127.0.0.1-2018-07-12-08:53:30'
Oct 17 10:59:38 monte abrtd: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'ccpp-2023-07-28-16:59:54-16739'
Oct 17 10:59:38 monte abrt-server: '/var/spool/abrt/oops-2018-08-06-08:55:25-23722-0' is not a problem directory
Oct 17 10:59:39 monte abrt-server: '/var/spool/abrt/oops-2018-08-06-08:55:53-23986-0' is not a problem directory
Oct 17 10:59:39 monte abrt-server: '/var/spool/abrt/oops-2018-08-06-08:55:25-23722-0' is not a problem directory
Oct 17 10:59:40 monte abrt-server: '/var/spool/abrt/oops-2018-08-06-08:55:53-23986-0' is not a problem directory
Oct 17 10:59:40 monte abrt-server: '/var/spool/abrt/oops-2018-08-06-08:55:25-23722-0' is not a problem directory
Oct 17 10:59:41 monte abrt-server: '/var/spool/abrt/oops-2018-08-06-08:55:53-23986-0' is not a problem directory
Oct 17 10:59:41 monte abrt-server: '/var/spool/abrt/oops-2018-08-06-08:55:25-23722-0' is not a problem directory
Oct 17 10:59:42 monte abrt-server: '/var/spool/abrt/oops-2018-08-06-08:55:53-23986-0' is not a problem directory
Oct 17 10:59:42 monte abrt-server: Generating core_backtrace
Oct 17 10:59:45 monte abrt-server: '/var/spool/abrt/oops-2018-08-06-08:55:25-23722-0' is not a problem directory
Oct 17 10:59:45 monte abrt-server: '/var/spool/abrt/oops-2018-08-06-08:55:53-23986-0' is not a problem directory
Oct 17 10:59:45 monte abrt-server: Email address of sender was not specified. Would you like to do so now? If not, 'user@localhost' is to be used [y/N] 
Oct 17 10:59:45 monte abrt-server: Email address of receiver was not specified. Would you like to do so now? If not, 'root@localhost' is to be used [y/N] 
Oct 17 10:59:45 monte abrt-server: Undefined variable outside of [[ ]] bracket
Oct 17 10:59:45 monte abrt-server: Sending an email...
Oct 17 10:59:45 monte abrt-server: Sending a notification email to: root@localhost
Oct 17 10:59:45 monte abrt-server: Email was sent to: root@localhost

This created 4 core files if you want them. This is different than the last crash, which was an OOM event.

I don't want to restart this sync again until I have confidence that its not going to delete a bunch of files again. So lets do some debugging on some smaller cases first.

adammoody commented 9 months ago

Good idea regarding the branch. I've got one with the printf added here:

https://github.com/hpc/mpifileutils/tree/debugtimestamps

I don't want to restart this sync again until I have confidence that its not going to delete a bunch of files again. So lets do some debugging on some smaller cases first.

Yes, that makes total sense. I'm hoping the smaller test case works, as well.

If we can get a stack trace from the core files, that would be helpful.

Do you get something with gdb?

adammoody commented 9 months ago

If it helps, you can use dcmp to scan for files that have mismatching size/mtime values after a dsync using the following command:

dcmp --output EXIST=COMMON@CONTENT=DIFFER:mismatched.txt --text --lite /source /destination

That writes a list of any files to mismatched.txt that dcmp finds to exist in both source and destination paths, but have different contents. The --lite option tells dcmp to compare file "content" by using size/mtime rather than reading and comparing actual file data.

dvickernasa commented 9 months ago

Yes, here are the backtraces from gdb.

core.24629.txt core.24630.txt core.24672.txt core.24677.txt

dvickernasa commented 9 months ago

Thanks for the dcmp command - that will be very helpful. I can use that in combination with rsync to set the times correctly, which is very helpful in salvaging the files that are already transferred.

I'll work on compiling the debug code and running some tests next.

I really appreciate your help!

adammoody commented 9 months ago

Glad to help. Sorry for all of the grief you're hitting with this, but thanks for taking the time to chase it down.

It looks like the core dump stack traces are all somewhere within MPI -- I don't see any mpiFileUtils calls in those traces. With just that info, it's hard to say what led to the segfault. If it's reproducible, maybe you could catch it live under a debugger as the next step. To track that, let's open a new issue. I think it may be an independent problem.

Assuming dsync is actually setting timestamps on each file (to the correct value and without error), then one theory I'm still stuck on is that perhaps a write from some process is being processed by the server after the timestamp has already been set. That late write would end up updating the mtime to the current time, and it could be racy. That seems to fit well with what we're seeing here.

I was hoping the --direct option which enables O_DIRECT would avoid the late writes. From your earlier test, we now know the problem still exists with O_DIRECT, which would seem to rule out that theory. Someone suggested to me today that there might still be cases where O_DIRECT writes could be getting cached, so maybe this idea is still in play.

One more thing I'd like to test is whether forcing an fsync() before closing each file would help. We already have sync() + MPI_Barrier() calls between completing the writes and setting the timestamps. That sync() is meant to flush any cached data to avoid any late write problem, but maybe it's not working for some reason. I have a change in the pipeline that will add an fsync() that I can push to the new branch. We can try that next.

I'd like to first verify that the timestamps are actually being set to the correct values using the printf statement we just added.

dvickernasa commented 9 months ago

I ran dcmp on the partial transfer of my 40M file case and have some questions. Here's the last part of the output.

[2023-10-17T20:26:13] Walked 18572701 items in 1192.205 seconds (15578.441 items/sec)
[2023-10-17T20:26:38] Comparing items
[2023-10-17T20:26:54] Started   : Oct-17-2023, 20:26:38
[2023-10-17T20:26:54] Completed : Oct-17-2023, 20:26:54
[2023-10-17T20:26:54] Seconds   : 15.308
[2023-10-17T20:26:54] Items     : 44656137
[2023-10-17T20:26:54] Item Rate : 44656137 items in 15.308301 seconds (2917119.089945 items/sec)
[2023-10-17T20:26:54] Bytes read: 0.000 B (0 bytes)
[2023-10-17T20:26:54] Byte Rate : 0.000 B/s (000 bytes in 15.308 seconds)
[2023-10-17T20:27:02] Writing to output file: mismatched.txt
[2023-10-17T20:27:55] Wrote 16081904 files in 53.309 seconds (301674.447 files/sec)
Number of items that exist in both directories and have different contents: 8040952 (Src: 8040952 Dest: 8040952), dumped to "mismatched.txt"

Looking at the results,

[root@monte dsync]# grep JIS21-DFTO-15_CASE3-TIG_TO_4155_ASCII.csv mismatched.txt 
-rw-rw---- 7135 7135 596.170 KiB Mar  6 2022 10:10 /.scratch00/agdl/em1data/AR1_MER_GEM/03_02_22_Landing_Site_Select/DFTO-15/JIS21-DFTO-15_CASE3-TIG_TO_4155_ASCII.csv
-rw-rw---- 7135 7135 596.170 KiB Mar  6 2022 10:10 /nobackup2/PROJECTS/orion/gnc/em1data-copy/AR1_MER_GEM/03_02_22_Landing_Site_Select/DFTO-15/JIS21-DFTO-15_CASE3-TIG_TO_4155_ASCII.csv
[root@monte dsync]# 

Those look to be identical in size and time. Just to confirm, I did a regular ls and diff too.

[root@monte dsync]# ls -l /.scratch00/agdl/em1data/AR1_MER_GEM/03_02_22_Landing_Site_Select/DFTO-15/JIS21-DFTO-15_CASE3-TIG_TO_4155_ASCII.csv
-rw-rw---- 1 pgarcia6 pgarcia6 610478 Mar  6  2022 /.scratch00/agdl/em1data/AR1_MER_GEM/03_02_22_Landing_Site_Select/DFTO-15/JIS21-DFTO-15_CASE3-TIG_TO_4155_ASCII.csv
[root@monte dsync]# ls -l /nobackup2/PROJECTS/orion/gnc/em1data-copy/AR1_MER_GEM/03_02_22_Landing_Site_Select/DFTO-15/JIS21-DFTO-15_CASE3-TIG_TO_4155_ASCII.csv
-rw-rw---- 1 pgarcia6 pgarcia6 610478 Mar  6  2022 /nobackup2/PROJECTS/orion/gnc/em1data-copy/AR1_MER_GEM/03_02_22_Landing_Site_Select/DFTO-15/JIS21-DFTO-15_CASE3-TIG_TO_4155_ASCII.csv
[root@monte dsync]# diff /.scratch00/agdl/em1data/AR1_MER_GEM/03_02_22_Landing_Site_Select/DFTO-15/JIS21-DFTO-15_CASE3-TIG_TO_4155_ASCII.csv /nobackup2/PROJECTS/orion/gnc/em1data-copy/AR1_MER_GEM/03_02_22_Landing_Site_Select/DFTO-15/JIS21-DFTO-15_CASE3-TIG_TO_4155_ASCII.csv
[root@monte dsync]# 

So why was this reported as mismatched? This could be a contributor to the dsync issue too.

dvickernasa commented 9 months ago

Sorry, I meant to mention that of the 18M files that have been transferred, its reporting that 16M of them differ, which doesn't seem to make sense.

adammoody commented 9 months ago

Oh, that's interesting. I'm only print the integer units of mtime, but maybe the nanosecond values are different.

Can you run stat on a pair of those files and check the fractional seconds of mtime?

dvickernasa commented 9 months ago

Sure. That does appear to be the issue.

[root@monte dsync]# stat /.scratch00/agdl/em1data/AR1_MER_GEM/03_02_22_Landing_Site_Select/DFTO-15/JIS21-DFTO-15_CASE3-TIG_TO_4155_ASCII.csv /nobackup2/PROJECTS/orion/gnc/em1data-copy/AR1_MER_GEM/03_02_22_Landing_Site_Select/DFTO-15/JIS21-DFTO-15_CASE3-TIG_TO_4155_ASCII.csv
  File: ‘/.scratch00/agdl/em1data/AR1_MER_GEM/03_02_22_Landing_Site_Select/DFTO-15/JIS21-DFTO-15_CASE3-TIG_TO_4155_ASCII.csv’
  Size: 610478      Blocks: 202        IO Block: 1048576 regular file
Device: 3dh/61d Inode: 274783365   Links: 1
Access: (0660/-rw-rw----)  Uid: ( 7135/pgarcia6)   Gid: ( 7135/pgarcia6)
Access: 2022-03-06 10:10:09.757465521 -0600
Modify: 2022-03-06 10:10:10.735468368 -0600
Change: 2022-03-06 10:12:16.267833886 -0600
 Birth: -
  File: ‘/nobackup2/PROJECTS/orion/gnc/em1data-copy/AR1_MER_GEM/03_02_22_Landing_Site_Select/DFTO-15/JIS21-DFTO-15_CASE3-TIG_TO_4155_ASCII.csv’
  Size: 610478      Blocks: 2400       IO Block: 4194304 regular file
Device: f87a3582h/4168758658d   Inode: 198162015687024217  Links: 1
Access: (0660/-rw-rw----)  Uid: ( 7135/pgarcia6)   Gid: ( 7135/pgarcia6)
Access: 2023-10-18 11:28:01.000000000 -0500
Modify: 2022-03-06 10:10:10.000000000 -0600
Change: 2023-10-16 15:50:36.000000000 -0500
 Birth: -
[root@monte dsync]# 
adammoody commented 9 months ago

Ok, and here em1data-copy is the destination file, right?

Now I wonder if dsync set that nanosecond value on the destination to be 0 to begin with, or did it try to set it to a non-zero value but the file system didn't accept it. It'd be good to figure that bit out. I think the output from the printf debug statement should help determine that.

In the meantime, looking at the man page for rsync, it offers some control when comparing mtime values:

-@, --modify-window When comparing two timestamps, rsync treats the timestamps as being equal if they differ by no more than the modify-window value. The default is 0, which matches just integer seconds. If you specify a negative value (and the receiver is at least version 3.1.3) then nanoseconds will also be taken into account. Specifying 1 is useful for copies to/from MS Windows FAT filesystems, because FAT represents times with a 2-second resolution (allowing times to differ from the original by up to 1 second).

If you want all your transfers to default to comparing nanoseconds, you can create a ~/.popt file and put these lines in it:

rsync alias -a -a@-1 rsync alias -t -t@-1

With that as the default, you’d need to specify --modify-window=0 (aka -@0) to override it and ignore nanoseconds, e.g. if you’re copying between ext3 and ext4, or if the receiving rsync is older than 3.1.3.

We may need to add a similar option for dsync.

This fractional second difference may be a second issue. I recall your other examples showed timestamps that differed by calendar date, so they were off by way more than just the nanosecond field.

adammoody commented 9 months ago

Got the file. Dropped the comment just to hide file names. So dsync is setting the nanosecond fields to 0.

rc=0 errno=0 atime_sec=1697655016 atime_ns=0 mtime_sec=1589398871 mtime_ns=0 atime=2023-10-18T13:50:16 mtime=2020-05-13T14:41:11

To double check, if you stat one of the source files used for that test, can you confirm that the fractional seconds are something other than 0?

adammoody commented 9 months ago

I pushed a commit to the debug branch that will print the atime/mtime values for files from the source list, specifically files that exist in the source but not the destination (first time copy). That will tell us whether dsync is reading the source times correctly.

dvickernasa commented 9 months ago

Sorry, I'm switching things back and forth on you quite a bit. Lots of things going on. My apologies, this is making it harder to keep things straight...

Ok, and here em1data-copy is the destination file, right?

Correct.

We may need to add a similar option for dsync.

I agree.

This fractional second difference may be a second issue. I recall your other examples showed timestamps that differed by calendar date, so they were off by way more than just the nanosecond field.

Yes, I agree with this and meant to mention it myself before. I'm doing both NFS to lustre transfers and lustre to lustre transfers here and I think we are potentially running into separate issues. Like you said, I've seen cases where the times in the destination were off by week and months compared to the source. This nanosecond difference is due to filesystem differences.

The 40M file case is an NFS to lustre transfer where the backing filesystem on the NFS side is ZFS. ZFS keeps track of fractional seconds. I'm guessing you know this already, but a lustre filesystem does not keep track of fractional seconds.

The small test case I did with the debug version (dsync-timestamp.2023-10-18-14-05.txt) was a lustre to lustre transfer. I'm trying not to stress our NFS filesystem with too much parallel access if I don't need to.

To double check, if you stat one of the source files used for that test, can you confirm that the fractional seconds are something other than 0?

Just to be clear, that was a lustre to lustre copy so there should have been no fractional seconds in either case. But for completeness, here is one of the source files.

[dvicker@monte dsync]$ stat /nobackup/dvicker/cst-grid-scripts/README.md 
  File: ‘/nobackup/dvicker/cst-grid-scripts/README.md’
  Size: 703         Blocks: 71         IO Block: 4194304 regular file
Device: f90c7a4h/261146532d Inode: 144120192501284956  Links: 1
Access: (0600/-rw-------)  Uid: ( 3179/ dvicker)   Gid: ( 3000/     eg3)
Access: 2023-10-18 14:05:57.000000000 -0500
Modify: 2020-05-13 14:41:09.000000000 -0500
Change: 2023-10-03 17:33:56.000000000 -0500
 Birth: -
[dvicker@monte dsync]$ 

I pushed a commit to the debug branch that will print the atime/mtime values for files from the source list, specifically files that exist in the source but not the destination (first time copy). That will tell us whether dsync is reading the source times correctly.

Sure. I'm assuming you want the same small test I did before, which is a lustre to lustre copy?

FYI, the decoder ring for our systems:

Again, sorry for confusion. I'll try to keep things more consistent (or at least communicate the details more clearly).

dvickernasa commented 9 months ago

I pushed a commit to the debug branch that will print the atime/mtime values for files from the source list, specifically files that exist in the source but not the destination (first time copy). That will tell us whether dsync is reading the source times correctly.

I repeated the same lustre to lustre test I did before with the updated debug branch. Happy to do something else if needed. This was a clean sync with no existing files in the destination.

dsync-timestamp.2023-10-18-17-39.txt

adammoody commented 9 months ago

Ah, I had forgotten (or never realized) that Lustre only uses the integer field in mtime. I'm surprised no one has noted this problem with repeated dsyncs from NFS to Lustre already.

I'll look to add an option like with rsync to flip between checking just seconds and seconds+nanoseconds. In the short term, if you need to do more NFS-to-Lustre transfers before that's available, we can hack that the internal check to only consider the integer portion. That's an easy change.

Coming back to the problem regarding the bigger timestamp differences in your Lustre-to-Lustre transfers, if you can reproduce that problem using the debug branch, the extra output should provide some clues. I just pushed another update to drop an extra printf we likely don't need now that we found the source of the nanosecond problem.

adammoody commented 9 months ago

I repeated the same lustre to lustre test I did before with the updated debug branch. Happy to do something else if needed. This was a clean sync with no existing files in the destination.

Does dcmp report any mismatching timestamps in that most recent Lustre-to-Lustre test?

dvickernasa commented 9 months ago

Not exactly, but I did delete one file in the source and repeated the dsync with the --delete option. I've been wanting to test the --delete option anyway. I used the debug version for this. The sync did what was expected (removed the one file) but also did a bunch of time setting as well. I think it set the time again for all files, which is not consistent with the Updating timestamps on newly copied files output. Results attached.

dsync-timestamp.2023-10-18T18-25.txt

I have some more data for a new, independent transfer as well. Last night I started a lustre to lustre sync we needed. My typical 1 node transfer. 40+ TB so it took most of the night. Here is the initial transfer. I used the non-debug version for this. Output for the first run:

dsync-orion-aero.2023-10-18T16-54.txt

There are some interesting errors in there. I did a dcmp this morning after the transfer was complete. 8 files reported

Number of items that exist in both directories and have different contents: 8 (Src: 8 Dest: 8), dumped to "orion-aero.mismatched.txt"

orion-aero.mismatched.txt

I repeated the dsync for this with the debug code.

dsync-orion-aero.2023-10-19T09-01.txt.gz

Hopefully this is still useful. Or do I need to run the debug code on the initial sync that get the time stamps wrong to begin with?

adammoody commented 9 months ago

which is not consistent with the Updating timestamps on newly copied files output.

I had recently noticed that description is not quite accurate, as well. I plan to update that. Thanks.

Do you find any wrong timestamps after using the debug transfer?

This should also be a transfer from scratch. For any files that you find with dcmp that that have the wrong timestamps, I'd like to then look back at the output from the debug run to see what timestamp values it tried to apply to those destination files.

dvickernasa commented 9 months ago

Do you find any wrong timestamps after using the debug transfer?

No.

This should also be a transfer from scratch. For any files that you find with dcmp that that have the wrong timestamps, I'd like to then look back at the output from the debug run to see what timestamp values it tried to apply to those destination files.

I figured this was the case. I will start using the debug code for all transfers for now. It seems to me like it takes a fairly large transfer to induce this behavior. All of my smaller transfers seem to work the first time.

I'll look to add an option like with rsync to flip between checking just seconds and seconds+nanoseconds. In the short term, if you need to do more NFS-to-Lustre transfers before that's available, we can hack that the internal check to only consider the integer portion. That's an easy change.

I would like to get my NFS to lustre transfer going again. Can you provide a branch for this? I would like to get that transfer going again.

adammoody commented 9 months ago

Sure, this branch ignores the fractional seconds when comparing file mtime values.

https://github.com/hpc/mpifileutils/tree/mtimensec

This does not have the debug statements, so it should be quieter.

dvickernasa commented 9 months ago

Excellent, thanks. I'm running dcmp from this branch on the NFS to lustre transfer to verify it finds no (or few) differences. If so, I'll resume the dsync. I'm assuming the code for time comparison is common between the two and that will work, correct?

I'm trying to run some small tests with the time stamp debug to capture some data there.