hpc / mpifileutils

File utilities designed for scalability and performance.
https://hpc.github.io/mpifileutils
BSD 3-Clause "New" or "Revised" License

dsync not updating times correctly #560

Open dvickernasa opened 1 year ago

dvickernasa commented 1 year ago

Hello,

I recently discovered mpifileutils and they are awesome. I moved about 200 TB in 4 million files between two different lustre filesystems over the weekend. Thank you for your work on these tools. After doing an initial data move, my intent was to keep the data in sync for a period of time using rsync. With the bulk of the data transferred, this should go fast, but it would also catch any file types that dsync doesn't handle (hard links, etc.). Here is the script I'm using.

source=/nobackup/PROJECTS/ccdev/
dest=/nobackup2/PROJECTS/ccdev/

mpiexec --allow-run-as-root dsync -X non-lustre $source $dest > dsync-ccdev.out.$( date +%F )

rsync -aPHASx $source $dest > rsync-ccdev.out.$( date +%F )

The problem I'm running into is that rsync is copying files that dsync already copied. Here is one example of a file that rsync is trying to transfer again.

Source:

238412051 -rw-r----- 1 dvicker eg3 244210724864 Aug 23 16:38 /nobackup/PROJECTS/ccdev/boeing/aero/adb/ar/ar-204.squashfs

Destination:

238487228 -rw-r----- 1 dvicker eg3 244210724864 Sep 29 17:15 /nobackup2/PROJECTS/ccdev/boeing/aero/adb/ar/ar-204.squashfs

The file size is correct but it looks like the date on the destination is wrong - this is the date that the file got created. After poking around in the destination more, it looks like this is prevalent - dates were not preserved. This will trip up rsync's default algorithm of checking by date and size. I could switch to rsync's checksum mode to get around this but that will be quite a bit more time consuming for this dataset.
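For reference, the checksum fallback would just mean adding --checksum to the same rsync invocation, at the cost of reading every file on both sides:

rsync -aPHASx --checksum $source $dest > rsync-ccdev.out.$( date +%F )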

Is this a bug with dsync or am I possibly doing something wrong? I'm using version 0.11.1 of mpifileutils.

dvickernasa commented 12 months ago

This reminds me of a question I wanted to ask. I see that dcp can take input from dwalk to reduce the number of times you walk the actual filesystem. For many of the transfers I'm doing, the source data is static and it would be great to do something similar - especially for this 40M file NFS to lustre transfer. Does it make sense to add --input to dsync? Or is there a way to do that with the existing codes?
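For reference, the dwalk-to-dcp workflow I'm thinking of is roughly this (the list filename is made up and the exact flags may vary by version, so treat it as a sketch):

mpiexec dwalk --output source-list.mfu $source
mpiexec dcp --input source-list.mfu $source $dest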

adammoody commented 12 months ago

Yes, the branch updates both dsync and dcmp so they both compare only the integer (whole-second) portion of the mtime, so you should see consistent results.
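For context, the comparison effectively ignores the nanosecond fields and only looks at whole seconds, something along these lines (a simplified sketch with a made-up function name, not the actual dsync code):

/* simplified sketch: treat two mtimes as equal if their whole-second
 * values match, ignoring st_mtim.tv_nsec, so dsync and dcmp agree */
#include <sys/stat.h>

static int mtime_differs(const struct stat* src, const struct stat* dst)
{
    return src->st_mtim.tv_sec != dst->st_mtim.tv_sec;
}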

I'll look at the --input option. That's a good idea. Obviously, that could provide a stale view of the source in the general case, but it would be helpful on large file systems where files don't change. I can't remember offhand if there might be some gotchas involved.

Crossing my fingers that one of the smaller lustre-to-lustre runs can reproduce the timestamp mismatch. Thanks for taking the time to run these tests.

dvickernasa commented 12 months ago

The dcmp walk is 13M files into the source FS. I'll let you know how that turns out.

Great, thanks, that will be handy.

I tried a very small test a couple of times and everything worked in one shot. I'm now trying the 400k-file, ~5 TB case that had a handful of files that didn't work properly the first time. That takes a couple of hours, so I'll let you know.

dvickernasa commented 12 months ago

OK, the 40M-file NFS to lustre case with the updated branch is looking good. Here are the dcmp results.

[2023-10-19T20:22:41] Walked 18572701 items in 1222.146 seconds (15196.794 items/sec)
[2023-10-19T20:23:06] Comparing items
[2023-10-19T20:23:22] Started   : Oct-19-2023, 20:23:06
[2023-10-19T20:23:22] Completed : Oct-19-2023, 20:23:22
[2023-10-19T20:23:22] Seconds   : 15.899
[2023-10-19T20:23:22] Items     : 44464314
[2023-10-19T20:23:22] Item Rate : 44464314 items in 15.898896 seconds (2796691.924957 items/sec)
[2023-10-19T20:23:22] Bytes read: 0.000 B (0 bytes)
[2023-10-19T20:23:22] Byte Rate : 0.000 B/s (000 bytes in 15.899 seconds)
[2023-10-19T20:23:29] Writing to output file: mismatched.em1.2023-10-19-18-13T18-13-00txt
[2023-10-19T20:23:31] Wrote 81904 files in 2.380 seconds (34419.940 files/sec)
Number of items that exist in both directories and have different contents: 40952 (Src: 40952 Dest: 40952), dumped to "mismatched.em1.2023-10-19-18-13T18-13-00txt"

Starting the transfer again.

dvickernasa commented 12 months ago

The 40M NFS to lustre transfer is working great. Thanks a lot for the updated code.

I've been trying to run a test where I do a clean sync followed by a repeat, to catch a re-transfer of a file whose timestamps didn't get set properly the first time. This is the setup.

ml load mpifileutils/debugtimestamps

source=/nobackup/PROJECTS/ccdev.to-delete/spacex/chutes/                                                   
dest=/nobackup2/dvicker/dsync_test/chutes/
dsync_opts='-X non-lustre --batch-files 100000'

mpiexec --allow-run-as-root drm $dest

mpiexec --allow-run-as-root dsync $dsync_opts $source $dest >> dsync-testing.$( date +%FT%H-%M-%S ).out
mpiexec --allow-run-as-root dsync $dsync_opts $source $dest >> dsync-testing.$( date +%FT%H-%M-%S ).out

I've run it a few times now and I think it works properly the first time for all my tests. Here is the output from the last run.

dsync-testing.2023-10-19T23-24-16.out.gz dsync-testing.2023-10-20T01-30-16.out.gz

Maybe there is still some useful info there. I'll keep running this, hoping to catch a re-transfer but let me know if you have any suggestions.

adammoody commented 12 months ago

The 40M NFS to lustre transfer is working great. Thanks a lot for the updated code.

Great! Thanks again for helping to track that one down. I'll have a PR with an option for that soon.

Yes, your setup for the lustre-to-lustre dsync test looks good. It may help to include a dcmp in between the two dsync commands, since that output makes it a bit easier to see the timestamp mismatches.

If these jobs don't yet reproduce the problem, I suppose the next step would be to start increasing the file count. Hopefully, there is some smallish size that will expose the problem reliably.

adammoody commented 12 months ago

Oh, never mind about adding dcmp. We can spot any differences if we see any files show up during the delete/overwrite portion in the second dsync.

adammoody commented 11 months ago

@dvickernasa , have you had any transfers reproduce the Lustre timestamp problem?

dvickernasa commented 11 months ago

Sorry, this fell off my radar. The "production" transfers I needed to do are done and I didn't get back to my testing. I'm restarting a transfer as a test. I'm going to recreate the original transfer this time, including the original options I used (most notably, without --batch-files). I don't think I've run into a case where this failed when --batch-files was used, and I'm starting to suspect that's why. This new transfer is the exact same set of files from the first post in this issue. The source and destination filesystems are the same as well; I'm just using a different path in the destination FS. Fingers crossed that this recreates it. Details below. I'm using the debugtimestamps branch for both dsync runs, so if we do catch the problem, we should have some good data to go over.

ml load mpifileutils/debugtimestamps

source=/nobackup/PROJECTS/ccdev.to-delete/
dest=/nobackup2/dvicker/dsync_test/ccdev/

dsync_opts='-X non-lustre'

mpiexec --allow-run-as-root dsync $dsync_opts $source $dest >> dsync-testing.$( date +%FT%H-%M-%S ).out
mpiexec --allow-run-as-root dsync $dsync_opts $source $dest >> dsync-testing.$( date +%FT%H-%M-%S ).out

dvickernasa commented 11 months ago

OK, I think we caught a few files this time.

FYI, the source was closed off to regular users so I'm certain it was static.

[root@dvicker dsync]# ls -ld /nobackup/PROJECTS/ccdev.to-delete 
d--------- 5 root root 11776 Aug 24 17:08 /nobackup/PROJECTS/ccdev.to-delete
[root@dvicker dsync]# 

Here is the end of the first transfer:

[dvicker@dvicker dsync]$ tail -15 dsync-testing.2023-11-03T13-27-57.out
[2023-11-04T06:10:03] Updated 1151121 items in 408.878 seconds (2815.316 items/sec)
[2023-11-04T06:10:03] Syncing directory updates to disk.
[2023-11-04T06:10:06] Sync completed in 2.394 seconds.
[2023-11-04T06:10:06] Started: Nov-03-2023,14:05:41
[2023-11-04T06:10:06] Completed: Nov-04-2023,06:10:06
[2023-11-04T06:10:06] Seconds: 57864.748
[2023-11-04T06:10:06] Items: 1151121
[2023-11-04T06:10:06]   Directories: 35775
[2023-11-04T06:10:06]   Files: 1112126
[2023-11-04T06:10:06]   Links: 3220
[2023-11-04T06:10:06] Data: 47.312 TiB (52019905315792 bytes)
[2023-11-04T06:10:06] Rate: 857.345 MiB/s (52019905315792 bytes in 57864.748 seconds)
[2023-11-04T06:10:06] Updating timestamps on newly copied files
[2023-11-04T06:10:06] Completed updating timestamps
[2023-11-04T06:10:06] Completed sync
[dvicker@dvicker dsync]$ 

And the output from immediately running the second dsync.

[2023-11-04T06:41:24] Comparing file sizes and modification times of 1112126 items
[2023-11-04T06:41:24] Started   : Nov-04-2023, 06:41:23
[2023-11-04T06:41:24] Completed : Nov-04-2023, 06:41:24
[2023-11-04T06:41:24] Seconds   : 0.702
[2023-11-04T06:41:24] Items     : 1112126
[2023-11-04T06:41:24] Item Rate : 1112126 items in 0.702232 seconds (1583701.944237 items/sec)
[2023-11-04T06:41:24] Deleting items from destination
[2023-11-04T06:41:24] Removing 3 items
[2023-11-04T06:41:24] Removed 3 items in 0.002 seconds (1491.750 items/sec)
[2023-11-04T06:41:24] Copying items to destination
[2023-11-04T06:41:24] Copying to /nobackup2/dvicker/dsync_test/ccdev
[2023-11-04T06:41:24] Items: 3
[2023-11-04T06:41:24]   Directories: 0
[2023-11-04T06:41:24]   Files: 3
[2023-11-04T06:41:24]   Links: 0
[2023-11-04T06:41:24] Data: 55.123 GiB (18.374 GiB per file)
[2023-11-04T06:41:24] Creating 3 files.
[2023-11-04T06:41:24] Copying data.
[2023-11-04T06:41:45] Copied 10.328 GiB (19%) in 20.811 secs (508.184 MiB/s) 90 secs left ...
[2023-11-04T06:41:58] Copied 21.477 GiB (39%) in 34.094 secs (645.048 MiB/s) 53 secs left ...
[2023-11-04T06:42:21] Copied 34.646 GiB (63%) in 56.483 secs (628.105 MiB/s) 33 secs left ...
[2023-11-04T06:42:21] Copied 55.123 GiB (100%) in 56.483 secs (999.351 MiB/s) done
[2023-11-04T06:42:21] Copy data: 55.123 GiB (59188151153 bytes)
[2023-11-04T06:42:21] Copy rate: 999.351 MiB/s (59188151153 bytes in 56.483 seconds)
[2023-11-04T06:42:21] Syncing data to disk.
[2023-11-04T06:42:23] Sync completed in 2.393 seconds.
[2023-11-04T06:42:23] Setting ownership, permissions, and timestamps.
/nobackup2/dvicker/dsync_test/ccdev/boeing/aerothermal/steam_database_solutions/bp_results/MIIIb_traj_MC_690to700_run_00214.cdat/misc/processed_hifi_trajs.h5 atime_sec=1699081579 atime_ns=0 mtime_sec=1686068964 mtime_ns=0
/nobackup2/dvicker/dsync_test/ccdev/spacex/chutes/video/Crew-3/boat-and-recovery-video/TRaTS Raw Converted Stills/00025276_000000000062B5E5_Lk5x5_B1250_L16384.bmp atime_sec=1699081733 atime_ns=0 mtime_sec=1651831020 mtime_ns=0
/nobackup2/dvicker/dsync_test/ccdev/boeing/aerothermal/steam_database_solutions/bp_results/MIIIb_traj_MC_690to700_run_00214.cdat/misc/processed_hifi_trajs.h5 rc=0 errno=0 atime_sec=1699081579 atime_ns=0 mtime_sec=1686068964 mtime_ns=0 atime=2023-11-04T02:06:19 mtime=2023-06-06T11:29:24
/nobackup2/dvicker/dsync_test/ccdev/spacex/chutes/video/Crew-3/boat-and-recovery-video/TRaTS Raw Converted Stills/00025276_000000000062B5E5_Lk5x5_B1250_L16384.bmp rc=0 errno=0 atime_sec=1699081733 atime_ns=0 mtime_sec=1651831020 mtime_ns=0 atime=2023-11-04T02:08:53 mtime=2022-05-06T04:57:00
/nobackup2/dvicker/dsync_test/ccdev/spacex/chutes/video/CRS-26/boat-and-recovery-video/2023-01-11-SpaceX-CRS26-MX20-MWIR-Overlay-Raw.MOV atime_sec=1699081708 atime_ns=0 mtime_sec=1673436857 mtime_ns=0
/nobackup2/dvicker/dsync_test/ccdev/spacex/chutes/video/CRS-26/boat-and-recovery-video/2023-01-11-SpaceX-CRS26-MX20-MWIR-Overlay-Raw.MOV rc=0 errno=0 atime_sec=1699081708 atime_ns=0 mtime_sec=1673436857 mtime_ns=0 atime=2023-11-04T02:08:28 mtime=2023-01-11T05:34:17
[2023-11-04T06:42:23] Updated 3 items in 0.008 seconds (398.938 items/sec)
[2023-11-04T06:42:23] Syncing directory updates to disk.
[2023-11-04T06:42:25] Sync completed in 2.375 seconds.
[2023-11-04T06:42:25] Started: Nov-04-2023,06:41:24
[2023-11-04T06:42:25] Completed: Nov-04-2023,06:42:25
[2023-11-04T06:42:25] Seconds: 61.262
[2023-11-04T06:42:25] Items: 3
[2023-11-04T06:42:25]   Directories: 0
[2023-11-04T06:42:25]   Files: 3
[2023-11-04T06:42:25]   Links: 0
[2023-11-04T06:42:25] Data: 55.123 GiB (59188151153 bytes)
[2023-11-04T06:42:25] Rate: 921.392 MiB/s (59188151153 bytes in 61.262 seconds)
[2023-11-04T06:42:25] Updating timestamps on newly copied files

One thing stood out to me. The output said it was deleting and re-transferring 3 files, but the subsequent output shows 6 files.

I picked one of those files and grepped it out of the output.

[dvicker@dvicker dsync]$ grep 2023-01-11-SpaceX-CRS26-MX20-MWIR-Overlay-Raw.MOV dsync-testing.2023-11-03T13-27-57.out dsync-testing.2023-11-04T06-10-06.out
dsync-testing.2023-11-03T13-27-57.out:/nobackup/PROJECTS/ccdev.to/nobackup/PROJECTS/ccdev.to-delete/spacex/chutes/video/CRS-26/boat-and-recovery-video/2023-01-11-SpaceX-CRS26-MX20-MWIR-Overlay-Raw.MOV atime_sec=1697860054 atime_ns=0 mtime_sec=1673436857 mtime_ns=0
dsync-testing.2023-11-03T13-27-57.out:/nobackup2/dvicker/dsync_test/ccdev/spacex/chutes/video/CRS-26/boat-and-recovery-video/2023-01-11-SpaceX-CRS26-MX20-MWIR-Overlay-Raw.MOV atime_sec=1697860054 atime_ns=0 mtime_sec=1673436857 mtime_ns=0
dsync-testing.2023-11-03T13-27-57.out:/nobackup2/dvicker/dsync_test/ccdev/spacex/chutes/video/CRS-26/boat-and-recovery-video/2023-01-11-SpaceX-CRS26-MX20-MWIR-Overlay-Raw.MOV rc=0 errno=0 atime_sec=1697860054 atime_ns=0 mtime_sec=1673436857 mtime_ns=0 atime=2023-10-20T22:47:34 mtime=2023-01-11T05:34:17
dsync-testing.2023-11-04T06-10-06.out:/nobackup2/dvicker/dsync_test/ccdev/spacex/chutes/video/CRS-26/boat-and-recovery-video/2023-01-11-SpaceX-CRS26-MX20-MWIR-Overlay-Raw.MOV atime_sec=1699081708 atime_ns=0 mtime_sec=1673436857 mtime_ns=0
dsync-testing.2023-11-04T06-10-06.out:/nobackup2/dvicker/dsync_test/ccdev/spacex/chutes/video/CRS-26/boat-and-recovery-video/2023-01-11-SpaceX-CRS26-MX20-MWIR-Overlay-Raw.MOV rc=0 errno=0 atime_sec=1699081708 atime_ns=0 mtime_sec=1673436857 mtime_ns=0 atime=2023-11-04T02:08:28 mtime=2023-01-11T05:34:17
dsync-testing.2023-11-04T06-10-06.out:/nobackup2/dvicker/dsync_test/ccdev/spacex/chutes/video/CRS-26/boat-and-recovery-video/2023-01-11-SpaceX-CRS26-MX20-MWIR-Overlay-Raw.MOV atime_sec=1699081708 atime_ns=0 mtime_sec=1673436857 mtime_ns=0
dsync-testing.2023-11-04T06-10-06.out:/nobackup2/dvicker/dsync_test/ccdev/spacex/chutes/video/CRS-26/boat-and-recovery-video/2023-01-11-SpaceX-CRS26-MX20-MWIR-Overlay-Raw.MOV rc=0 errno=0 atime_sec=1699081708 atime_ns=0 mtime_sec=1673436857 mtime_ns=0 atime=2023-11-04T02:08:28 mtime=2023-01-11T05:34:17
[dvicker@dvicker dsync]$ sudo ls -l /nobackup2/dvicker/dsync_test/ccdev/spacex/chutes/video/CRS-26/boat-and-recovery-video/2023-01-11-SpaceX-CRS26-MX20-MWIR-Overlay-Raw.MOV
-rw-rw---- 1 sajansse ccp_spacex 17839545344 Jan 11  2023 /nobackup2/dvicker/dsync_test/ccdev/spacex/chutes/video/CRS-26/boat-and-recovery-video/2023-01-11-SpaceX-CRS26-MX20-MWIR-Overlay-Raw.MOV
[dvicker@dvicker dsync]$ 

Even with gzip --best, these output files are over the upload limit, so I had to split them into pieces. I've done that so I can upload them and you can see the entire output.

[dvicker@dvicker dsync]$ split -n 3 dsync-testing.2023-11-03T13-27-57.out{,.split}
[dvicker@dvicker dsync]$ split -n 2 dsync-testing.2023-11-04T06-10-06.out{,.split}

dsync-testing.2023-11-03T13-27-57.out.splitaa.gz dsync-testing.2023-11-03T13-27-57.out.splitab.gz dsync-testing.2023-11-03T13-27-57.out.splitac.gz dsync-testing.2023-11-04T06-10-06.out.splitaa.gz dsync-testing.2023-11-04T06-10-06.out.splitab.gz

adammoody commented 11 months ago

Thanks for running these additional tests.

I think there are just 3 files, but it prints two debug messages for each file.

Looking at the *.h5 file in the set, grep prints these 2 lines in the first dsync:

atime_sec=1696039618 atime_ns=0 mtime_sec=1686068964 mtime_ns=0
rc=0 errno=0 atime_sec=1696039618 atime_ns=0 mtime_sec=1686068964 mtime_ns=0 atime=2023-09-29T21:06:58 mtime=2023-06-06T11:29:24

And then for that file in the second dsync, I see these 4 lines:

atime_sec=1699081579 atime_ns=0 mtime_sec=1686068964 mtime_ns=0
rc=0 errno=0 atime_sec=1699081579 atime_ns=0 mtime_sec=1686068964 mtime_ns=0 atime=2023-11-04T02:06:19 mtime=2023-06-06T11:29:24
atime_sec=1699081579 atime_ns=0 mtime_sec=1686068964 mtime_ns=0
rc=0 errno=0 atime_sec=1699081579 atime_ns=0 mtime_sec=1686068964 mtime_ns=0 atime=2023-11-04T02:06:19 mtime=2023-06-06T11:29:24

So for this particular file, it looks like dsync succeeded each time it set the timestamps (rc=0 every time).

The mtime values did not change between the two runs (mtime_sec=1686068964 mtime_ns=0).

I do see that the access time changed from atime=2023-09-29T21:06:58 to atime=2023-11-04T02:06:19. That's probably due to dsync updating the atime when it reads the source file to begin with. The next release will have the --open-noatime option to avoid that.
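For context, that option works by opening the source files with O_NOATIME, roughly like this (a minimal sketch of the idea with a made-up function name, not the exact dsync code):

/* sketch: open the source read-only without updating its atime;
 * O_NOATIME can fail with EPERM for files the caller does not own,
 * so fall back to a plain open in that case */
#define _GNU_SOURCE
#include <fcntl.h>
#include <errno.h>

static int open_source_noatime(const char* path)
{
    int fd = open(path, O_RDONLY | O_NOATIME);
    if (fd < 0 && errno == EPERM) {
        fd = open(path, O_RDONLY);
    }
    return fd;
}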

Given that the mtime is the same, I'm not sure why dsync decided to delete and re-copy the file. Perhaps the file size was different between the two?

Given that the mtime didn't change on the source file, and knowing that the file system was locked down, I really doubt the size of the source file changed. Maybe dsync never fully copied the file to the destination in the first run, even though it thought it had?

It would have been a good idea to print the file size, too, so that I could verify that. There are a couple of other checks that might also trigger a re-copy, like the file type somehow coming up different. I'll look to add that.

adammoody commented 11 months ago

@dvickernasa , I've updated the debugtimestamps branch. I merged a commit which required a force push, so hopefully that doesn't trip you up.

https://github.com/hpc/mpifileutils/tree/debugtimestamps

Since the mtime handling in the output of your last test looked to be correct, I've disabled the messages that print getting/setting timestamps on each file so that it's not as noisy. I added new messages to print what difference dsync finds between the source/destination when it decides to replace a destination file.

If you have a chance, would you please kick off a repeat of that test?

I realize that takes a while to run, so I appreciate your time. Thanks.

adammoody commented 11 months ago

Oh, this branch does include the new --open-noatime option, too, if you want to use that. I don't think that will interfere with the bug we're currently chasing. It should prevent dsync from updating the source atime values, which should also speed up the second dsync a bit. Without that option, I would expect the second dsync to be setting the timestamps on all destination files to update atime.

dvickernasa commented 11 months ago

I've compiled the update and I'll repeat the test. I'm not going to use the new option for now, just to be consistent. I'll email the output for the latest results when it's done.

adammoody commented 11 months ago

Thanks, @dvickernasa . In this latest test, the second dsync copied 3 files. The new "DIFF" debug statements show that it detected a difference in the size/mtime check:

DIFF size/mtime src ... src_size=41347294011 dst_size=41347294011 src_mtime=1686068964 dst_mtime=1699435045
DIFF size/mtime src ... src_size=5307980 dst_size=5307980 src_mtime=1502986683 dst_mtime=1699435081
DIFF size/mtime src ... src_size=4759 dst_size=4759 src_mtime=1444167688 dst_mtime=1699434866

Here the file size values match, but the mtime values are different. The destination mtimes are all much more recent than the source file mtimes, and the destination mtime values look like they probably match up with the time at which the dsync was running.

From your previous test, we saw that dsync had correctly set the mtime on all files, including the 3 files that were copied by the second run. Unfortunately, I can't verify whether the mtime values had been set correctly in this test, because I dropped those messages.

I suspect that to be the case again, but I'd like to add those messages back and run once more to be certain.

Would you mind pulling those changes and running once more?

Thanks!

adammoody commented 11 months ago

With this next test, the goal is to look for messages verifying that the timestamps are being set correctly at first, which would mean they must somehow be getting changed again behind the scenes.

I'm back to wondering if this is due to outstanding writes that are buffered on the server and then getting applied after the timestamps have been set. Those "late writes" would change the mtime to the time that last write is applied. dsync has a sync() call and then an MPI_Barrier() which is meant to finalize all writes before setting any timestamps, but maybe that's not sufficient.

write all data
sync()
MPI_Barrier(MPI_COMM_WORLD);
set timestamps
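In code form, that sequence amounts to something like this (a stripped-down sketch of the idea; the function name and details are mine, not dsync's):

/* sketch: every rank flushes its own writes, all ranks synchronize,
 * and only then do ranks start copying timestamps to the destination */
#include <mpi.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>

static void flush_then_stamp(const char* dst_path, const struct stat* src_st)
{
    sync();                              /* flush this client's dirty pages */
    MPI_Barrier(MPI_COMM_WORLD);         /* wait until every rank has flushed */

    struct timespec times[2] = { src_st->st_atim, src_st->st_mtim };
    utimensat(AT_FDCWD, dst_path, times, 0);   /* copy atime/mtime from source */
}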

For this test, you don't need to send the full output to me. You can grep for "DIFF" in the second run to find any files that were re-copied. Even one file should be sufficient. Then also grep for the source and destination path of that file in both the first and second run output. We should only need those lines.

Assuming that behaves as expected, I'd like to run one more test where we modify dsync to call fsync() before closing every file. I'd like to check whether that forces the writes to happen.

Thanks again for your help with this, @dvickernasa .

adilger commented 11 months ago

@adammoody just following along at home, and wanted to note that sync() on one client node is not going to flush out dirty data on all of the other client nodes; otherwise the servers would see a crippling number of sync writes, since every connected client would be triggering sync writes on every other client. Also, IIRC, sync() at the OS level may start writes, but does not necessarily wait for all writes on all filesystems to finish (this has changed over the years between kernels).

Once the data has been flushed from the clients to the servers, Lustre will generally write it to storage in order, so a later "sync" will ensure all earlier writes are also committed to storage.

The "guaranteed" way to ensure all of the data is flushed is to call fsync(fd) or fdatasync(fd) on each files written by each node. If dsync is writing files in batches, then it could f(data)sync(fd) in reverse order (and/or after some delay) so that most of the time it is not waiting for any data to be written or actually triggering a sync on the storage.

Calling the llapi_file_flush() interface will revoke the locks and cache on all clients to flush their data, but of course has no control over whether there are still active processes writing new data to the file, so that would need to be done at the end (I don't know if there is any coordination between writers?). Internally, this is calling llapi_get_data_version(). For debugging, calling llapi_get_data_version() when the data writers are finished, and then again before setting the timestamp, would tell you if the data has been modified on any OST objects since it was supposedly flushed.

Since these APIs could trigger sync writes for every file accessed (if they have dirty data), it would be best to call them on a file only when all of the writers are done, just before the timestamp is reset. That should avoid too much overhead.
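For that data-version check, the debugging comparison would look roughly like this (a sketch with a made-up function name; double-check the prototypes and flags in lustreapi.h):

/* sketch: grab the data version once the writers are done, then again just
 * before setting the timestamp; if it changed, some OST object was written
 * after the data was supposedly flushed */
#include <lustre/lustreapi.h>

static int data_changed_since(int fd, __u64 baseline)
{
    __u64 dv = 0;
    if (llapi_get_data_version(fd, &dv, LL_DV_RD_FLUSH) != 0) {
        return -1;           /* could not read the data version */
    }
    return dv != baseline;   /* nonzero means late writes landed */
}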

While there is an S_NOCMTIME flag that we set on inodes when writing on the OSS (so that the local server clock does not mess with the client timestamps) there is unfortunately no kernel API to set this on the clients.

adammoody commented 11 months ago

Thanks, @adilger . I was hoping you might still be following this thread. The global barrier should ensure that sync has been called by all ranks before any rank starts to set timestamps. However, if sync returns before all data has been written by the servers, then that could lead to the problem that we're seeing.

It sounds like I'll need the fsync or fdatasync for this. We used to have an fsync in there. I think we took that out to improve performance and added the single sync() at the end in the hopes it would accomplish the same thing but more efficiently. I like the idea of reversing the order of fsync calls to speed things up, though I might need to reduce the batch size to avoid keeping too many files open.

For a large file that gets split across clients, I presume a single call to llapi_file_flush would be more efficient than having each client call fsync individually? That might be a good Lustre-specific optimization.

dvickernasa commented 11 months ago

@adammoody, the latest test completed. It didn't re-transfer files on the 2nd dsync this time so I'm not sure how useful the results are but I'll send them to you anyway.

I'm pulling the latest changes and repeating the test now.

adammoody commented 11 months ago

Thanks, @dvickernasa . Right, unfortunately it looks like this latest run didn't trigger the problem. Hopefully, the next run will catch it. Thanks as always for your help.

When you run using that latest branch, it will be quite verbose again as it prints a message whenever getting/setting timestamps.

dvickernasa commented 11 months ago

Sorry for the delay. I was at SC23 all last week and didn't get back to this. Looks like my latest transfer with the updated code caught one file.

[2023-11-14T03:38:05] Comparing file sizes and modification times of 1112126 items
[2023-11-14T03:38:05] Started   : Nov-14-2023, 03:38:05
[2023-11-14T03:38:05] Completed : Nov-14-2023, 03:38:05
[2023-11-14T03:38:05] Seconds   : 0.721
[2023-11-14T03:38:05] Items     : 1112126
[2023-11-14T03:38:05] Item Rate : 1112126 items in 0.721121 seconds (1542218.650501 items/sec)
[2023-11-14T03:38:05] Deleting items from destination
[2023-11-14T03:38:05] Removing 1 items
[2023-11-14T03:38:05] Removed 1 items in 0.001 seconds (964.208 items/sec)
[2023-11-14T03:38:05] Copying items to destination
[2023-11-14T03:38:05] Copying to /nobackup2/dvicker/dsync_test/ccdev
[2023-11-14T03:38:05] Items: 1
[2023-11-14T03:38:05]   Directories: 0
[2023-11-14T03:38:05]   Files: 1
[2023-11-14T03:38:05]   Links: 0
[2023-11-14T03:38:05] Data: 917.512 MiB (917.512 MiB per file)
[2023-11-14T03:38:05] Creating 1 files.
[2023-11-14T03:38:05] Copying data.
[2023-11-14T03:38:09] Copy data: 917.512 MiB (962081164 bytes)
[2023-11-14T03:38:09] Copy rate: 236.609 MiB/s (962081164 bytes in 3.878 seconds)
[2023-11-14T03:38:09] Syncing data to disk.
[2023-11-14T03:38:12] Sync completed in 2.778 seconds.
[2023-11-14T03:38:12] Setting ownership, permissions, and timestamps.           

@adammoody I'll email you the latest output.

adammoody commented 11 months ago

Thanks again, @dvickernasa . No problem, and I hope you had a good time at SC.

Yes, this set of runs captured good evidence of what we were looking for. Copying bits from your email, here is the note about the second run that finds a difference in mtime:

DIFF /src/file.mpg src_size=962081164 src_mstr=2023-04-15T08:54:32 src_mtime=1681566872 src_mtime2=1681566872
DIFF /dst/file.mpg dst_size=962081164 dst_mstr=2023-11-13T23:02:23 dst_mtime=1699938143 dst_mtime2=1699938143

The source file has an mtime of 2023-04-15T08:54:32 while the destination file has something more recent at 2023-11-13T23:02:23.

However, for this file, I see that the output from the first run reports that it had set the mtime to the correct value 2023-04-15T08:54:32 and without error rc=0:

dsync-testing.2023-11-13T10-00-06.out:dst /dst/file.mpg rc=0 errno=0 atime_sec=1699621797 atime_ns=0 mtime_sec=1681566872 mtime_ns=0 atime=2023-11-10T07:09:57 mtime=2023-04-15T08:54:32

Let me update the branch to enable the fsync() before closing each file. This may slow things down during the copy, but I'd like to see whether the problem still shows up in that case.

adammoody commented 11 months ago

@dvickernasa , I've pushed a commit to the debugtimestamps branch that forces the fsync during the copy.

https://github.com/hpc/mpifileutils/commit/c17d0c6566584167182ab025199a9e31c65016fe

When you get a chance, would you please pull this change, rebuild, and try another test?

This will add a lot of fsync calls, which may slow down the copy speed. It will also change the timing of things.

Given that this looks to be a race condition, the timing change alone might hide the problem. It'll be hard to trust that it's really gone without running more tests or running larger. On the other hand, if we still see the problem show up with fsync, we might then try the llapi_file_flush API.

I think what we're seeing is that some outstanding writes are being buffered by a server and applied after the timestamp. I'm looking for different ways to force those writes to finish.

dvickernasa commented 11 months ago

I'm re-running my test with the updated code. Should be finished in the morning. Since I've transferred this same set of files a few times now, we should get a good idea of whether the extra fsyncs slow things down much.

adammoody commented 11 months ago

@adilger , assuming this is what is happening, I see the following in the notes section of the man 2 sync page on my system:

According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but may return before the actual writing is done. However Linux waits for I/O completions, and thus sync() or syncfs() provide the same guarantees as fsync called on every file in the system or filesystem respectively.

dsync currently uses sync as though it flushes (to disk) any data that the calling process has written. I think the Linux behavior described above would be sufficient, but the POSIX.1-2001 specification would not be.

Do you know how Lustre treats sync?

From your earlier comments, it sounds like it may be more along the lines of POSIX, where sync starts the writes but does not wait for them to finish. You've already described that above, but I just wanted to confirm. I feel like the text in the man page confuses things, since I know Lustre is POSIX-compliant, but this is also on a Linux system.

adilger commented 11 months ago

@adammoody I don't think sync() will flush more than a single client's dirty writes to the filesystem.

dvickernasa commented 11 months ago

The updated test ran this morning. The 2nd transfer did not re-transfer anything, so that's encouraging. I'll be out the rest of the week for Thanksgiving, but I'm going to put this in a loop and run it several times while I'm gone, so we'll have some more data next week.

Also, here is some data on how the extra syncs affected the transfer times. There is a lot of variation in the runs, but it would appear that there isn't a huge effect. Still, it might be nice to add some (potentially just debug) timing statements to directly calculate the overhead of calling the sync (see the sketch after the rate output below).

[dvicker@dvicker dsync]$ for file in  dsync-testing.*.out; do
> grep Rate: $file | tail -1 
> done
[2023-10-16T20:12:58] Rate: 730.639 MiB/s (5357437501746 bytes in 6992.849 seconds)
[2023-10-17T10:19:23] Rate: 35.958 B/s (753 bytes in 20.941 seconds)
[2023-10-19T20:44:39] Rate: 756.352 MiB/s (5736324718486 bytes in 7232.852 seconds)
[2023-10-20T14:05:21] Rate: 655.065 MiB/s (5736324718486 bytes in 8351.211 seconds)
[2023-10-20T16:24:18] Rate: 700.910 MiB/s (5736324718486 bytes in 7804.975 seconds)
[2023-10-20T18:43:53] Rate: 693.610 MiB/s (5736324718486 bytes in 7887.125 seconds)
[2023-10-20T21:00:51] Rate: 701.543 MiB/s (5736324718486 bytes in 7797.935 seconds)
[2023-10-20T23:18:05] Rate: 703.717 MiB/s (5736324718486 bytes in 7773.846 seconds)
[2023-10-21T12:12:59] Rate: 479.823 MiB/s (2842477276278 bytes in 5649.579 seconds)
[2023-11-04T06:10:06] Rate: 857.345 MiB/s (52019905315792 bytes in 57864.748 seconds)
[2023-11-04T06:42:25] Rate: 921.392 MiB/s (59188151153 bytes in 61.262 seconds)
[2023-11-08T07:21:31] Rate: 843.019 MiB/s (52019905315792 bytes in 58848.101 seconds)
[2023-11-08T07:58:05] Rate: 877.777 MiB/s (41352606750 bytes in 44.928 seconds)
[2023-11-10T11:29:01] Rate: 836.292 MiB/s (52019905315792 bytes in 59321.474 seconds)
[2023-11-14T03:05:10] Rate: 835.834 MiB/s (52019905315792 bytes in 59353.928 seconds)
[2023-11-14T03:38:15] Rate: 98.241 MiB/s (962081164 bytes in 9.339 seconds)
[2023-11-21T09:28:14] Rate: 814.605 MiB/s (52019905315792 bytes in 60900.757 seconds)
[dvicker@dvicker dsync]$ 
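For the timing idea mentioned above, something as simple as this around the flush calls would probably do it (a debug-only sketch with made-up function names, not a patch):

/* sketch: time each rank's flush calls and report the worst case,
 * to see what the extra per-file syncs actually cost */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

static double timed_fsync(int fd)
{
    double start = MPI_Wtime();
    fsync(fd);
    return MPI_Wtime() - start;
}

static void report_flush_time(double local_secs, int rank)
{
    double max_secs = 0.0;
    MPI_Reduce(&local_secs, &max_secs, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        printf("fsync overhead (max across ranks): %.3f s\n", max_secs);
    }
}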

adammoody commented 11 months ago

That's great. Thanks, @dvickernasa . Have a Happy Thanksgiving!

ofaaland commented 10 months ago

Hi @dvickernasa, hope you had a good Thanksgiving.

Is the destination file system built with OSTs backed by ZFS, and if so, is "options osd_zfs osd_object_sync_delay_us=" set on those servers? If so I might have an explanation for what we're seeing (although I'm not sure). Thanks!

dvickernasa commented 10 months ago

@adammoody, I got 6 more iterations on this done over the break. It looks like all of them found more data to copy on the 2nd transfer.

[dvicker@dvicker dsync]$ grep -e Removing dsync-testing.1.2023-11-22T11-04-07.out dsync-testing.2.2023-11-23T05-21-15.out dsync-testing.3.2023-11-24T00-25-20.out dsync-testing.4.2023-11-24T19-19-35.out dsync-testing.5.2023-11-25T13-22-08.out dsync-testing.6.2023-11-26T06-59-07.out
dsync-testing.1.2023-11-22T11-04-07.out:[2023-11-22T11:41:24] Removing 2 items
dsync-testing.2.2023-11-23T05-21-15.out:[2023-11-23T06:03:11] Removing 5 items
dsync-testing.3.2023-11-24T00-25-20.out:[2023-11-24T01:08:11] Removing 1 items
dsync-testing.4.2023-11-24T19-19-35.out:[2023-11-24T19:57:52] Removing 1 items
dsync-testing.5.2023-11-25T13-22-08.out:[2023-11-25T13:56:00] Removing 2 items
dsync-testing.6.2023-11-26T06-59-07.out:[2023-11-26T07:36:36] Removing 1 items
[dvicker@dvicker dsync]$ ll dsync-testing.?.*.out
 54186 -rw-r--r-- 1 root root 328067432 Nov 22 11:04 dsync-testing.1.2023-11-21T17-21-53.out
 67614 -rw-r--r-- 1 root root 321097260 Nov 22 11:52 dsync-testing.1.2023-11-22T11-04-07.out
 54351 -rw-r--r-- 1 root root 328053051 Nov 23 05:21 dsync-testing.2.2023-11-22T11-55-54.out
 68126 -rw-r--r-- 1 root root 322586738 Nov 23 06:15 dsync-testing.2.2023-11-23T05-21-15.out
 55228 -rw-r--r-- 1 root root 328051558 Nov 24 00:25 dsync-testing.3.2023-11-23T06-18-43.out
 68071 -rw-r--r-- 1 root root 322618344 Nov 24 01:29 dsync-testing.3.2023-11-24T00-25-20.out
 55603 -rw-r--r-- 1 root root 328055348 Nov 24 19:19 dsync-testing.4.2023-11-24T01-32-36.out
 68034 -rw-r--r-- 1 root root 322569590 Nov 24 20:20 dsync-testing.4.2023-11-24T19-19-35.out
 54397 -rw-r--r-- 1 root root 328050794 Nov 25 13:22 dsync-testing.5.2023-11-24T20-23-39.out
 67888 -rw-r--r-- 1 root root 322593290 Nov 25 14:07 dsync-testing.5.2023-11-25T13-22-08.out
 54698 -rw-r--r-- 1 root root 328059803 Nov 26 06:59 dsync-testing.6.2023-11-25T14-10-39.out
 68098 -rw-r--r-- 1 root root 322600133 Nov 26 07:37 dsync-testing.6.2023-11-26T06-59-07.out
[dvicker@dvicker dsync]$ 

I'll send you some more details via email.

@ofaaland, for the testing I've been doing in the last several posts (and most of the other examples in this issue), the source LFS does have ZFS OSTs. The destination LFS is ldiskfs. Even so, we are setting the parameter you mentioned to zero.

[root@hpfs-fsl-oss00 ~]# cat /etc/modprobe.d/lustre.conf
# Lustre modprobe configuration

options lnet networks=tcp0(enp4s0),o2ib0(ib1),o2ib1(ib0)
options ko2iblnd map_on_demand=32
options osd_zfs osd_txg_sync_delay_us=0 osd_object_sync_delay_us=0
options zfs zfs_txg_history=120 zfs_txg_timeout=15 zfs_prefetch_disable=1 zfs_txg_timeout=30 zfs_vdev_scheduler=deadline

[root@hpfs-fsl-oss00 ~]# 

ofaaland commented 10 months ago

Hi @dvickernasa Since the destination LFS is ldiskfs, that rules out my theory. Back to the drawing board for us. Thank you.

adammoody commented 10 months ago

Thanks for that additional test output, @dvickernasa . This shows that the problem is still there even if we fsync() every file. I was hoping that might be a workaround, but it's instructive to know that it's not.

I've pushed a couple more commits to the branch.

The first one calls llapi_file_flush() as @adilger suggested. I left the fsync in there, so now dsync will be doing both. I'd like to see whether this might do better than fsync. As added, this will only work properly when the destination file system is Lustre.

I also added a call to stat the destination file and check the timestamps immediately after setting them. This will print a new DIFF message if it finds a difference. For any file that is re-copied in the second run, if we do not see those new DIFF messages show up in the output of the first run, then it helps further verify that the mtime was set correctly to begin with.
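The new check is essentially set-then-stat-then-compare, along these lines (a simplified sketch with a made-up function name; the actual debug code differs in the details):

/* sketch: set the destination timestamps, stat the file right back, and
 * print a DIFF line if the mtime seconds no longer match what was just set */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

static int set_and_verify_mtime(const char* path, struct timespec atime, struct timespec mtime)
{
    struct timespec times[2] = { atime, mtime };
    if (utimensat(AT_FDCWD, path, times, 0) != 0) {
        return -1;
    }
    struct stat st;
    if (stat(path, &st) != 0) {
        return -1;
    }
    if (st.st_mtim.tv_sec != mtime.tv_sec) {
        printf("DIFF %s expected mtime=%lld got mtime=%lld\n",
               path, (long long)mtime.tv_sec, (long long)st.st_mtim.tv_sec);
        return 1;
    }
    return 0;
}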

When you have a chance, would you please pull, rebuild, and submit a new round of tests?

dvickernasa commented 10 months ago

I've started my test with the updated code.

dvickernasa commented 10 months ago

@adammoody, we talked a while ago about adding the --input option to dsync. If it's easy, that would come in handy for all these tests. It takes about 30 minutes to walk the source data for the test I've been running (which is on our older lustre filesystem).

adammoody commented 10 months ago

Oh, right. That slipped my mind. I'll work on that.

dvickernasa commented 10 months ago

The latest test finished (sorry for the delay). This run found 9 files to re-transfer.

[2023-11-29T11:34:37] Started   : Nov-29-2023, 11:34:36
[2023-11-29T11:34:37] Completed : Nov-29-2023, 11:34:37
[2023-11-29T11:34:37] Seconds   : 0.717
[2023-11-29T11:34:37] Items     : 1112126
[2023-11-29T11:34:37] Item Rate : 1112126 items in 0.717182 seconds (1550688.322885 items/sec)
[2023-11-29T11:34:37] Deleting items from destination
[2023-11-29T11:34:37] Removing 9 items
[2023-11-29T11:34:37] Removed 9 items in 0.019 seconds (473.855 items/sec)
[2023-11-29T11:34:37] Copying items to destination
[2023-11-29T11:34:37] Copying to /nobackup2/dvicker/dsync_test/ccdev
[2023-11-29T11:34:37] Items: 9
[2023-11-29T11:34:37]   Directories: 0
[2023-11-29T11:34:37]   Files: 9
[2023-11-29T11:34:37]   Links: 0
[2023-11-29T11:34:37] Data: 101.952 GiB (11.328 GiB per file)
[2023-11-29T11:34:37] Creating 9 files.
[2023-11-29T11:34:37] Copying data.
[2023-11-29T11:34:51] Copied 9.484 GiB (9%) in 14.754 secs (658.244 MiB/s) 144 secs left ...
[2023-11-29T11:36:20] Copied 20.059 GiB (20%) in 103.611 secs (198.241 MiB/s) 423 secs left ...
[2023-11-29T11:36:20] Copied 101.952 GiB (100%) in 103.611 secs (0.984 GiB/s) done
[2023-11-29T11:36:20] Copy data: 101.952 GiB (109470191676 bytes)
[2023-11-29T11:36:20] Copy rate: 0.984 GiB/s (109470191676 bytes in 103.611 seconds)
[2023-11-29T11:36:20] Syncing data to disk.
[2023-11-29T11:36:23] Sync completed in 2.766 seconds.
[2023-11-29T11:36:23] Setting ownership, permissions, and timestamps.

adammoody commented 10 months ago

Thanks, @dvickernasa . Wow, this problem is stubborn. So it would seem the llapi_file_flush() calls are also not helping.

Based on the data from the latest run that you emailed to me separately, I don't see any of the new DIFF messages that would have been printed by the first run if it detected a wrong mtime immediately after it set the timestamp. That further suggests that the timestamps are being set correctly initially and then being changed again later.

Even more puzzling is that several of the files that have incorrect mtimes are of size=0. There should not be any write() calls issued for those files. Maybe this is not a buffered write problem after all, but something else. Let me take another look through the code with that in mind.

BTW, I am working to have the --input-src option ready for your next run.
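Once that lands, the intended workflow would be along these lines (hypothetical until the change merges, and the list filename is made up; exact flags may change):

mpiexec dwalk --output ccdev-src.mfu $source
mpiexec dsync --input-src ccdev-src.mfu $dsync_opts $source $dest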