hpc / mpifileutils

File utilities designed for scalability and performance.
https://hpc.github.io/mpifileutils
BSD 3-Clause "New" or "Revised" License

[Request] run ddup without changing atime #557

Closed markmoe19 closed 1 year ago

markmoe19 commented 1 year ago

Could ddup be enhanced to restore the last accessed time (atime) of a file after reading it for de-duplication comparison? We find atime useful for alerting users to unused files to clean up. We would also like to use ddup, but it changes atime.

adammoody commented 1 year ago

That's a good idea, and it's related to the O_NOATIME request in https://github.com/hpc/mpifileutils/pull/534 for reading files in dcp/dsync, which can be used to avoid modifying the atime value.

adammoody commented 1 year ago

As a test, can you check whether adding the O_NOATIME flag to the open call in ddup works:

https://github.com/hpc/mpifileutils/blob/47918154ea0f4895623f36ccf8cbfe2df477c3ae/src/ddup/ddup.c#L96

so that this line changes to

mfu_open(fname, O_RDONLY | O_NOATIME)

As noted in https://github.com/hpc/mpifileutils/pull/534, in addition to keeping atime the same, that could also improve read performance. It cuts out a bunch of atime updates going to the Lustre metadata server.

I like this as a general improvement for ddup, but I'll need to verify that this doesn't create problems for file systems that might not support the O_NOATIME flag. If nothing else, we could enable the flag by default and add a new command line option to drop it.

markmoe19 commented 1 year ago

I got the error below. It looks like src/dcp1/compare.c has O_NOATIME in some commented-out code; maybe it was considered there at one time as well?

/project/selene-admin/mpifileutils_debug/mpifileutils-v0.11.1/mpifileutils/src/ddup/ddup.c:102:41: error: ‘O_NOATIME’ undeclared (first use in this function)
  102 |     int fd = mfu_open(fname, O_RDONLY | O_NOATIME);
      |                                         ^~~~~

adammoody commented 1 year ago

Oh, we probably also need to add a #define _GNU_SOURCE statement before any includes in order to pick up the definition for O_NOATIME. Can you try again after adding #define _GNU_SOURCE to the very top of the ddup.c file?
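The ordering requirement described above can be illustrated with a small standalone sketch. It uses plain POSIX `open` rather than `mfu_open`, and the helper name `open_readonly_noatime` is hypothetical (it does not exist in mpifileutils). On Linux with glibc, `O_NOATIME` is only declared by `<fcntl.h>` when `_GNU_SOURCE` is defined before the first include.

```c
/* _GNU_SOURCE must come before any #include so that <fcntl.h>
 * declares the Linux-specific O_NOATIME flag. */
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper (not part of mpifileutils): open a file read-only
 * and ask the kernel not to update its access time.
 * Returns a file descriptor, or -1 on failure with errno set by open(). */
int open_readonly_noatime(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY | O_NOATIME);
    if (fd < 0) {
        /* Report why the open failed, e.g. ENOENT or EPERM. */
        perror(path);
    }
    return fd;
}
```

If the `#define` is moved below the includes, the compile error shown earlier in this thread is the likely result.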

markmoe19 commented 1 year ago

Ok, great, I got this to compile now, just need to test ...

Does dsync change atime? This made me think that O_NOATIME would be a nice addition to ddup, dsync, and the other tools, as we don't want to change atime or mtime while looking for old files to clean up or move. :)

adammoody commented 1 year ago

Yes, other tools currently change atime as well. After we confirm that this is working in ddup, I'll work to add O_NOATIME to dcmp, dcp, and dsync. dtar would be another potential target.

markmoe19 commented 1 year ago

It works nicely. Atime is not changed and duplicate files are found, see attached snippet of text showing "ls -alu" output for atime. (I did cut out some of the file list to trim down the snippet size). Thanks!

snippet.txt
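The same check that was done here with "ls -alu" could also be scripted. The following is a minimal sketch, assuming Linux and a file owned by the caller; the helper name `atime_preserved_after_read` is hypothetical and is not something ddup provides. It records atime with `stat`, reads the whole file through a descriptor opened with `O_NOATIME`, and compares atime afterwards.

```c
#define _GNU_SOURCE          /* needed for O_NOATIME, must precede includes */
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: read a file with O_NOATIME and report whether its
 * access time survived. Returns 1 if atime is unchanged, 0 if it changed,
 * and -1 if any system call failed. */
int atime_preserved_after_read(const char *path)
{
    struct stat before, after;
    char buf[4096];
    ssize_t n;

    assert(path != NULL);

    if (stat(path, &before) != 0) {
        return -1;
    }

    int fd = open(path, O_RDONLY | O_NOATIME);
    if (fd < 0) {
        return -1;
    }

    /* Read to end of file, as a de-duplication pass would. */
    while ((n = read(fd, buf, sizeof buf)) > 0) {
        /* contents are discarded; only the read itself matters here */
    }
    close(fd);
    if (n < 0) {
        return -1;
    }

    if (stat(path, &after) != 0) {
        return -1;
    }

    /* Compare both seconds and nanoseconds of the access timestamp. */
    return before.st_atim.tv_sec == after.st_atim.tv_sec &&
           before.st_atim.tv_nsec == after.st_atim.tv_nsec;
}
```

Note that a result of 1 does not by itself prove the flag was honored: a file system mounted with noatime, for example, would leave atime untouched regardless.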

adammoody commented 1 year ago

Ok, good. Thanks for testing. I'll make that O_NOATIME change to ddup and start looking at the other tools.

markmoe19 commented 1 year ago

Some adventures in atime:

markmoe19 commented 1 year ago

Looks like newer versions of rsync can preserve atime on both source and target, which seems ideal for our use case. I think that would be a nice goal for dsync as well. :)

https://unix.stackexchange.com/questions/630228/rsync-keep-access-time-atime-how

Since rsync version 3.2.0, there are two flags that affect atimes:

--atimes, -U     preserve access (use) times
--open-noatime   avoid changing the atime on opened files

The full description of these is:

   --atimes, -U
          This tells rsync to set the access (use) times of the destination
          files to the same value as the source files.

          If repeated, it also sets the --open-noatime option,  which  can
          help you to make the sending and receiving systems have the same
          access times on the transferred files  without  needing  to  run
          rsync an extra time after a file is transferred.

          Note  that  some  older rsync versions (prior to 3.2.0) may have
          been built with a pre-release --atimes patch that does not imply
          --open-noatime when this option is repeated.

   --open-noatime
          This tells rsync to open files with the O_NOATIME flag (on systems
          that support it) to avoid changing the access time of the
          files  that  are being transferred.  If your OS does not support
          the O_NOATIME flag then rsync will silently ignore this  option.
          Note  also  that  some filesystems are mounted to avoid updating
          the atime on read access even without the O_NOATIME  flag  being
          set.

So my reading of these (which testing has borne out) is that the following will both keep rsync from updating the atime on src, and will copy the atime of src to dest:

rsync -UU

adammoody commented 1 year ago

Thanks, @markmoe19. I just merged https://github.com/hpc/mpifileutils/pull/561 to add a new --open-noatime option to various tools to enable this.

Since adding O_NOATIME can lead to an error for normal users when reading files they don't own, enabling it via a new option is a good way to go.

With this, dsync --open-noatime should avoid updating atime on source files. And as before, dsync by default copies atime from source to destination files when it sets the destination timestamps, so the destination atime should match the source.
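For context on the permission error mentioned above: on Linux, `open` with `O_NOATIME` fails with `EPERM` when the caller neither owns the file nor is privileged. The sketch below shows one way a caller could cope with that, by retrying without the flag. This is an illustration only and is not the approach taken in the merged pull request, which makes the flag opt-in via --open-noatime; the helper name `open_readonly_prefer_noatime` is hypothetical.

```c
#define _GNU_SOURCE          /* needed for O_NOATIME, must precede includes */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: try to open read-only with O_NOATIME, and fall back
 * to a plain read-only open if the kernel refuses with EPERM (caller is not
 * the file owner). *used_noatime is set to 1 if the flag was accepted and
 * to 0 otherwise. Returns a file descriptor, or -1 with errno set. */
int open_readonly_prefer_noatime(const char *path, int *used_noatime)
{
    assert(path != NULL);
    assert(used_noatime != NULL);

    int fd = open(path, O_RDONLY | O_NOATIME);
    if (fd >= 0) {
        *used_noatime = 1;
        return fd;
    }

    *used_noatime = 0;
    if (errno == EPERM) {
        /* Not the owner: accept an atime update rather than failing. */
        return open(path, O_RDONLY);
    }

    /* Any other error (ENOENT, EACCES, ...) is passed through. */
    return -1;
}
```

A silent fallback like this trades predictability for convenience, since some files would have their atime updated and others not, which may be one reason an explicit option is the safer design.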

markmoe19 commented 1 year ago

--open-noatime option sounds great, thanks @adammoody !

adammoody commented 1 year ago

Great. I'll close this one out as resolved by https://github.com/hpc/mpifileutils/pull/561