hpc / mpifileutils

File utilities designed for scalability and performance.
https://hpc.github.io/mpifileutils
BSD 3-Clause "New" or "Revised" License

Slow sync() process when memory buff/cache is high. #533

Closed. hwleong closed this issue 1 year ago.

hwleong commented 1 year ago

I noticed that when memory buff/cache is high (as seen in the "free" command output), the sync() step takes a long time. This shows up in both mfu_sync_all("Syncing data to disk.") and mfu_sync_all("Syncing directory updates to disk."). It happens even when dsync copies zero files. In the example below, the two syncs take about 25 and 26 seconds:

[2022-08-10T23:23:16] Walking source path
[2022-08-10T23:23:16] Walking /path/to/source
[2022-08-10T23:23:16] Walked 3 items in 0.013 secs (223.431 items/sec) ...
[2022-08-10T23:23:16] Walked 3 items in 0.016 seconds (186.631 items/sec)
[2022-08-10T23:23:16] Walking destination path
[2022-08-10T23:23:16] Walking /path/to/destination
[2022-08-10T23:23:16] Walked 1 items in 0.006 secs (156.974 items/sec) ...
[2022-08-10T23:23:16] Walked 1 items in 0.008 seconds (118.084 items/sec)
[2022-08-10T23:23:16] Started   : Aug-10-2022, 23:23:16
[2022-08-10T23:23:16] Completed : Aug-10-2022, 23:23:16
[2022-08-10T23:23:16] Seconds   : 0.002
[2022-08-10T23:23:16] Items     : 0
[2022-08-10T23:23:16] Item Rate : 0 items in 0.001805 seconds (0.000000 items/sec)
[2022-08-10T23:23:16] Copying items to destination
[2022-08-10T23:23:16] Copying to /path/to/destination
[2022-08-10T23:23:16] Items: 2
[2022-08-10T23:23:16]   Directories: 2
[2022-08-10T23:23:16]   Files: 0
[2022-08-10T23:23:16]   Links: 0
[2022-08-10T23:23:16] Data: 0.000 B (0.000 B per file)
[2022-08-10T23:23:16] Creating 2 directories
[2022-08-10T23:23:16] Copying data.
[2022-08-10T23:23:16] Copy data: 0.000 B (0 bytes)
[2022-08-10T23:23:16] Copy rate: 0.000 B/s (0 bytes in 0.165 seconds)
[2022-08-10T23:23:16] Syncing data to disk.
[2022-08-10T23:23:41] Sync completed in 25.435 seconds.
[2022-08-10T23:23:41] Setting ownership, permissions, and timestamps.
[2022-08-10T23:23:41] Updated 2 items in 0.005 seconds (409.988 items/sec)
[2022-08-10T23:23:41] Syncing directory updates to disk.
[2022-08-10T23:24:07] Sync completed in 26.117 seconds.
[2022-08-10T23:24:07] Started: Aug-10-2022,23:23:16
[2022-08-10T23:24:07] Completed: Aug-10-2022,23:24:07
[2022-08-10T23:24:07] Seconds: 51.733
[2022-08-10T23:24:07] Items: 2
[2022-08-10T23:24:07]   Directories: 2
[2022-08-10T23:24:07]   Files: 0
[2022-08-10T23:24:07]   Links: 0
[2022-08-10T23:24:07] Data: 0.000 B (0 bytes)
[2022-08-10T23:24:07] Rate: 0.000 B/s (000 bytes in 51.733 seconds)
[2022-08-10T23:24:07] Updating timestamps on newly copied files
[2022-08-10T23:24:07] Completed updating timestamps
[2022-08-10T23:24:07] Completed sync

To work around this, I ended up manually running "sync; echo 3 > /proc/sys/vm/drop_caches" to clear the cache. After that, the sync() step completes very quickly: just a few seconds, even for over a million files. Of course, this requires root, which a normal user generally does not have.

Is this a known issue, or is there some tuning that can be applied?

adammoody commented 1 year ago

Thanks for the report and the timing info, @hwleong.

The slowdown happens because sync() flushes buffered data from cache for all files on all file systems. From man 2 sync:

sync() causes all pending modifications to filesystem metadata and cached file data to be written to the underlying filesystems.

With a full buffer cache, this could take a long time, especially if data needs to be flushed to slow file systems.
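It can help to see how much work a given sync() is likely to face. Clean cached pages, which make up most of "buff/cache" in free, do not need to be written. What sync() must write back is the dirty portion, which Linux reports on the "Dirty:" line of /proc/meminfo.

The following is a minimal, Linux-only sketch for reading that value. The helper name dirty_kb() is hypothetical and not part of mpifileutils. A network file system client such as Lustre may also hold state that this number does not capture, so treat it as a rough indicator only.

```c
#include <stdio.h>

/* Hypothetical helper: return the "Dirty:" value from /proc/meminfo in kB,
 * i.e. page-cache data modified but not yet written back, or -1 on failure. */
long dirty_kb(void)
{
    FILE* f = fopen("/proc/meminfo", "r");
    if (f == NULL) {
        return -1;
    }

    char line[256];
    long kb = -1;
    while (fgets(line, sizeof line, f) != NULL) {
        /* Lines look like "Dirty:            1234 kB"; other lines fail
         * to match the literal prefix and leave kb untouched. */
        if (sscanf(line, "Dirty: %ld kB", &kb) == 1) {
            break;
        }
    }

    fclose(f);
    return kb;
}
```

Printing this value just before the "Syncing data to disk." step would show whether the slow cases coincide with a large amount of dirty data.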

Calling sync() here is probably overkill. When using dsync, we really just need to flush data for the files it copies, or at least we could limit the flush to the destination file system(s).
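On Linux, one way to limit the flush to the destination file system is syncfs(2). It takes a file descriptor and flushes only the file system that contains it.

The sketch below assumes Linux with glibc 2.14 or later. The name flush_fs_of() is a hypothetical helper, not an mpifileutils function.

```c
#define _GNU_SOURCE          /* syncfs() is a GNU/Linux extension */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: flush only the file system holding `path`
 * (e.g. the destination root directory). Returns 0 on success, -1 on error. */
int flush_fs_of(const char* path)
{
    /* Any descriptor on the target file system will do; a read-only
     * open of a directory is enough. */
    int fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd < 0) {
        perror("open");
        return -1;
    }

    /* Unlike sync(), which flushes every mounted file system,
     * syncfs() flushes only the file system containing fd. */
    int rc = syncfs(fd);
    if (rc != 0) {
        perror("syncfs");
    }

    close(fd);
    return rc;
}
```

syncfs() still writes back all dirty data on that file system, including data from other processes. It would therefore help most when the slow flushing is happening on some other mount. A copy that spans several destination file systems would need one call per file system.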

There are a couple of reasons why it may have been added. First, when copying many files, a single call to sync() may be faster than calling fsync() individually on each file. Second, it may be a workaround for an old problem with Lustre, based on this commit: https://github.com/hpc/mpifileutils/commit/d1ebddb0707b269aa9d51259101175d39d837f90
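For comparison, the per-file alternative looks roughly like this. Open each copied path and fsync(2) it. Then fsync its parent directory, so that the newly created directory entry is also durable.

This is a minimal POSIX sketch. The names fsync_copied() and fsync_path() are hypothetical, and the list of copied paths is assumed to come from the copy step.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <libgen.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: fsync one path (regular file or directory). */
static int fsync_path(const char* path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror(path);
        return -1;
    }
    int rc = fsync(fd);
    if (rc != 0) {
        perror(path);
    }
    close(fd);
    return rc;
}

/* Hypothetical helper: flush each copied file and its parent directory.
 * Returns the number of paths that could not be fully flushed. */
int fsync_copied(const char* const* paths, size_t count)
{
    int failures = 0;
    for (size_t i = 0; i < count; i++) {
        /* flush the file's data and metadata */
        if (fsync_path(paths[i]) != 0) {
            failures++;
            continue;
        }

        /* flush the parent directory so the new entry is durable;
         * dirname() may modify its argument, so work on a copy */
        char* copy = strdup(paths[i]);
        if (copy == NULL) {
            failures++;
            continue;
        }
        if (fsync_path(dirname(copy)) != 0) {
            failures++;
        }
        free(copy);
    }
    return failures;
}
```

Each file costs an open, an fsync, and a close. This simple version also re-flushes a shared parent directory once per file. With many small files, that kind of overhead can make a single sync() cheaper, so a real implementation would at least deduplicate parent directories. In a parallel tool, each rank would presumably flush only the files it wrote.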

As a future fix, we'd need to investigate whether sync can be avoided altogether, and remove it if it is not needed. Another option would be to define a command-line option to let a user disable the calls to sync.
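A command-line opt-out could be as simple as a flag that gates the sync step. The sketch below uses getopt_long(3).

The --no-sync option, g_do_sync, parse_sync_opts(), and maybe_sync() are hypothetical illustrations, not existing dsync options or functions. maybe_sync() merely stands in for the real sync step.

```c
#define _GNU_SOURCE
#include <getopt.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical setting: sync stays on unless the user opts out. */
static bool g_do_sync = true;

/* Hypothetical stand-in for the tool's sync step. */
static void maybe_sync(const char* msg)
{
    if (!g_do_sync) {
        return;                 /* user asked to skip the flush */
    }
    printf("%s\n", msg);
    sync();                     /* flushes all file systems, as today */
}

/* Parse argv for the hypothetical --no-sync flag.
 * Returns 0 on success, -1 on an unrecognized option. */
int parse_sync_opts(int argc, char** argv)
{
    static const struct option long_opts[] = {
        {"no-sync", no_argument, NULL, 'S'},
        {NULL, 0, NULL, 0}
    };

    int c;
    while ((c = getopt_long(argc, argv, "", long_opts, NULL)) != -1) {
        switch (c) {
        case 'S':
            g_do_sync = false;
            break;
        default:
            return -1;
        }
    }
    return 0;
}
```

Keeping sync enabled by default would preserve the current behavior for users who rely on the data being on disk when the tool exits.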

For now, we don't have any way to disable the sync call. As a short-term workaround, you could comment out those calls and rebuild.