hpc / mpifileutils

File utilities designed for scalability and performance.
https://hpc.github.io/mpifileutils
BSD 3-Clause "New" or "Revised" License

Lustre file system sync option to dsync #495

Open cmharr opened 3 years ago

cmharr commented 3 years ago

It would be useful to have a mode for dsync to sync a full file system. The main enhancements would entail splitting up directories to sync into sub-jobs and taking advantage of Lustre DNE (Ph 1) to use multiple metadata (MD) servers. Optionally, Slurm integration would allow spawning of new jobs.

The high-level order of operations would follow:

Thoughts: A new "analysis" stage could parse the dwalk output and break it up into chunks. One approach would be to simply spin off a dsync job for each root-level directory in the file system. While those jobs would be unevenly sized, dsync could scale the number of CPUs per job accordingly. Advantages of this approach would be simplicity and the ability to keep a log for each root-level directory (such as user directories). A different approach would be to split the dwalk output into more equitable chunks of data and spawn a job for each of those.
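The second approach (equitable chunks) could be sketched as a simple greedy partition of the dwalk listing. This is a hypothetical illustration, not existing dsync code; the (path, bytes) tuples and chunk count stand in for whatever the real analysis stage would parse out of the dwalk output:

```python
# Sketch: split a dwalk-style listing into roughly equal-sized chunks,
# one per dsync job. Greedy "largest entry to smallest chunk" gives a
# rough size balance without any Lustre-specific knowledge.

def split_into_chunks(entries, num_chunks):
    """entries: iterable of (path, size_bytes); returns a list of
    {"paths": [...], "bytes": total} dicts, one per planned job."""
    chunks = [{"paths": [], "bytes": 0} for _ in range(num_chunks)]
    # Placing the largest entries first improves the greedy balance.
    for path, size in sorted(entries, key=lambda e: e[1], reverse=True):
        target = min(chunks, key=lambda c: c["bytes"])
        target["paths"].append(path)
        target["bytes"] += size
    return chunks

# Example: four top-level directories split across two jobs.
entries = [("/fs/home/u1", 500), ("/fs/home/u2", 300),
           ("/fs/scratch/u3", 200), ("/fs/project/p1", 100)]
chunks = split_into_chunks(entries, 2)
```

Each resulting chunk could then be handed to its own dsync invocation (e.g. via a Slurm job per chunk, as proposed above).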

Once data is ready to be synced, it is important to stripe that data across all the MDTs on the destination file system. dsync could run lfs mdts on the destination client to get a listing of MDTs (it may need the user to specify the file system device), and then round-robin the directory creation across those MDTs; the user could specify whether subdirectories should stay on the same MDT as their parents or land on a randomly assigned MDT.
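The round-robin placement with an optional "inherit parent's MDT" mode could look roughly like the following. This is a hypothetical sketch: the MDT indices would in practice come from lfs mdts on the destination, and the function names are made up for illustration:

```python
# Sketch: round-robin directories across MDTs during precreation.
# inherit=True models the proposed option to keep subdirectories on
# the same MDT as their parent instead of continuing the rotation.
import itertools
import os

def assign_mdts(dirs, mdt_indices, inherit=False):
    """Return {directory: mdt_index}. Assumes parent directories
    appear before their children once the list is sorted."""
    rr = itertools.cycle(mdt_indices)
    placement = {}
    for d in sorted(dirs):
        parent = os.path.dirname(d)
        if inherit and parent in placement:
            placement[d] = placement[parent]   # stay on parent's MDT
        else:
            placement[d] = next(rr)            # next MDT in rotation
    return placement

# Example: /dst/a/x inherits /dst/a's MDT when inherit=True.
plan = assign_mdts(["/dst/a", "/dst/a/x", "/dst/b"], [0, 1, 2], inherit=True)
```

Actual directory creation on a chosen MDT could then use lfs mkdir with an MDT index on the destination, before the normal dsync pass fills in files and metadata.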

Once directories are precreated on the destination file system, the file creation and data/metadata sync could follow as usual.

adilger commented 3 years ago

MDT space balancing is being improved in 2.15, but that release is not ready yet.

For existing filesystems, I would not recommend using DNE2 striped directories for all of the directories in the filesystem, as too many needless striped directories can cause performance issues. Striped directories should only be used for top-level directories (e.g. .../home, .../scratch, or .../project) to distribute subdirectories expected to have a large number of files beneath them, or for individual directories with 1M or more entries.

Using a DNE1 remote directory for each subtree is easier to manage if dsync is making this decision itself, rather than depending on the hash of the filename to pick a good MDT in a striped directory. Ideally, if dsync knew the size of each subtree, it would balance the subtrees across MDTs based on how much free space each MDT has.
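That "balance subtrees by free space" idea could be sketched as a greedy assignment. This is a hypothetical helper, not an existing dsync feature; subtree sizes would come from the dwalk output and per-MDT free space from the destination filesystem:

```python
# Sketch: place each subtree on the MDT with the most remaining
# headroom, largest subtrees first, so MDT free space stays balanced.

def balance_subtrees(subtree_sizes, mdt_free):
    """subtree_sizes: {path: bytes}; mdt_free: {mdt_index: free_bytes}.
    Returns {path: mdt_index}."""
    free = dict(mdt_free)
    placement = {}
    for path, size in sorted(subtree_sizes.items(),
                             key=lambda kv: kv[1], reverse=True):
        mdt = max(free, key=free.get)   # MDT with most free space left
        placement[path] = mdt
        free[mdt] -= size               # account for the new subtree
    return placement
```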

However, in the absence of information about the size of the subtree, what 2.14 (in lfs mkdir) or 2.15 (internally for any mkdir()) does is select the MDT for each new directory by weighted random selection, using (free_inodes>>8 * free_blocks>>16) as the weight for each MDT, but staying on the same MDT if its current weight is above the mean of all MDTs.
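As an illustration of that policy (not Lustre's actual implementation), the selection logic described above amounts to something like:

```python
# Sketch: weight each MDT by (free_inodes >> 8) * (free_blocks >> 16),
# keep the current MDT if its weight is at or above the mean, otherwise
# pick an MDT at random with probability proportional to its weight.
import random

def pick_mdt(free_inodes, free_blocks, current, rng=random.random):
    weights = [(i >> 8) * (b >> 16)
               for i, b in zip(free_inodes, free_blocks)]
    mean = sum(weights) / len(weights)
    if weights[current] >= mean:
        return current                  # current MDT still has room
    # weighted random selection across all MDTs
    pivot = rng() * sum(weights)
    acc = 0
    for idx, w in enumerate(weights):
        acc += w
        if pivot < acc:
            return idx
    return len(weights) - 1
```

An emptier MDT gets a larger weight, so new directories drift toward it; the "stay if above mean" check avoids needlessly migrating away from an MDT that is already well off.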

This roughly balances MDT usage, and typically reduces usage of MDT0000, which is often the most full.

cmharr commented 3 years ago

Andreas, thanks for correcting me. I meant DNE1, not DNE2, with dsync selecting the MDTs on which to locate the directories (round robin, or a more intelligent weighted selection as you propose). I'm aware of the performance impacts of DNE phase 2. In the dsync job I just finished, one user root directory had >49M subdirectories and ~250M files in it (though I don't know that any subdirectory has >1M entries). That was the case that prompted the desire to spread the load across MDTs for subdirectories.