Open cmharr opened 3 years ago
MDT space balancing is being improved in 2.15, but that release is not ready yet.
For existing filesystems, I would not recommend using DNE2 striped directories for all of the directories in the filesystem, as too many needless striped directories can cause performance issues. Striped directories should only be used for top-level directories (e.g. `.../home`, `.../scratch`, or `.../project`) to distribute subdirectories expected to have a large number of files beneath them, or for individual directories with 1M or more entries.
Using a DNE1 remote directory for each subtree is easier to manage if `dsync` is making this decision itself, rather than depending on the hash of the filename to pick a good MDT in a striped directory. Ideally, if `dsync` knows the size of each subtree, it would balance subtrees across MDTs based on how much free space each MDT has.
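As a rough illustration of that idea, the placement could be a greedy assignment: largest subtrees first, each onto whichever MDT currently has the most free space. This is a hypothetical sketch, not existing `dsync` code; the function name, paths, and sizes are all made up:

```python
def balance_subtrees(subtree_sizes, mdt_free):
    """Greedily assign subtrees to MDTs by remaining free space.
    subtree_sizes: {path: bytes}; mdt_free: {mdt_index: free bytes}.
    Returns {path: mdt_index}."""
    free = dict(mdt_free)  # working copy of per-MDT free space
    placement = {}
    # Place the biggest subtrees first so smaller ones can fill the gaps.
    for path, size in sorted(subtree_sizes.items(), key=lambda kv: -kv[1]):
        mdt = max(free, key=free.get)  # MDT with the most free space now
        placement[path] = mdt
        free[mdt] -= size
    return placement

# Example: two MDTs, MDT0000 fuller than MDT0001 (made-up numbers).
print(balance_subtrees({"/home/a": 500, "/home/b": 300, "/home/c": 200},
                       {0: 600, 1: 1000}))
# → {'/home/a': 1, '/home/b': 0, '/home/c': 1}
```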
However, in the absence of information about the size of the subtree, what 2.14 (in `lfs mkdir`) or 2.15 (internally for any `mkdir()`) does is select the MDT for each new directory by random weighted selection, using `(free_inodes>>8 * free_blocks>>16)` as the weight for each MDT, but staying on the same MDT if the current MDT's weight is above the mean of all MDTs.
This roughly balances MDT usage, and typically reduces usage of MDT0000, which is often the most full.
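A minimal sketch of that heuristic, approximating the behavior described above rather than the actual Lustre implementation:

```python
import random

def pick_mdt(current_mdt, free_inodes, free_blocks, rng=random):
    """Approximate sketch of the 2.15-style MDT selection heuristic:
    weight each MDT by (free_inodes >> 8) * (free_blocks >> 16); keep the
    current MDT if its weight is above the mean, otherwise pick one by
    weighted random choice (emptier MDTs are chosen more often)."""
    weights = [(fi >> 8) * (fb >> 16)
               for fi, fb in zip(free_inodes, free_blocks)]
    mean = sum(weights) / len(weights)
    if weights[current_mdt] > mean:
        return current_mdt  # current MDT is emptier than average: stay put
    # Weighted random selection across all MDTs.
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]
```

With two MDTs where MDT1 has 8x the free inodes and blocks of MDT0, a directory already on MDT1 stays there, while one on MDT0 is relocated with probability proportional to the weights.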
Andreas, thanks for correcting me. I meant DNE1, not DNE2, with `dsync` selecting the MDTs on which to locate the directories (round-robin, or a more intelligent weighted selection as you propose). I'm aware of the performance impacts on phase 2. In the `dsync` job I just finished, one user root directory had >49M subdirectories and ~250M files in it (though I don't know that any subdirectory has >1M entries). That was the case that prompted the desire to spread the load across MDTs for subdirectories.
It would be useful to have a mode for `dsync` to sync a full file system. The main enhancements would entail splitting the directories to sync into sub-jobs and taking advantage of Lustre DNE (Phase 1) to use multiple metadata (MD) servers. Optionally, Slurm integration would allow spawning of new jobs. The high-level order of operations is outlined below.
Thoughts: A new "analysis" stage could parse the `dwalk` output and break it up into chunks. One approach would be to simply spin off a `dsync` job for each root-level directory in the file system. While the sizes of those jobs would be unbalanced, `dsync` could size the number of CPUs per job accordingly. An advantage of this approach is its simplicity and the ability to have a log for each root-level directory (such as user directories). A different approach would be to split the `dwalk` output into more equitable chunks of data and spawn a job for each of those.

Once data is ready to be synced, it is important to stripe that data across all the MDTs on the destination file system.
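The "more equitable chunks" approach could be sketched as a greedy partition of the `dwalk` records, largest entries first into the least-loaded chunk. This is a hypothetical helper, not part of mpifileutils:

```python
def split_into_chunks(entries, num_chunks):
    """Split (path, size) records into num_chunks groups of roughly
    equal total size, assigning the largest entries first to whichever
    chunk currently has the smallest total."""
    chunks = [[] for _ in range(num_chunks)]
    totals = [0] * num_chunks
    for path, size in sorted(entries, key=lambda e: -e[1]):
        i = totals.index(min(totals))  # least-loaded chunk so far
        chunks[i].append(path)
        totals[i] += size
    return chunks, totals

# Example with made-up sizes: five entries split into two chunks.
chunks, totals = split_into_chunks(
    [("a", 10), ("b", 7), ("c", 5), ("d", 3), ("e", 1)], 2)
print(totals)  # → [13, 13]
```

Each resulting chunk would then become the input list for one spawned `dsync` job.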
`dsync` could run `lfs mdts` on the destination client to get a listing of MDTs (it may need the user to specify the filesystem device), and then round-robin the directory creation across those MDTs; the user could specify whether subdirectories should stay on the same MDT as their parents or land on a randomly assigned MDT. Once directories are precreated on the destination file system, the file creation and data/metadata sync could follow as usual.
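The precreation step might look like the following sketch: round-robin MDT assignment for top-level directories, with an option for subdirectories to inherit the parent's MDT. This is a hypothetical helper, not existing `dsync` code; each directory would then be created on its assigned MDT with `lfs mkdir -i <mdt>`:

```python
import itertools
import os.path

def assign_mdts(dirs, mdt_indices, inherit_from_parent=True):
    """Assign each destination directory an MDT index, round-robin.
    If inherit_from_parent is True, a subdirectory gets its parent's
    MDT; otherwise every directory continues the round-robin."""
    rr = itertools.cycle(mdt_indices)
    assignment = {}
    for d in sorted(dirs):  # parents sort before their children
        parent = os.path.dirname(d)
        if inherit_from_parent and parent in assignment:
            assignment[d] = assignment[parent]
        else:
            assignment[d] = next(rr)
    return assignment

# Example with made-up paths and two MDTs:
print(assign_mdts(["/dst/u1", "/dst/u1/sub", "/dst/u2"], [0, 1]))
# → {'/dst/u1': 0, '/dst/u1/sub': 0, '/dst/u2': 1}
```

With `inherit_from_parent=False`, the same input spreads the subdirectory onto the next MDT in the cycle instead of keeping it with its parent.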