hpc / mpifileutils

File utilities designed for scalability and performance.
https://hpc.github.io/mpifileutils
BSD 3-Clause "New" or "Revised" License

dsync: no even distribution of data on the Lustre OST's #584

Open BerndKrischok opened 1 week ago

BerndKrischok commented 1 week ago

Hello, we are trying to migrate data from an old Lustre filesystem to a new Lustre filesystem using dsync. After the migration, most of the data ends up on only 1 OST. A simple test running dsync on more than 2700 files confirmed this: 2600 files are located on 1 OST, and only 100 files are distributed over 8 different OSTs. In total, 40 OSTs are available in our Lustre filesystem. In contrast, if I use a standard cp command to copy the 2700 files, they are distributed evenly across all available OSTs. So I think there is a problem with using dsync on a Lustre filesystem.

Is this problem known?

Thanks. Best regards, Bernd
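A skew like the one reported above can be quantified with a small tally. The following is a minimal Python sketch, not part of mpifileutils or Lustre. It assumes the starting OST index of each file has already been collected by some other means (for example with `lfs getstripe -i`, one file at a time) into a plain dictionary. The function names and the sample data are hypothetical; the sample only mimics the shape of the reported test (2600 files on one OST, 100 files spread over 8 others).

```python
from collections import Counter


def ost_distribution(ost_index_by_file):
    """Count how many files start on each OST index."""
    return Counter(ost_index_by_file.values())


def busiest_ost_share(distribution):
    """Return (ost_index, fraction_of_files) for the most heavily used OST."""
    total = sum(distribution.values())
    if total == 0:
        raise ValueError("no files given")
    ost, file_count = distribution.most_common(1)[0]
    return ost, file_count / total


# Hypothetical sample data: 2600 files on OST 0, 100 files spread over OSTs 1..8.
sample = {f"file{i:04d}": 0 for i in range(2600)}
sample.update({f"file{2600 + i:04d}": 1 + i % 8 for i in range(100)})

dist = ost_distribution(sample)
ost, share = busiest_ost_share(dist)
print(f"OST {ost} holds {share:.1%} of {sum(dist.values())} files")
```

Running the same tally on the output of a cp-based copy and a dsync-based copy would make the difference between the two easy to compare.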

adilger commented 1 week ago

This depends on how dsync is setting the file layout for the target files, for which I don't know the details.

It seems possible that dsync is copying the layout xattr in a way that confuses Lustre into thinking that it is specifying a single target OST index for the new files, instead of specifying index "-1" for the new files. Alternatively, is it possible that there is a default layout on one of the parent directories that is forcing the files onto a particular OST? Is it OST0000 that is getting all of the new files?
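To illustrate the first hypothesis, here is a toy Python model of how a copied stripe_offset could pin every new file to one OST. This is only a conceptual sketch: `Layout`, `ToyAllocator`, and `copy_layout` are hypothetical names, the round-robin choice is a simplification rather than Lustre's real allocator, and it is an assumption of the example that every copied layout carries the same concrete offset. It does not show what dsync or Lustre actually do internally.

```python
from dataclasses import dataclass
from itertools import count

ANY_OST = -1  # stripe_offset value meaning "let the filesystem choose"


@dataclass
class Layout:
    stripe_count: int
    stripe_offset: int = ANY_OST


class ToyAllocator:
    """Simplified stand-in for OST selection (plain round-robin)."""

    def __init__(self, ost_count):
        self.ost_count = ost_count
        self._next = count()

    def first_ost(self, layout):
        # An explicit offset is honored as-is; only -1 lets the allocator pick.
        if layout.stripe_offset != ANY_OST:
            return layout.stripe_offset
        return next(self._next) % self.ost_count


def copy_layout(src, reset_offset):
    """Copy a layout, optionally resetting the starting OST to 'any'."""
    offset = ANY_OST if reset_offset else src.stripe_offset
    return Layout(src.stripe_count, offset)


alloc = ToyAllocator(ost_count=40)
# Assumption for illustration: every copied layout reports offset 0.
source_layouts = [Layout(stripe_count=1, stripe_offset=0) for _ in range(5)]

pinned = [alloc.first_ost(copy_layout(s, reset_offset=False)) for s in source_layouts]
spread = [alloc.first_ost(copy_layout(s, reset_offset=True)) for s in source_layouts]
print("offset copied verbatim:", pinned)
print("offset reset to -1:    ", spread)
```

In this model, copying the offset verbatim places all files on the same OST, while resetting it to -1 lets the allocator spread them out.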

Alternately it might be that Lustre itself is not using the other OSTs for some reason, like max_create_count=0, but I think this is unlikely if "cp" is not showing the same behavior.

I think there is an option to dsync to restripe the files on the target filesystem? If you set up a reasonable default PFL layout on the target filesystem root directory, and there are no explicit default layouts set on the target directory tree, then preserving the original file layout is less important, especially if your users/applications do not set their own striping, and if the OST count of the target filesystem is different.

adilger commented 1 week ago

PS: please update this issue if you figure out the cause, since this behaviour definitely should not be the default for dsync and should be fixed in dsync and/or Lustre.

adilger commented 1 week ago

It looks like this may be related to the Lustre issue https://jira.whamcloud.com/browse/LU-13062 that has a patch https://review.whamcloud.com/45252 "LU-13062 llite: return stripe_offset -1 in trusted.lov".

That patch has stalled out because of side-effects causing failures in a number of regression tests, but it probably needs more attention.

BerndKrischok commented 1 week ago

Just some more details. The stripe configuration of the parent target dir on lustre is:

lfs getstripe -d .
  lcm_layout_gen:    0
  lcm_mirror_count:  1
  lcm_entry_count:   3
    lcme_id:             N/A
    lcme_mirror_id:      N/A
    lcme_flags:          0
    lcme_extent.e_start: 0
    lcme_extent.e_end:   1073741824
      stripe_count:  1       stripe_size:   1048576       pattern:       raid0       stripe_offset: -1

    lcme_id:             N/A
    lcme_mirror_id:      N/A
    lcme_flags:          0
    lcme_extent.e_start: 1073741824
    lcme_extent.e_end:   4294967296
      stripe_count:  4       stripe_size:   1048576       pattern:       raid0       stripe_offset: -1

    lcme_id:             N/A
    lcme_mirror_id:      N/A
    lcme_flags:          0
    lcme_extent.e_start: 4294967296
    lcme_extent.e_end:   EOF
      stripe_count:  8       stripe_size:   4194304       pattern:       raid0       stripe_offset: -1

After the migration from the old filesystem to the new filesystem, most of the data is located on OST 0. So we set max_create_count=0 for OST 0 to prevent new files from being created there. In a new test, we copied lots of files from the old filesystem to the new filesystem using dsync or dcp, and the files are now all located on OST 8. Unfortunately, mpifileutils does not distribute the data across the OSTs: all files are created on only 1 OST.

Best regards, Bernd

adilger commented 1 week ago

@adammoody what mechanism is dsync using to copy the file layout on a Lustre filesystem? Is it just copying the whole lustre.lov/trusted.lov xattr, or is it using llapi functions to load/store the layout? I notice https://jira.whamcloud.com/browse/LU-16500 has a patch to llapi to reset the OST indexes when migrating a file with "lfs migrate", and I wonder if something similar can be done in dsync?

adilger commented 3 days ago

There is a small client-side patch https://review.whamcloud.com/45252 available that may address this issue. It would be helpful if someone could test this with dsync to see if it resolves the issue. The patch has no interoperability concerns and is only needed in the clients where dsync is running.

BerndKrischok commented 7 hours ago

Today we are testing the small client patch https://review.whamcloud.com/45252 on a small node partition. As a first result, it works fine with dsync. Thank you very much. The next step is to roll out the patched clients in a production partition.