hpc / mpifileutils

File utilities designed for scalability and performance.
https://hpc.github.io/mpifileutils
BSD 3-Clause "New" or "Revised" License

Checksumming a list of files in parallel #321

Open shawnahall71 opened 4 years ago

shawnahall71 commented 4 years ago

Hi - is there a way I'm not thinking of to use the mpifileutils tools to md5sum a list of files in parallel? I recently needed to checksum a list of ~50 files on a single compute node and ended up using xargs, but it would have been nicer to use something like dparallel to checksum that list across multiple nodes. There's not much dparallel documentation, but from what I see, I'm thinking dparallel is intended for running the same thing in parallel on several systems, as opposed to applying the same task to a list of files that gets split up. Thanks!

adammoody commented 4 years ago

Hi @shawnahall71 , sorry for the slow reply. Yes, dparallel is intended to support something like that. It is experimental, though, because it requires a fork/exec to execute the command, and not all MPI libraries support fork/exec from an MPI process.

I'm not sure whether this particular use case would work, but I can take a look.

Did you already find a dparallel command that works?

shawnahall71 commented 4 years ago

Hi @adammoody - no worries - this wasn't a high priority issue. No, I never found the right syntax for a dparallel command to do something like this. I ended up using an xargs command like this one: https://stackoverflow.com/a/22822755 which gave me decent single-node parallelism. I'm just curious, for the next time I do something like this, whether there's a way to do it with an mpifileutils tool.
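For the record, the single-node approach can be sketched like this (a hedged example in the spirit of the linked answer, not the exact command from it; the file names and worker count are made up):

```shell
# Create a couple of sample files and a file list (names are hypothetical).
echo hello > file1.txt
echo world > file2.txt
printf '%s\n' file1.txt file2.txt > filelist.txt

# Checksum every file in the list, running up to 8 md5sum processes
# in parallel on this node (GNU xargs; -a reads arguments from a file).
xargs -a filelist.txt -n 1 -P 8 md5sum > checksums.md5

cat checksums.md5
```

Note that -P only parallelizes across cores on one machine, which is exactly the limitation that makes a node-spanning tool like dparallel attractive.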

FWIW, we generally use Intel MPI. We'll occasionally get requests to move data on behalf of our users, and when I can I like to use dcp for the copy. It's also common to be asked to checksum the data.

adammoody commented 4 years ago

The dparallel code looks to be incomplete for this kind of operation currently, though that's what it is intended for.

However, dfind supports a limited syntax for an exec-style operation. I was able to get close with this command:

dfind path/to/walk --type f --exec md5sum {} \;

The argument parsing currently segfaults if it fails to find the trailing ";", so the user interface could be improved in that regard.

Another problem is that it outputs status messages during the walk that will be prepended to the output from the command you are running.

[2020-02-20T12:28:01] Walking /path/to/walk
[2020-02-20T12:28:01] Walked 32 items in 0.035808 secs (893.664793 items/sec) ...
[2020-02-20T12:28:01] Walked 32 items in 0.035910 seconds (891.107550 items/sec)
a0a70868c128191b53d59a1b3ea1c614  /path/to/walk/foo.txt
18347b4442e56de09ca1530bd1e1ad18  /path/to/walk/bar.txt

However, it did seem to work.

shawnahall71 commented 4 years ago

That's still a pretty good alternative - didn't think of that. Thanks!

adammoody commented 4 years ago

As for checking data after a copy, our team also runs dcmp on the source and destination. This compares bytes between the source and destination files and reports any differences. It's a good check to verify that the copy succeeded.
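The byte-level check dcmp performs can be illustrated with a serial single-node sketch using cmp (this is only an analogy to show the idea, not dcmp itself, and the directory names are made up; dcmp does this comparison in parallel across MPI ranks):

```shell
# Set up a tiny source/destination pair to compare (hypothetical paths).
mkdir -p src dest
echo data > src/a.txt
cp src/a.txt dest/a.txt

# Byte-compare each source file against its counterpart in dest,
# reporting any files whose contents differ.
for f in src/*; do
  if cmp -s "$f" "dest/${f#src/}"; then
    echo "match: $f"
  else
    echo "DIFFER: $f"
  fi
done
```

As noted below, the trade-off is that a byte-for-byte comparison re-reads all of the data on both sides.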

shawnahall71 commented 4 years ago

True. While I wish we used mpifileutils more for "internal" transfers, several of the times I've used the tools have been for more geographically distant transfers, where a dcmp would mean a second read of the data, so I've used dcmp -l in those cases. Plus those cases often incur a financial cost for a second read (e.g. cloud egress costs).