hpc / mpifileutils

File utilities designed for scalability and performance.
https://hpc.github.io/mpifileutils
BSD 3-Clause "New" or "Revised" License
169 stars 68 forks

dbz2 multinode fault? #579

Open samlor opened 3 months ago

samlor commented 3 months ago

My tests indicate that, for all but very small files of only a few MB, dbz2 works in parallel on a single node but persistently fails when distributed over multiple nodes in an HPC cluster: sometimes it fails with errors, and sometimes, worse, the output after decompression does not match the original file. Is this a documented/known limitation, or am I doing something wrong?

--- Session transcript ---

$ uname -a
Linux sms 4.18.0-513.11.1.el8_9.0.1.x86_64 #1 SMP Sun Feb 11 10:42:18 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
$ cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w 79 | head -1000000 > 80M.txt
$ mpirun -np 4 ~/mpifileutils-v0.11.1/install/bin/dbz2 --compress --keep 80M.txt
$ mv 80M.txt.dbz2 80M1n4p.txt.dbz2
$ mpirun -np 4 -H c1:2,c2:2 ~/mpifileutils-v0.11.1/install/bin/dbz2 --decompress --keep 80M1n4p.txt.dbz2
$ cmp 80M.txt 80M1n4p.txt
$ mpirun -np 4 -H c1:2,c2:2 ~/mpifileutils-v0.11.1/install/bin/dbz2 --compress --keep 80M.txt
$ mv 80M.txt.dbz2 80M2n4p.txt.dbz2
$ mpirun -np 4 -H c1:2,c2:2 ~/mpifileutils-v0.11.1/install/bin/dbz2 --decompress --keep 80M2n4p.txt.dbz2
[2024-07-31T11:29:40] [0] [/home/slr/mpifileutils-v0.11.1/mpifileutils/src/common/mfu_bz2_static.c:596] ERROR: Error in decompression
[2024-07-31T11:29:40] [1] [/home/slr/mpifileutils-v0.11.1/mpifileutils/src/common/mfu_bz2_static.c:596] ERROR: Error in decompression
$ ~/mpifileutils-v0.11.1/install/bin/dbz2 --decompress --keep 80M2n4p.txt.dbz2
[2024-07-31T11:34:07] [0] [/home/slr/mpifileutils-v0.11.1/mpifileutils/src/common/mfu_bz2_static.c:596] ERROR: Error in decompression
$

gonsie commented 3 months ago

ping @adammoody

adammoody commented 2 months ago

Thanks for the report, @samlor . dbz2 writes to a single shared file from multiple processes. For correctness, it requires a POSIX-compliant parallel file system like Lustre or IBM's Spectrum Scale. In particular, many NFS file systems are not POSIX-compliant.

Do you know the type of the backing file system where the compressed file is being written here?

Do you have a POSIX-compliant file system that you can try as a test?

samlor commented 2 months ago

G'day Adam,

Oh, thank you for responding.

Yes, it's on a test system with just NFS exported XFS.

Don't know about POSIX compliance, but I presume ordinary XFS is not even a distributed/parallel file system.
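For the record, one quick way to see what file system backs a given directory (GNU coreutils; the path below is illustrative — point it at wherever dbz2 writes its output):

```shell
# Print the file system type backing a path (e.g. nfs, xfs, lustre).
stat -f -c %T /tmp

# df -T shows the same type plus the mount source.
df -T /tmp
```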

If this is the issue, then I can try it on Lustre to confirm, thanks.

Cheers, Sam
