hpc / mpifileutils

File utilities designed for scalability and performance.
https://hpc.github.io/mpifileutils
BSD 3-Clause "New" or "Revised" License

dsync generates MPI error when I'm not the owner of the source path #550

Open Aelmazaty opened 1 year ago

Aelmazaty commented 1 year ago

Hello,

I've installed mpifileutils version 0.11.1 using Spack. I always get an MPI error when I am not the owner of the source file or directory, even though I have at least read permission. The files are copied, but the error is still generated. This is annoying because when the command is submitted as an LSF or SLURM job, the job is marked as failed. No errors are generated if I am the owner of the source.

Example:

[aelmazaty@codon-dm-06 lsf-hx-wp]# ls -l /hps/scratch/sysinf/power_usage
-rw-r--r-- 1 root root 17035 Sep 5 2022 /hps/scratch/sysinf/power_usage

[aelmazaty@codon-dm-06 lsf-hx-wp]# mpirun -np 4 dsync -v --progress 1 /hps/scratch/sysinf/power_usage /hps/scratch/sysinf/aelmazaty/
[2023-06-13T16:01:14] Walking source path
[2023-06-13T16:01:14] Walking /hps/scratch/sysinf/power_usage
[2023-06-13T16:01:14] Walked 1 items in 0.001 secs (882.196 items/sec) ...
[2023-06-13T16:01:14] Walked 1 items in 0.001 seconds (818.132 items/sec)
[2023-06-13T16:01:14] Walking destination path
[2023-06-13T16:01:14] Walking /hps/scratch/sysinf/aelmazaty
[2023-06-13T16:01:14] Walked 1 items in 0.002 secs (617.520 items/sec) ...
[2023-06-13T16:01:14] Walked 1 items in 0.002 seconds (606.374 items/sec)
[2023-06-13T16:01:14] Comparing file sizes and modification times of 1 items
[2023-06-13T16:01:14] Started : Jun-13-2023, 16:01:14
[2023-06-13T16:01:14] Completed : Jun-13-2023, 16:01:14
[2023-06-13T16:01:14] Seconds : 0.000
[2023-06-13T16:01:14] Items : 1
[2023-06-13T16:01:14] Item Rate : 1 items in 0.000158 seconds (6310.263012 items/sec)
[2023-06-13T16:01:14] Updating timestamps on newly copied files

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[11234,1],0]
Exit code: 1

The file is copied; however, an error is still generated.

When I try with a file I own:

[aelmazaty@codon-dm-06 lsf-hx-wp]# ls -l /hps/scratch/sysinf/power_usage_aelmazaty
-rw-r--r-- 1 aelmazaty systems 17035 Jun 13 13:54 /hps/scratch/sysinf/power_usage_aelmazaty

[aelmazaty@codon-dm-06 lsf-hx-wp]# mpirun -np 4 dsync -v --progress 1 /hps/scratch/sysinf/power_usage_aelmazaty /hps/scratch/sysinf/aelmazaty/
[2023-06-13T16:02:17] Walking source path
[2023-06-13T16:02:17] Walking /hps/scratch/sysinf/power_usage_aelmazaty
[2023-06-13T16:02:17] Walked 1 items in 0.001 secs (872.339 items/sec) ...
[2023-06-13T16:02:17] Walked 1 items in 0.001 seconds (804.228 items/sec)
[2023-06-13T16:02:17] Walking destination path
[2023-06-13T16:02:17] Walking /hps/scratch/sysinf/aelmazaty
[2023-06-13T16:02:17] Walked 1 items in 0.000 secs (2210.726 items/sec) ...
[2023-06-13T16:02:17] Walked 1 items in 0.000 seconds (2045.349 items/sec)
[2023-06-13T16:02:17] Comparing file sizes and modification times of 1 items
[2023-06-13T16:02:17] Started : Jun-13-2023, 16:02:17
[2023-06-13T16:02:17] Completed : Jun-13-2023, 16:02:17
[2023-06-13T16:02:17] Seconds : 0.000
[2023-06-13T16:02:17] Items : 1
[2023-06-13T16:02:17] Item Rate : 1 items in 0.000162 seconds (6177.720668 items/sec)
[2023-06-13T16:02:17] Deleting items from destination
[2023-06-13T16:02:17] Removing 1 items
[2023-06-13T16:02:17] Removed 1 items in 0.003 seconds (327.228 items/sec)
[2023-06-13T16:02:17] Copying items to destination
[2023-06-13T16:02:17] Copying to /hps/scratch/sysinf/aelmazaty
[2023-06-13T16:02:17] Items: 1
[2023-06-13T16:02:17] Directories: 0
[2023-06-13T16:02:17] Files: 1
[2023-06-13T16:02:17] Links: 0
[2023-06-13T16:02:17] Data: 16.636 KiB (16.636 KiB per file)
[2023-06-13T16:02:17] Creating 1 files.
[2023-06-13T16:02:17] Copying data.
[2023-06-13T16:02:17] Copy data: 16.636 KiB (17035 bytes)
[2023-06-13T16:02:17] Copy rate: 1.207 MiB/s (17035 bytes in 0.013 seconds)
[2023-06-13T16:02:17] Syncing data to disk.
[2023-06-13T16:02:17] Sync completed in 0.020 seconds.
[2023-06-13T16:02:17] Setting ownership, permissions, and timestamps.
[2023-06-13T16:02:17] Updated 1 items in 0.003 seconds (298.208 items/sec)
[2023-06-13T16:02:17] Syncing directory updates to disk.
[2023-06-13T16:02:17] Sync completed in 0.001 seconds.
[2023-06-13T16:02:17] Started: Jun-13-2023,16:02:17
[2023-06-13T16:02:17] Completed: Jun-13-2023,16:02:17
[2023-06-13T16:02:17] Seconds: 0.043
[2023-06-13T16:02:17] Items: 1
[2023-06-13T16:02:17] Directories: 0
[2023-06-13T16:02:17] Files: 1
[2023-06-13T16:02:17] Links: 0
[2023-06-13T16:02:17] Data: 16.636 KiB (17035 bytes)
[2023-06-13T16:02:17] Rate: 391.203 KiB/s (17035 bytes in 0.043 seconds)
[2023-06-13T16:02:17] Updating timestamps on newly copied files

It works normally without getting any errors.

I tried different Open MPI versions, all installed via Spack; the latest is 4.1.5. I get the same error with all of them.
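Until the cause is understood, a job script could tell a genuinely failed copy from this exit status by comparing the data after the run. The following is only a sketch of that idea, not a fix: `run_and_verify` is a hypothetical helper that is not part of mpifileutils, `SRC`, `DST` and `DST_FILE` are placeholder variables, and it only handles a single regular file, not a directory tree.

```shell
#!/bin/sh
# Hypothetical helper (not part of mpifileutils): run a copy command and, if it
# exits non-zero, accept the result only when the destination file is
# byte-identical to the source. Handles one regular file, not directory trees.
run_and_verify() {
    src=$1
    dst=$2
    shift 2

    rc=0
    "$@" || rc=$?    # run the copy command and remember its exit status

    if [ "$rc" -ne 0 ] && cmp -s "$src" "$dst"; then
        echo "warning: command exited with $rc, but $dst matches $src" >&2
        return 0
    fi
    return "$rc"
}

# Example call (SRC, DST and DST_FILE are placeholders; DST_FILE is wherever
# the copied file ends up):
# run_and_verify "$SRC" "$DST_FILE" mpirun -np 4 dsync "$SRC" "$DST"
```

Because this masks any non-zero exit status whenever the bytes match, it would also hide real metadata failures (ownership, permissions, timestamps), so it is only suitable where the file contents are all that matters.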

Is this a known issue? How can I avoid these errors?

Best regards,
Ahmed