open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org
Other
2.12k stars 856 forks source link

Corrupted data using parallel hdf5 #12718

Open tpadioleau opened 1 month ago

tpadioleau commented 1 month ago

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

v5.0.3

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Using spack 0.22.1

Please describe the system on which you are running


Details of the problem

I am using parallel hdf5 to write a 2D distributed array. If I pass a cartesian communicator to hdf5, I sometimes notice that the dataset in the hdf5 file is corrupted when using 3 processes. You can find attached (hdf5_reproducer.tar.gz) a small reproducer in C (< 100 LOC) with a hdf5 file I got running the reproducer. You will also find the result of the ompi_info command.

Without understanding the logic behind, I also noticed different situations where I seem to never get corrupted data:

Thank you, Thomas

edgargabriel commented 1 month ago

@tpadioleau thank you for the bug report, what file system is this on? I will have a look in the next few days, and a reproducer is definitely super helpful

edgargabriel commented 1 month ago

we do have an optimization in the code specifically for cartesian communicators, I am wondering whether something in that logic is slightly off for 3 process, which is a bit of an unusual number for cartesian communicators.

tpadioleau commented 1 month ago

@tpadioleau thank you for the bug report, what file system is this on? I will have a look in the next few days, and a reproducer is definitely super helpful

I have just edited the issue to add the missing archive. I am working on my laptop, no parallel filesystem. I can also mention that I was not able to reproduce the error on this supercomputer https://mesocentre.pages.centralesupelec.fr/user_doc/ruche/01_cluster_overview with Open MPI.

we do have an optimization in the code specifically for cartesian communicators, I am wondering whether something in that logic is slightly off for 3 process, which is a bit of an unusual number for cartesian communicators.

I could also try with 4 processes and it also gives corrupted results after a few attempts.

edgargabriel commented 1 month ago

I did some preliminary analysis of this issue, and I am not yet sure what to make of it. I ran the testcode with 3 processes on my local workstation using the romio component as a reference, and all relevant collective components of ompio for comparison. The output file is according to h5diff always identical, so either they are all wrong (including romio on Open MPI), or all correct. Please note, that they do not match the sample h5 file that was in the tar file, not sure whether that was supposed to be an example for the correct output, or for an erroneous one.

$mpirun --mca io romio341  -np 3 ./a.out
$mv distributed_array.h5 distributed_array.h5.romio
$mpirun --mca fcoll individual -np 3 ./a.out 
$mv distributed_array.h5 distributed_array.h5.individual
$ h5diff -v distributed_array.h5.individual distributed_array.h5.romio
file1     file2
---------------------------------------
    x      x    /              
    x      x    /distributed_array

group  : </> and </>
0 differences found
dataset: </distributed_array> and </distributed_array>
0 differences found

$mpirun --mca fcoll vulcan -np 3 ./a.out 
$ mv distributed_array.h5 distributed_array.h5.vulcan
$ h5diff -v distributed_array.h5.vulcan distributed_array.h5.romio
file1     file2
---------------------------------------
    x      x    /              
    x      x    /distributed_array

group  : </> and </>
0 differences found
dataset: </distributed_array> and </distributed_array>
0 differences found

$mpirun --mca fcoll dynamic_gen2 -np 3 ./a.out 
$ mv distributed_array.h5 distributed_array.h5.dynamic_gen2
$ h5diff -v distributed_array.h5.dynamic_gen2 distributed_array.h5.romio

file1     file2
---------------------------------------
    x      x    /              
    x      x    /distributed_array

group  : </> and </>
0 differences found
dataset: </distributed_array> and </distributed_array>
0 differences found
tpadioleau commented 1 month ago

Thank you for your time. The difficulty is that I also get the correct result from time to time. The sample h5 in the tar is an example of an erroneous result.

If you are interested, I can try to generate a docker image to get closer to my environment ?

tpadioleau commented 1 month ago

Here is an archive environment.zip that contains a Dockerfile that was generated (slightly modified to create a toto user) from a spack.yaml environment.

Inside the container and mounting the directory that contains the reproducer in /src, I build with cmake -B build -S /src && cmake --build build. Then repeating multiple times the commands rm -f distributed_array.h5 && mpirun -np 3 build/main && h5dump distributed_array.h5, I see that the content changes.

The expected result with 3 mpi processes should look like this

HDF5 "distributed_array.h5" {
GROUP "/" {
   DATASET "distributed_array" {
      DATATYPE  H5T_STD_I32LE
      DATASPACE  SIMPLE { ( 3, 5 ) / ( 3, 5 ) }
      DATA {
      (0,0): 1, 1, 1, 1, 1,
      (1,0): 2, 2, 2, 2, 2,
      (2,0): 3, 3, 3, 3, 3
      }
   }
}
}
edgargabriel commented 1 month ago

Ok, I can confirm that inside of the docker image I can reproduce the issue with the fcoll/vulcan component. Using the other fcoll components (i.e. individual, dynamic_gen2) produces the correct output. So the question is why is that occurring. I noticed that the precompiled Open MPI library in the docker image is configured quite differently than what I usually do, I am wondering whether one of them is contributing to this (e.g. --disable-builtin-atomics, --without-cma, --enable-mpi1-compatibility). I will look into this later this week.

edgargabriel commented 1 month ago

I know what is triggering the issue. I just need to decide whether an if-statement in the code is erroneous or whether I need to add some locking protection around a particular write operation. Both make the test pass reliably, but since the code that includes the if-statement in question was written many years ago, I don't remember all the details (which would be important to decide whether the if-statement is erroneous or not). Either way, it is a legitimate bug, not a fluke or configure option issue.

tpadioleau commented 1 month ago

That is good news, thank you!

edgargabriel commented 1 month ago

luckily the commit message from 5 years ago was helpful, the if-statement is correct in that it does what it was supposed to do.

tpadioleau commented 1 month ago

Do you know if the bug can affect other communication/write operations ?

edgargabriel commented 1 month ago

yes, it could, but it depends on the file system how likely it is. I will have a fix ready either later today or tomorrow, and I will backport it to both 5.0.x and 4.1.x series

edgargabriel commented 3 weeks ago

@tpadioleau I filed a pr that fixes the issue. I spent quite some time thinking about the issue and the various options, I am 99% sure that real application scenario will not hit this problem. Part of the reason why you saw this error is actually because the data volume is so small that it all ended up in a single file system block, which caused the inconsistency. In a real life application scenario with data volumes are not this tiny, I don't think this issue would have occurred. That being said, we still want to fix it, hence the PR.

tpadioleau commented 1 week ago

@tpadioleau I filed a pr that fixes the issue. I spent quite some time thinking about the issue and the various options, I am 99% sure that real application scenario will not hit this problem. Part of the reason why you saw this error is actually because the data volume is so small that it all ended up in a single file system block, which caused the inconsistency. In a real life application scenario with data volumes are not this tiny, I don't think this issue would have occurred. That being said, we still want to fix it, hence the PR.

You may be right, I only noticed this issue when developing on my laptop hence with small test cases. Thank you again for your time!