open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

openMPI 4.1.4 + UCX 1.13.1 OMPIO default module hangs writing shared file on Lustre #11182

Closed denisbertini closed 11 months ago

denisbertini commented 1 year ago

Background information

The default OMPIO I/O module hangs when writing a single shared file from multiple MPI processes on a Lustre filesystem.

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v4.1.4, built against UCX 1.13.1 (installed via Spack, see the configure line below).

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Installation (spack) detailed:

 Configure command line: '--prefix=/cvmfs/vae.gsi.de/centos7/virgo2/spack-0.19/opt/linux-centos7-x86_64/gcc-10.2.0/openmpi-4.1.4-rgadj3kn2z7x7ds6vyudnveayjodjpbo' '--enable-shared' '--disable-silent-rules' '--disable-builtin-atomics' '--without-pmi' '--enable-static' '--enable-mpi1-compatibility' '--without-mxm' '--without-ofi' '--without-xpmem' '--without-cma' '--without-hcoll' '--without-psm2' '--without-knem' '--without-psm' '--without-verbs' '--without-fca' '--with-ucx=/cvmfs/vae.gsi.de/centos7/virgo2/spack-0.19/opt/linux-centos7-x86_64/gcc-10.2.0/ucx-1.13.1-u4x64gmk2nyjed2hcwdsnv2aot7xeb2i' '--without-cray-xpmem' '--with-slurm' '--without-loadleveler' '--without-lsf' '--without-tm' '--without-alps' '--without-sge' '--disable-memchecker' '--with-libevent=/cvmfs/vae.gsi.de/centos7/virgo2/spack-0.19/opt/linux-centos7-x86_64/gcc-10.2.0/libevent-2.1.12-oiclaisi7j7kc7q44gg6qzj2yywgoms7' '--with-pmix=/cvmfs/vae.gsi.de/centos7/virgo2/spack-0.19/opt/linux-centos7-x86_64/gcc-10.2.0/pmix-3.2.2-4y2y2ddhizm5npngzwedmnty4lwcl432' '--with-zlib=/cvmfs/vae.gsi.de/centos7/virgo2/spack-0.19/opt/linux-centos7-x86_64/gcc-10.2.0/zlib-1.2.13-a4nwltikfmq7a524a7kcjof7tuarv7e5' '--with-hwloc=/cvmfs/vae.gsi.de/centos7/virgo2/spack-0.19/opt/linux-centos7-x86_64/gcc-10.2.0/hwloc-2.8.0-mrg5pqbsn7r2escs5cl7bxdyoehoxbu2' '--disable-java' '--disable-mpi-java' '--with-gpfs=no' '--without-cuda' '--enable-wrapper-rpath' '--disable-wrapper-runpath' '--disable-mpi-cxx' '--disable-cxx-exceptions' '--with-wrapper-ldflags=-Wl,-rpath,/cvmfs/vae.gsi.de/centos7/virgo2/spack-0.19/opt/linux-centos7-x86_64/gcc-4.8.5/gcc-10.2.0-hepooprvj2j4px24zfeepan6aqaintjb/lib/gcc/x86_64-pc-linux-gnu/10.2.0 -Wl,-rpath,/cvmfs/vae.gsi.de/centos7/virgo2/spack-0.19/opt/linux-centos7-x86_64/gcc-4.8.5/gcc-10.2.0-hepooprvj2j4px24zfeepan6aqaintjb/lib64'

Please describe the system on which you are running

CentOS 7 cluster with SLURM and a Lustre parallel filesystem (see the paths in the configure line above).

Details of the problem

The problem is identical to the one described in my previously posted issue. To investigate it further, I wrote a simple program that reproduces this I/O problem systematically:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

const int SIZE = 10000000;          /* doubles per rank and per timestep (~76 MiB) */
const double mib = 1024. * 1024.;

int main(int argc, char **argv) {
    char filename[80];
    int i, err, cmode, rank, mpi_size;
    MPI_Offset offset;
    MPI_File fh;
    MPI_Status status;

    MPI_Init(&argc, &argv);

    err = MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);
    err = MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* allocate on the heap: ~76 MiB would overflow a typical stack limit */
    double *buf = (double *) malloc(SIZE * sizeof(double));
    if (buf == NULL) MPI_Abort(MPI_COMM_WORLD, 1);

    double start = MPI_Wtime();

    for (int t = 0; t < 10; t++) {
        sprintf(filename, "output_%d.sdf", t);

        /* all ranks collectively create/open one shared file per timestep */
        cmode = MPI_MODE_CREATE | MPI_MODE_RDWR;
        err = MPI_File_open(MPI_COMM_WORLD, filename, cmode, MPI_INFO_NULL, &fh);

        for (i = 0; i < SIZE; i++) buf[i] = SIZE * 1.0 * rank + i * 1.0;

        /* each rank writes its own contiguous, non-overlapping segment */
        offset = (MPI_Offset) rank * SIZE * sizeof(double);
        err = MPI_File_set_view(fh, offset, MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL);

        /* collective write: this is where the hang is observed with OMPIO */
        err = MPI_File_write_all(fh, buf, SIZE, MPI_DOUBLE, &status);

        err = MPI_File_close(&fh);

        if (0 == rank) {
            double dvol = SIZE * sizeof(double) / mib;
            double end = MPI_Wtime();
            printf("timestep: %d Total Data written: %f MiBytes in %f secs\n",
                   t, mpi_size * dvol, end - start);
        }
    }

    free(buf);
    MPI_Finalize();

    return 0;
}
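
For reference, the reproducer was built and launched with the standard Open MPI wrappers; the exact rank count and scheduler integration vary (the lines below are just an example, assuming the source file name mpi_testio.c from the listing further down):

mpicc mpi_testio.c -o test
mpirun -np 512 ./test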

When forcing the ROMIO module with:

export OMPI_MCA_io=romio321

the program works as expected and scales nicely with the number of MPI ranks. With the now-default OMPIO I/O module, the program hangs when trying to dump the first shared file, and an additional .lock file is created:

dbertini@lxbk1130:/lustre/rz/dbertini/ompi_tests/simple > ls -al
total 337K
drwxr-sr-x. 2 dbertini  36K Dec  8 13:44 ./
drwxr-sr-x. 8 dbertini  12K Dec  8 12:49 ../
-rw-r--r--. 1 dbertini 1.1M Dec  8 13:46 100.err.log
-rw-r--r--. 1 dbertini  54K Dec  8 13:44 100.out.log
-rw-r--r--. 1 dbertini 1.1M Dec  8 13:38 99.err.log
-rw-r--r--. 1 dbertini  55K Dec  8 13:42 99.out.log
-rw-r--r--. 1 dbertini 1.4K Dec  8 13:31 mpi_testio.c
-rw-r--r--. 1 dbertini    0 Dec  8 13:44 output_0.sdf
-rw-r--r--. 1 dbertini    8 Dec  8 13:44 output_0.sdf-928983758-1108746.lock
-rwxr-xr-x. 1 dbertini  252 Dec  8 13:43 run-file.sh*
-rwxr-xr-x. 1 dbertini  203 Dec  8 12:54 submit.sh*
-rwxr-xr-x. 1 dbertini  13K Dec  8 13:23 test*

Any idea what could be going wrong with the default OMPIO module?

edgargabriel commented 1 year ago

@denisbertini thank you. Short question: how many processes do you use to run the test code? I no longer have access to a Lustre file system, but I can see whether I can reproduce the hang on another file system.

edgargabriel commented 1 year ago

The lock file would, by the way, be removed if the program finished correctly; it is only there because the program does not terminate correctly.

denisbertini commented 1 year ago

@edgargabriel For this test I used 512 MPI ranks; I can try using fewer ranks?

edgargabriel commented 1 year ago

Yes, the fewer the better; at the moment I have no way to debug a code that needs 512 processes.

denisbertini commented 1 year ago

@edgargabriel I am afraid that this problem is closely related to the Lustre filesystem.

edgargabriel commented 1 year ago

There are ways to run the Lustre collective component (dynamic_gen2) on other file systems as well, so if it is an algorithmic issue, I might be able to reproduce it somewhere else.
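
For reference, which fcoll component is available and used can be inspected and overridden via MCA parameters; a rough example (component availability depends on the build):

ompi_info | grep fcoll
mpirun --mca io ompio --mca fcoll dynamic_gen2 -np 64 ./test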

denisbertini commented 1 year ago

@edgargabriel This may be interesting for you: with a lower MPI rank count (64), the OMPIO program completed correctly ...

denisbertini commented 1 year ago

Could this be related to the Lustre file locking mechanism ?

edgargabriel commented 1 year ago

Can you maybe try one thing? Does the program finish correctly for you if you force the number of aggregators to be the number of nodes that you use? E.g. if the 512 processes are running on, let's say, 8 nodes, you could set

mpirun --mca io_ompio_num_aggregators 8 -np 512 ./your_executable

edgargabriel commented 1 year ago

Could this be related to the Lustre file locking mechanism ?

Hm. Interesting. Lustre should actually not use locking at all; this could be a bug.

ompi/mca/fs/lustre/fs_lustre_file_open.c

 fh->f_flags |= OMPIO_LOCK_NEVER;

denisbertini commented 1 year ago

Adding explicitly

export OMPI_MCA_io_ompio_num_aggregators=8

did not solve the problem when using 512 ranks. It still works with 64 processes though ...

denisbertini commented 1 year ago

I changed the job distribution to use a fixed number of nodes:

sbatch --nodes 8 --tasks-per-node 32 --ntasks-per-core 1 --no-requeue --job-name r_mpi --mem-per-cpu 4000 --mail-type ALL --mail-user d.bertini@gsi.de --partition debug --time 0-08:00:00 -D ./ -o %j.out.log -e %j.err.log   -- ./run-file.sh

I also set the number of aggregators to the number of nodes. Run this way, the program completed even with 256 processes ...

denisbertini commented 1 year ago

Interesting: using a similar distribution, 8 nodes * 64 processes/node = 512, the program completed successfully without deadlock ...

edgargabriel commented 1 year ago

I will try to debug over the weekend the lock setting algorithm in the fbtl/posix component and how the flags are being applied, something must be wrong there and could potentially be the reason for what you observe in this testcase. I might have to setup a lustre file system on my workstation for that.

denisbertini commented 1 year ago

ok perfect !

edgargabriel commented 1 year ago

Interesting: using a similar distribution, 8 nodes * 64 processes/node = 512, the program completed successfully without deadlock ...

Actually, apologies, it has been a while since I worked on this part of the code. The Lustre fcoll component (dynamic_gen2) already sets the number of aggregators to the number of nodes used, so in theory the io_ompio_num_aggregators setting should not be required. I am getting more convinced that the locking is the culprit here.

denisbertini commented 1 year ago

I would say so too!

edgargabriel commented 1 year ago

I was able to step through a simple example line by line in the debugger; the locking is not the problem. The fbtl component that performs the actual write does not take a lock: it returns immediately from the fbtl_posix_lock() function without doing anything, since the LOCK_NEVER flag is set on the file handle. That is the expected and correct behavior.
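
A simplified sketch of the guard being described (illustrative only, not the actual Open MPI source; the struct, flag value, and function names here are assumptions):

#include <fcntl.h>

#define OMPIO_LOCK_NEVER 0x01             /* illustrative flag value */

/* hypothetical file handle carrying the lock policy */
struct io_filehandle {
    int f_flags;
    int fd;
};

/* returns immediately without locking when the policy says locks are never needed */
static int posix_lock_sketch(struct io_filehandle *fh, struct flock *lock)
{
    if (fh->f_flags & OMPIO_LOCK_NEVER) {
        return 0;                         /* Lustre path: no lock taken */
    }
    return fcntl(fh->fd, F_SETLKW, lock); /* otherwise take a byte-range lock */
}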

The reason you have a lock file is the shared file pointer component: ompio detected that the file system supports locking, and hence the 'lockedfile' sharedfp component was selected. This component opens a lock file temporarily and closes it again on file_close().

I have one more suspicion on what could be the reason, but will need some time to dig into that a bit.

denisbertini commented 1 year ago

Hmm, interesting. As far as I know (correct me if I am wrong), when writing to a single shared file from multiple processes, every process writes to its own separate, pre-allocated segment. In principle there is no need for a lock in such a case. Why is the 'lockedfile' sharedfp component selected then?

edgargabriel commented 1 year ago

This is for the MPI_File_write_shared operations (and friends). Your example does not use these operations, but at MPI_File_open time we do not know which functions will be used, so we have to prepare everything even for these operations.
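
For illustration, these are the kinds of calls that actually use the shared file pointer (and hence the sharedfp/lockedfile machinery); they are not part of the reproducer above and simply reuse fh, buf, SIZE and status from it:

/* appends this rank's block at the current shared file pointer;
   the ordering between ranks is whatever the runtime produces */
MPI_File_write_shared(fh, buf, SIZE, MPI_DOUBLE, &status);

/* ordered variant: ranks write in rank order */
MPI_File_write_ordered(fh, buf, SIZE, MPI_DOUBLE, &status);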

denisbertini commented 1 year ago

Have there been any further investigations into this problem? I noticed that the same problem occurs in the latest release version, v4.1.5.

edgargabriel commented 1 year ago

Unfortunately I have no additional insights. There are no differences in the parallel I/O code between 4.1.4 and 4.1.5. It is a difficult bug to hunt down, since it only occurs at a relatively large process count, and it works if you distribute the processes evenly across the nodes (which really should only affect which transport the processes use to communicate with each other).

The only suggestion/idea I have from looking at the ticket again: maybe you could try to compile Open MPI with the internal libevent and hwloc to see whether that makes a difference. I doubt it, but it might be worth a try. Also, I would not disable cma, for example; if it is available, it helps tremendously with communication performance.

denisbertini commented 1 year ago

I did a few more tests with the same simple program from above, and found that if I use a higher stripe count (lfs setstripe -c -1) on the output directory, the .lock file does not appear and the job completes. Does that give a hint?
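
The striping change mentioned above corresponds to commands along these lines (the directory is taken from the listing earlier in the thread; lfs getstripe is just for verification):

lfs setstripe -c -1 /lustre/rz/dbertini/ompi_tests/simple
lfs getstripe output_0.sdf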

edgargabriel commented 1 year ago

no, unfortunately not really. I looked into the lock files, but I am pretty sure that they have nothing to do with the problem that you see

denisbertini commented 1 year ago

You mean the problem could come from the Lustre filesystem and not from OMPIO?

edgargabriel commented 1 year ago

No, not necessarily. I think the part that is confusing me is that when you ran the job with a slightly different distribution of processes, e.g. your comment from Dec. 8: interesting using similar distribution, 8 nodes * 64 processes/node = 512, the program completed successfully without deadlock ...

it worked. Internally in ompio there is absolutely no difference from an algorithmic perspective between this process distribution and the other one. The only difference is how the processes communicate with each other, e.g. some processes may communicate through shared memory vs. the inter-node network. So there are slight differences in the timing of how things happen, i.e. the sequence of events. I think this is a network stack issue, but it is really difficult to narrow down or reproduce. That was the reason I suggested trying the internal libevent and maybe not disabling cma and the like.

denisbertini commented 1 year ago

I am not sure I understand what you mean by internal libevent and cma not disabled. What should I modify in Open MPI to run with this kind of configuration, and what is it supposed to test?

edgargabriel commented 1 year ago

It is in the configure line of Open MPI: the line that you provided at the top of the ticket shows how it was configured, and it contains options disabling a lot of components, e.g. ... --without-xpmem' '--without-cma ...

In the configure line I would add --with-libevent=internal --with-hwloc=internal and not set --without-cma.
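
A trimmed sketch of how the configure invocation from the top of the ticket would change under this suggestion (only the relevant flags are shown; <install-dir> and <ucx-dir> are placeholders, all other options stay as they were):

./configure --prefix=<install-dir> \
    --with-ucx=<ucx-dir> \
    --with-slurm \
    --with-libevent=internal \
    --with-hwloc=internal
# note: '--without-cma' is deliberately dropped, so CMA support is auto-detected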

denisbertini commented 1 year ago

What will this new configuration test/prove?

edgargabriel commented 1 year ago

It is simply a test to see whether any of these parameters have an influence in this scenario.

denisbertini commented 1 year ago

I realised that in our case the Lustre module in OMPIO was not activated/compiled, and only the generic Linux FS component was used internally in Open MPI. We will now add the lustreapi headers in order to activate it, and I will run the tests again.
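
For anyone following along: whether the Lustre fs component was built can be checked with ompi_info, and it is typically enabled by pointing configure at the lustreapi installation (the path below is a placeholder):

ompi_info | grep -i lustre      # should list an 'fs: lustre' component if it was built
./configure ... --with-lustre=/path/to/lustre ...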

denisbertini commented 11 months ago

This issue is now solved with Open MPI 5.0.0. You can close this issue.