open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Lock files generated during parallel read of a NetCDF file #10053

Open dqwu opened 2 years ago

dqwu commented 2 years ago

What version of Open MPI are you using?

v4.1.1

Describe how Open MPI was installed

spack installation

Please describe the system on which you are running


Details of the problem

This issue occurs on a machine used by E3SM (e3sm.org): https://e3sm.org/model/running-e3sm/supported-machines/chrysalis-anl

The file system is GPFS. Multiple .loc files associated with the same NetCDF input file were generated by different users within a 12-minute window:

-rw-r--r-- 1 ac.jgfouca    E3SM 8 Feb 17 23:27 /lcrc/group/e3sm/data/inputdata/atm/cam/inic/homme/cami_mam3_Linoz_ne30np4_L72_c160214.nc-1115488256-2337493.loc
-rw-r--r-- 1 ac.ndkeen     E3SM 8 Feb 17 23:30 /lcrc/group/e3sm/data/inputdata/atm/cam/inic/homme/cami_mam3_Linoz_ne30np4_L72_c160214.nc-1117061120-2338509.loc
-rw-r--r-- 1 ac.onguba     E3SM 8 Feb 17 23:32 /lcrc/group/e3sm/data/inputdata/atm/cam/inic/homme/cami_mam3_Linoz_ne30np4_L72_c160214.nc-1117257728-2339568.loc
-rw-r--r-- 1 jayesh        E3SM 8 Feb 17 23:35 /lcrc/group/e3sm/data/inputdata/atm/cam/inic/homme/cami_mam3_Linoz_ne30np4_L72_c160214.nc-1117323264-2340199.loc
-rw-r--r-- 1 ac.brhillman  E3SM 8 Feb 17 23:37 /lcrc/group/e3sm/data/inputdata/atm/cam/inic/homme/cami_mam3_Linoz_ne30np4_L72_c160214.nc-1117454336-2340833.loc
-rw-r--r-- 1 wuda          E3SM 8 Feb 17 23:39 /lcrc/group/e3sm/data/inputdata/atm/cam/inic/homme/cami_mam3_Linoz_ne30np4_L72_c160214.nc-1118240768-2341335.loc

We also saw some .locktest files generated, such as cami_mam3_Linoz_ne30np4_L72_c160214.nc.locktest.0. This is most likely a race condition, as the issue is not always reproducible.

More information

Modules used: intel/20.0.4-kodw73g intel-mkl/2020.4.304-g2qaxzf openmpi/4.1.1-qiqkjbu parallel-netcdf/1.11.0-go65een

The tests were run with 1792 MPI tasks on 28 nodes (64 tasks per node). The parallel read code calls the ncmpi_begin_indep_data() API of the PnetCDF library, which calls the MPI_File_open() API of the Open MPI library, and an error code is returned:

1536: MPI error (MPI_File_open) : MPI_ERR_OTHER: known error not in list
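
For reference, a minimal standalone sketch (not the E3SM/PnetCDF code path) of how an MPI_File_open failure like the one above can be decoded with MPI_Error_string; the file name is just a placeholder taken from the listing above:

/* Minimal sketch: MPI file handles use the MPI_ERRORS_RETURN error handler
 * by default, so the error code returned by MPI_File_open can be inspected
 * and turned into a readable message. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_File fh;
    int rc = MPI_File_open(MPI_COMM_WORLD,
                           "cami_mam3_Linoz_ne30np4_L72_c160214.nc",
                           MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    if (rc != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "MPI_File_open failed: %s\n", msg); /* e.g. MPI_ERR_OTHER */
    } else {
        MPI_File_close(&fh);
    }

    MPI_Finalize();
    return 0;
}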

It has been confirmed that these lock files are created by Open MPI code:

ompi/mca/sharedfp/lockedfile/sharedfp_lockedfile_file_open.c:
snprintf(lockedfilename, filenamelen, "%s-%u-%d%s",filename,masterjobid,int_pid,".lock");

ompi/mca/sharedfp/lockedfile/sharedfp_lockedfile.c:
sprintf(filename,"%s%s%d",fh->f_filename,".locktest.",rank);
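
For illustration, a small standalone sketch (not Open MPI code) that reproduces the two name patterns above; the job id and pid values are taken from the last entry of the .loc listing and are only examples:

/* Illustrative only: shows how the helper file names combine the data file
 * name with the master job id, the pid, and the rank. */
#include <stdio.h>

int main(void)
{
    const char *filename = "cami_mam3_Linoz_ne30np4_L72_c160214.nc";
    unsigned int masterjobid = 1118240768;  /* example job id from the listing */
    int int_pid = 2341335;                  /* example pid from the listing */
    int rank = 0;

    char lockedfilename[512];
    char locktestname[512];

    snprintf(lockedfilename, sizeof(lockedfilename), "%s-%u-%d%s",
             filename, masterjobid, int_pid, ".lock");
    snprintf(locktestname, sizeof(locktestname), "%s%s%d",
             filename, ".locktest.", rank);

    printf("%s\n%s\n", lockedfilename, locktestname);
    return 0;
}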

As a workaround, E3SM developers have set the input directory /lcrc/group/e3sm/data/inputdata/atm/cam/inic/homme to be read-only. However, a similar issue occurred in another directory (/lcrc/group/e3sm/data/inputdata/atm/cam/topo), which is still writable.

Questions

Do you have any suggestions for this issue? Since the file system is GPFS, do you think setting the ROMIO_GPFS_FREE_LOCKS environment variable would help?

ompi/mca/io/romio321/romio/adio/ad_gpfs/ad_gpfs_open.c
void ADIOI_GPFS_Open(ADIO_File fd, int *error_code)
{
...
#ifdef HAVE_GPFS_FCNTL_H
    /* in parallel workload, might be helpful to immediately release block
     * tokens.  Or, system call overhead will outweigh any benefits... */
    if (getenv("ROMIO_GPFS_FREE_LOCKS")!=NULL)
        gpfs_free_all_locks(fd->fd_sys);

#endif
...
}
edgargabriel commented 2 years ago

@dqwu thank you for the bug report. There is a chance that this issue is already resolved; there is a pending PR to fix a problem with removing lock files at the end. I hope it makes it into the v4.1.2 release.

https://github.com/open-mpi/ompi/pull/10006

The ROMIO environment variable will not have any impact on the OMPIO components; ROMIO and OMPIO are two separate implementations of the MPI I/O operations in Open MPI. The sharedfp/lockedfile component is part of the OMPIO set of frameworks that implement MPI I/O.

jayeshkrishna commented 2 years ago

Update: We still have the same issue after upgrading to Open MPI 4.1.2.

edgargabriel commented 2 years ago

@jayeshkrishna yes, the fix didn't make it into v4.1.2, but is part of v4.1.3 which will probably be released later this week.

dqwu commented 2 years ago

@edgargabriel It looks like openmpi/4.1.3 did not fully fix the lock-file issue during reads on the E3SM machine Chrysalis. E3SM developers deleted all lock files in inputdata yesterday; this morning they were back.

It seems that the lock files created during a read are not deleted if a file-open call fails:

ierr = pio_openfile(pio_subsystem, file, pio_iotype, fname, mode) // This calls the PnetCDF open-file API, which calls some MPI-IO APIs

Do you know possible workarounds to avoid these lock files even when some openmpi calls might fail?

FYI, below is how we configured Open MPI 4.1.3 on Chrysalis.

$ /gpfs/fs1/soft/chrysalis/spack/opt/spack/linux-centos8-x86_64/intel-20.0.4/openmpi-4.1.3-pin4k7o/bin/ompi_info
                 Package: Open MPI svcbuilder@chrlogin1.lcrc.anl.gov
                          Distribution
                Open MPI: 4.1.3
  Open MPI repo revision: v4.1.3
   Open MPI release date: Mar 31, 2022
                Open RTE: 4.1.3
  Open RTE repo revision: v4.1.3
   Open RTE release date: Mar 31, 2022
                    OPAL: 4.1.3
      OPAL repo revision: v4.1.3
       OPAL release date: Mar 31, 2022
                 MPI API: 3.1.0
            Ident string: 4.1.3
                  Prefix: /gpfs/fs1/soft/chrysalis/spack/opt/spack/linux-centos8-x86_64/intel-20.0.4/openmpi-4.1.3-pin4k7o
 Configured architecture: x86_64-pc-linux-gnu
          Configure host: chrlogin1.lcrc.anl.gov
           Configured by: svcbuilder
           Configured on: Thu Apr 14 19:50:43 UTC 2022
          Configure host: chrlogin1.lcrc.anl.gov
  Configure command line: '--prefix=/gpfs/fs1/soft/chrysalis/spack/opt/spack/linux-centos8-x86_64/intel-20.0.4/openmpi-4.1.3-pin4k7o'
                          '--enable-shared' '--disable-silent-rules'
                          '--enable-mpi1-compatibility'
                          '--with-platform=contrib/platform/mellanox/optimized'
                          '--disable-builtin-atomics' '--with-pmi=/usr'
                          '--enable-static'
                          '--with-zlib=/gpfs/fs1/soft/chrysalis/spack/opt/spack/linux-centos8-x86_64/intel-20.0.4/zlib-1.2.11-dudhhig'
                          '--enable-mpi1-compatibility' '--without-psm'
                          '--without-fca' '--without-cma'
                          '--with-knem=/opt/knem-1.1.4.90mlnx1'
                          '--without-mxm' '--without-ofi' '--without-psm2'
                          '--with-hcoll=/opt/mellanox/hcoll'
                          '--without-xpmem' '--without-verbs'
                          '--with-ucx=/usr' '--with-slurm' '--without-lsf'
                          '--without-alps' '--without-loadleveler'
                          '--without-sge' '--without-tm'
                          '--disable-memchecker'
                          '--with-hwloc=/gpfs/fs1/soft/chrysalis/spack/opt/spack/linux-centos8-x86_64/intel-20.0.4/hwloc-2.4.1-22xfxgi'
                          '--disable-java' '--disable-mpi-java'
                          '--without-cuda' '--enable-wrapper-rpath'
                          '--disable-wrapper-runpath' '--enable-mpi-cxx'
                          '--disable-cxx-exceptions'
                          '--with-wrapper-ldflags=-Wl,-rpath,/gpfs/fs1/soft/chrysalis/spack/opt/spack/linux-centos8-x86_64/gcc-9.3.0/intel-20.0.4-kodw73g/compilers_and_libraries_2020.4.304/linux/compiler/lib/intel64_lin'

Note '--with-platform=contrib/platform/mellanox/optimized'. Our Mellanox HPCX version is v2.8.0.

edgargabriel commented 2 years ago

You could try to set --mca sharedfp ^lockedfile.

Yes, if something goes wrong (e.g. the code crashes), the lock files will not be cleaned up, and I am not aware of an easy solution for this. I will try to think about it.
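
A possible stopgap, essentially what the E3SM developers did by hand, is to remove leftover helper files once no job is using the data files any more. A hypothetical cleanup sketch, assuming the file name patterns shown earlier in this issue (this is not an Open MPI feature):

/* Hypothetical cleanup: remove leftover .loc/.lock/.locktest helper files.
 * Only run this when no job is actively using the data files, otherwise
 * lock files belonging to live runs may be removed. */
#include <glob.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Example patterns modeled on the names observed above; adjust the
     * directory as needed. */
    const char *patterns[] = {
        "/lcrc/group/e3sm/data/inputdata/atm/cam/inic/homme/*.nc-*-*.loc*",
        "/lcrc/group/e3sm/data/inputdata/atm/cam/inic/homme/*.nc.locktest.*",
    };
    const int npatterns = (int)(sizeof(patterns) / sizeof(patterns[0]));

    for (int i = 0; i < npatterns; i++) {
        glob_t g;
        if (glob(patterns[i], 0, NULL, &g) == 0) {
            for (size_t j = 0; j < g.gl_pathc; j++) {
                if (unlink(g.gl_pathv[j]) == 0)
                    printf("removed %s\n", g.gl_pathv[j]);
                else
                    perror(g.gl_pathv[j]);
            }
            globfree(&g);
        }
    }
    return 0;
}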

dqwu commented 2 years ago

@edgargabriel Thanks for the suggestion. Setting the following environment variable should also work, right?

export OMPI_MCA_sharedfp=^lockedfile

I have tested the above setting with the test case mentioned in #10297. No .lock files were generated, but hundreds of files named xxx.nc.data.xxx and xxx.nc.metadata.xxx were generated when the write failed. Is this the expected result?

Update: I changed that test case to rerun with fewer variables so that it passes. The temporary xxx.nc.data.xxx and xxx.nc.metadata.xxx files were still generated, but they were all deleted by Open MPI after file close (the write did not fail).

Do you have a similar option for Open MPI to disable these data and metadata files?

edgargabriel commented 2 years ago

@dqwu yes, the environment variable is equivalent to the runtime parameter. Hm, I did not expect the individual component to kick in in this case, but it looks like it has. Try to exclude both the lockedfile and the individual components, e.g.

export OMPI_MCA_sharedfp=^lockedfile,individual

dqwu commented 2 years ago

@edgargabriel "export OMPI_MCA_sharedfp=^lockedfile,individual" seems to work, thanks.