Closed denisbertini closed 11 months ago
@denisbertini thank you. Short question, how many processes do you use to run the testcode? I do not have access to a lustre file system anymore, but I can see whether I can reproduce the hang on another file system.
The lock file would, btw, be removed if the program finished correctly; it's just there because the program doesn't terminate correctly.
@edgargabriel for this test I used 512 MPI ranks, I can try using fewer ranks?
yes, the fewer the better, I have no way to debug at the moment a code that needs 512 processes
@edgargabriel I am afraid that this problem is closely related to the Lustre filesystem,
there are ways to run the Lustre collective component (dynamic_gen2) on other file systems as well, so if it's an algorithmic issue, I might be able to reproduce it somewhere else too
@edgargabriel may be interesting for you: with lower MPI rank counts (64) the OMPIO program completed correctly ...
Could this be related to the Lustre file locking mechanism ?
Can you try one thing maybe? Does the program finish correctly for you if you force-set the number of aggregators to the number of nodes that you use? E.g. if the 512 processes are running on, let's say, 8 nodes, you could set
mpirun --mca io_ompio_num_aggregators 8 -np 512 ./your executable
Could this be related to the Lustre file locking mechanism ?
Hm. Interesting. Lustre should actually not use locking at all, could be a bug.
ompi/mca/fs/lustre/fs_lustre_file_open.c
fh->f_flags |= OMPIO_LOCK_NEVER;
Adding explicitly
export OMPI_MCA_io_ompio_num_aggregators=8
did not solve the problem when using 512 ranks. Still working for 64 processes though ...
I changed the distribution of jobs to a fixed number of nodes:
sbatch --nodes 8 --tasks-per-node 32 --ntasks-per-core 1 --no-requeue --job-name r_mpi --mem-per-cpu 4000 --mail-type ALL --mail-user d.bertini@gsi.de --partition debug --time 0-08:00:00 -D ./ -o %j.out.log -e %j.err.log -- ./run-file.sh
Setting the number of aggregators to the number of nodes, the program completed even with 256 processes ...
interesting, using a similar distribution: 8 nodes * 64 processes/node = 512, the program completed successfully without deadlock ...
I will try to debug the lock-setting algorithm in the fbtl/posix component over the weekend, and how the flags are being applied; something must be wrong there and could potentially be the reason for what you observe in this testcase. I might have to set up a Lustre file system on my workstation for that.
ok perfect !
actually, apologies, its been a while that I worked on this part of the code. The lustre fcoll component (dynamic_gen2) already sets the number of aggregators to be the number of nodes used, so in theory the io_ompio_num_aggregators setting should not be required. I am getting more convinced that the locking is the culprit here.
I would say so too!
I was able to step through line by line for a simple example in the debugger, the locking is not the problem. The fbtl component performing the actual write does not perform a lock. It returns immediately from the fbtl_posix_lock() function without doing anything since the LOCK_NEVER flag is set on the file handle. That is the expected and correct behavior.
The reason you have a lock file is because of the shared file pointer component: ompio detected that the file system supports locking, and hence the 'lockedfile' sharedfp component has been selected. This component opens a lock file temporarily and closes it again on file_close().
I have one more suspicion on what could be the reason, but will need some time to dig into that a bit.
Hmm, interesting. As far as I know (correct me if I am wrong), when writing to a single shared file from multiple processes, every process writes to a separate, pre-allocated segment. In principle there is no need for a lock in such a case. Why is the 'lockedfile' sharedfp component then selected?
This is for the MPI_File_write_shared operations (and friends). Your example does not use these operations, but in MPI_File_open we don't know which functions will be used, and hence we have to prepare everything even for these operations.
Has there been any further investigation of this problem? I noticed that the same problem occurred in the latest release version, v4.1.5.
I have unfortunately no additional insights. There are no differences between the parallel I/O code in 4.1.4 and 4.1.5. It is a difficult bug to hunt down since it only occurs on a relatively large process count, and it works if you distribute the processes evenly on the nodes (which really should only have an impact on what transport processes are using to communicate to each other).
The only suggestion/idea that I have, maybe, looking at the ticket again: you could try to compile Open MPI with the internal libevent and hwloc to see whether that makes a difference. I doubt it, but it might be worth a try. Also, I would not disable cma, for example; if it is available it helps tremendously with communication performance.
I made a few more tests with the same simple program from above in this issue, and found out that if I use a higher stripe count (lfs setstripe -c -1) on the output directory, the .lock file does not appear and the job completes. Does this give a hint?
no, unfortunately not really. I looked into the lock files, but I am pretty sure that they have nothing to do with the problem that you see
you mean the problem could come from the Lustre filesystem and not OMPIO?
no, not necessarily. I think the part that is confusing me is that when you ran the job with a slightly different distribution of processes, e.g. your comment from Dec. 8:
interesting, using a similar distribution: 8 nodes * 64 processes/node = 512, the program completed successfully without deadlock ...
it worked. Internally in ompio, there is absolutely no difference in what happens from the algorithmic perspective between this process distribution vs. the other one. The only difference is how processes communicate with each other, e.g. there might be some processes that communicate through shared memory vs. the inter-node network. So there are slight differences in the timing of how things happen, or in the sequence of events. I think this is a network stack issue, but it is really difficult to narrow down or reproduce. That was the reason I suggested trying the internal libevent and maybe not disabling cma and similar.
I am not sure I understand what you mean by internal libevent and cma not disabled. What should I modify in Open MPI to run with this kind of configuration, and what is it supposed to test?
it's in the configure line of Open MPI: the line that you provided at the top of the ticket shows how it was configured, and it contains disabling of a lot of components, e.g.
... --without-xpmem' '--without-cma ...
In the configure line I would add --with-libevent=internal --with-hwloc=internal
and not set the --without-cma
what will this new configuration test/prove ?
it's simply a test to see whether any of these parameters has an influence in this scenario.
I realised that in our case the Lustre module in OMPIO was not activated/compiled, and only the generic Linux FS component was used internally in Open MPI. We will now add the lustreapi headers in order to activate it, and I will do the tests again.
This issue is now solved with Open MPI 5.0.0. You can close this issue.
Background information
The OMPIO I/O module (the default) hangs when writing a single shared file from multiple MPI processes on a Lustre filesystem.
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Installation (spack) detailed:
Please describe the system on which you are running
4.18.0-372.26.1.el8_6.x86_64
Details of the problem
The problem is identical to the one described in my already posted issue. In order to go further, I created a simple program that systematically reproduces this I/O problem:
When using the ROMIO module by enforcing: the program works as expected and scales nicely with the number of MPI ranks. When using the now-default OMPIO I/O module, the program hangs when trying to dump the first shared file and an additional .lock file is created: Any idea what could go wrong with the OMPIO default module?