pratikvn / schwarz-lib

Repository for testing asynchronous schwarz methods.
https://pratikvn.github.io/schwarz-lib/
BSD 3-Clause "New" or "Revised" License

Does not converge with -N flag in mpirun #34

Closed: soumyadipghosh closed this issue 4 years ago

soumyadipghosh commented 4 years ago

The following run converges:

mpirun -np 8 ./benchmarking/bench_ras --executor=reference --num_iters=2000 --explicit_laplacian --set_1d_laplacian_size=128 --set_tol=1e-6 --local_tol=1e-12 --partition=regular --local_solver=iterative-ginkgo --enable_onesided --enable_flush=flush-local --write_comm_data --timings_file=subd

But the following doesn't:

mpirun -N 1 -np 8 ./benchmarking/bench_ras --executor=reference --num_iters=2000 --explicit_laplacian --set_1d_laplacian_size=128 --set_tol=1e-6 --local_tol=1e-12 --partition=regular --local_solver=iterative-ginkgo --enable_onesided --enable_flush=flush-local --write_comm_data --timings_file=subd

I am reserving the appropriate number of PEs in the second case, i.e., num_pe_pernode * 8. I am using Open MPI and Boost compiled with gcc/8.3.0.

pratikvn commented 4 years ago

Hm, I am not really sure what the problem is.

soumyadipghosh commented 4 years ago
 Problem Size: 16384
 Regular 1D partition
 Local iterative solve with Ginkgo CG 
 residual norm nan
 relative residual norm of solution nan
 Time taken for solve 597.05
 Did not converge in 2000 iterations.

mpirun -N 1 -np 4 ./benchmarking/bench_ras --executor=reference --num_iters=2000 --explicit_laplacian --set_1d_laplacian_size=128 --set_tol=1e-6 --local_tol=1e-12 --partition=regular --local_solver=iterative-ginkgo --enable_onesided --enable_flush=flush-local --write_comm_data --timings_file=subd

Were you able to reproduce the problem?

pratikvn commented 4 years ago

The -N flag and the -np flag may be conflicting. How many nodes have you reserved? On my local system, this is what mpirun -h gives me:

-N <arg0>             Launch n processes per node on all allocated nodes
                         (synonym for npernode)
-n|--n <arg0>         Number of processes to run

So, I think if you reserve 4 nodes through your batch submission system and pass -N 1 without the -np flag, that should be sufficient.

No, I don't have a system with qsub. With Slurm I can just select the number of nodes and the tasks per node and submit the job through srun. I don't have to use mpirun, so I don't see this problem.

soumyadipghosh commented 4 years ago

This is the description of mpirun on my system

-N <arg0>             Launch n processes per node on all allocated nodes
                         (synonym for 'map-by node')
-n|--n <arg0>         Number of processes to run

I removed the -np flag and it now seems to converge for the event-based branch with a threshold of 0. For non-zero thresholds, the problem persists! I am running more tests and will update with what happens.

pratikvn commented 4 years ago

If you want to check whether the job distribution is as you intended (for example, 1 process per node across 4 nodes rather than 4 processes on a single node), it might be easier to write a small standalone MPI test that uses MPI_Comm_split_type to get the ranks local to a node, as such:

// Split MPI_COMM_WORLD into one communicator per shared-memory node.
MPI_Comm local_comm;
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                    MPI_INFO_NULL, &local_comm);

// Rank of this process within its own node (0, 1, ... on each node).
int local_rank;
MPI_Comm_rank(local_comm, &local_rank);

This way you can be sure of the rank placement. Here is the documentation for MPI_Comm_split_type.

If you want finer control and have Open MPI, this is a good introduction. There should be something similar for other MPI implementations as well.
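
For reference, a minimal standalone version of such a placement test (an illustrative sketch only, not part of schwarz-lib; it assumes nothing beyond a working MPI installation) that prints the host name, the global rank, and the node-local rank of each process:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Communicator containing only the ranks that share this node.
    MPI_Comm local_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &local_comm);
    int local_rank;
    MPI_Comm_rank(local_comm, &local_rank);

    // The host name shows which node each rank actually landed on.
    char hostname[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(hostname, &name_len);

    printf("host %s: world rank %d, node-local rank %d\n",
           hostname, world_rank, local_rank);

    MPI_Comm_free(&local_comm);
    MPI_Finalize();
    return 0;
}

With -N 1 and 4 nodes, every line should report node-local rank 0 on a distinct host; if several ranks share a host, the processes are being packed onto fewer nodes than intended.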

soumyadipghosh commented 4 years ago

I checked the job distribution by printing the hostname: with -N, every process reports a different hostname, which confirms that they are assigned to different nodes. I ran more tests, but the problem is still there!

Can you run the following test?

soumyadipghosh commented 4 years ago

I found that the residual values (and possibly solution values) are becoming nan on some processes. Is there some possibility of a divide-by-zero occurring?
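
One way to narrow this down (a generic debugging sketch, not tied to schwarz-lib's internals; the function and variable names here are made up) is to check the locally computed residual norm with isnan and report the first rank and iteration where a NaN shows up, e.g. from a 0.0/0.0:

#include <math.h>
#include <stdio.h>

// Hypothetical helper: call right after the local residual norm is computed.
static void check_residual_norm(double res_norm, int rank, int iter)
{
    if (isnan(res_norm)) {
        fprintf(stderr, "NaN residual norm on rank %d at iteration %d\n",
                rank, iter);
    }
}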

pratikvn commented 4 years ago

So, I don't seem to have a problem with convergence. I am not sure if the process distribution is proper on this system, but I will check tomorrow on Summit again.

pratikvn commented 4 years ago

After checking again on Summit, with non-zero constant and non-zero gamma, nothing converges for me. For zero constant and zero gamma, every configuration converges.

The residual norm does not become nan, but nevertheless I think there is some bug in the event-based communication.

soumyadipghosh commented 4 years ago

Oh! With the same code converging on one system and not converging on another, it will be hard to debug!

Did you run the develop branch on Summit as well? What does mpirun -N 1 for 8 nodes give there? For me, the develop branch (from Feb 25) also does not converge; I am not sure about the latest develop branch.

soumyadipghosh commented 4 years ago

This issue has been resolved. The problem was probably fixed by two things: