Hm, I am not really sure what the problem is.
Problem Size: 16384
Regular 1D partition
Local iterative solve with Ginkgo CG
residual norm nan
relative residual norm of solution nan
Time taken for solve 597.05
Did not converge in 2000 iterations.
Increasing the iteration count to 5000 did not help. A regular2d partition (with 16 PEs) doesn't converge either.
However, I noticed that changing the number of PEs from 8 to 4 made it converge, i.e., the following converged:
mpirun -N 1 -np 4 ./benchmarking/bench_ras --executor=reference --num_iters=2000 --explicit_laplacian --set_1d_laplacian_size=128 --set_tol=1e-6 --local_tol=1e-12 --partition=regular --local_solver=iterative-ginkgo --enable_onesided --enable_flush=flush-local --write_comm_data --timings_file=subd
Were you able to reproduce the problem?
The -N flag and the -np flag may be conflicting. How many nodes have you reserved? On my local system, this is what mpirun -h gives me:
-N <arg0> Launch n processes per node on all allocated nodes
(synonym for npernode)
-n|--n <arg0> Number of processes to run
So, I think if you reserve 4 nodes through your batch submission system and put -N1 without the -np flag, that may be sufficient?
No, I don't have a system with qsub. With Slurm I can just select the number of nodes and the tasks per node and submit it through srun. I don't have to use mpirun, so I don't have this problem.
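For example (the node and task counts here are only illustrative), an equivalent Slurm launch would be something like:
srun -N 4 --ntasks-per-node=1 ./benchmarking/bench_ras <same options as above>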
This is the description of mpirun on my system:
-N <arg0> Launch n processes per node on all allocated nodes
(synonym for 'map-by node')
-n|--n <arg0> Number of processes to run
I removed the -np flag and it seems to converge now for the event-based branch with a threshold of 0. For non-zero thresholds, the problem persists! I am running more tests and will update what happens.
If you want to check whether your job distribution is as you intended, for example 1 process per node on 4 nodes and not 4 processes on one single node, it might be easier to write a small standalone test with MPI, using MPI_Comm_split_type to get the ranks local to a node, as such:
MPI_Comm local_comm;
// Group the ranks of MPI_COMM_WORLD that share a node (shared memory)
// into one sub-communicator per node.
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                    MPI_INFO_NULL, &local_comm);
// The rank within local_comm is the node-local rank of this process.
int local_rank;
MPI_Comm_rank(local_comm, &local_rank);
This way you can be sure of the rank placement. Here is the documentation for MPI_Comm_split_type.
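If it helps, a minimal standalone test along those lines could look like this (a sketch of my own, not code from the repository), printing the world rank and the node-local rank of every process:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Ranks that share a node end up in the same local communicator.
    MPI_Comm local_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &local_comm);
    int local_rank;
    MPI_Comm_rank(local_comm, &local_rank);

    printf("world rank %d has node-local rank %d\n", world_rank, local_rank);

    MPI_Comm_free(&local_comm);
    MPI_Finalize();
    return 0;
}

With 1 process per node, every rank should report node-local rank 0; with several processes packed onto one node, they will report 0, 1, 2, and so on.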
If you want finer control and have OpenMPI, this is a good introduction. There should be something similar for other MPI implementations as well.
I checked the job distribution by printing the hostname, and every process has a different hostname with -N, which confirms that they are assigned to different nodes. I ran more tests but the problem is still there!
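A hostname check like that can be done with MPI_Get_processor_name, for example (a sketch assuming MPI is already initialized; variable names are my own):

char node_name[MPI_MAX_PROCESSOR_NAME];
int name_len, rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Get_processor_name(node_name, &name_len);
// With mpirun -N 1, every rank should report a different hostname.
printf("rank %d runs on %s\n", rank, node_name);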
Can you run the following tests? Modify cpu_batch and run_script to suit your system, keeping the parameters for constant and gamma as below:
1. Set constant=0 and gamma=0 and run it, keeping the other parameters in run_script fixed. See if it converges (it does for me).
2. Set constant=1e-10 and gamma=0.95 and run it, keeping the other parameters in run_script fixed (this threshold is so low that communication should be triggered in every iteration, making it similar to the previous case). See if it converges (it doesn't for me!).
I found that the residual values (and possibly solution values) are becoming nan on some processes. Is there some possibility of a divide-by-zero occurring?
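One way to narrow that down would be to scan the local residual vector for non-finite entries before the norm is taken; a small sketch (values and n are placeholder names, not from the benchmark):

#include <cmath>
#include <cstdio>

// Report the first nan/inf entry in a local vector, if any.
void check_finite(const double* values, int n, int rank)
{
    for (int i = 0; i < n; ++i) {
        if (!std::isfinite(values[i])) {
            printf("rank %d: entry %d is non-finite (%g)\n", rank, i, values[i]);
            return;
        }
    }
}

In CG specifically, a zero denominator in the step-size computation (e.g. p^T A p, or the previous residual dot product used for beta) would produce an inf or nan that then propagates into the residual norm, so that is one plausible place a divide-by-zero could occur.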
With mpirun -N 1 and 4 nodes, so 4 subdomains, both cases (constant=1e-10 with gamma=0.95, and constant=0 with gamma=0) converge for me.
With mpirun -np 8 and 1 node, both converge as well. With mpirun -N 1 and 8 nodes, both converge as well. With mpirun -N 4 and 2 nodes, both converge as well.
So, I don't seem to have a problem with convergence. I am not sure if the process distribution is proper on this system, but I will check tomorrow on Summit again.
After checking again on Summit, with non-zero constant and non-zero gamma, nothing converges for me. For zero constant and zero gamma, every configuration converges.
The residual norm does not become nan, though; nevertheless, I think there is some bug in the event-based communication.
Oh! With the same code converging on one system and not converging on another, it will be hard to debug!
Did you run the develop branch on Summit as well? What does mpirun -N 1 on 8 nodes give there? For me, the develop branch (from Feb 25) also does not converge; I am not sure about the latest develop branch.
This issue has been resolved. The problem was probably fixed by two things:
The following run converges:
mpirun -np 8 ./benchmarking/bench_ras --executor=reference --num_iters=2000 --explicit_laplacian --set_1d_laplacian_size=128 --set_tol=1e-6 --local_tol=1e-12 --partition=regular --local_solver=iterative-ginkgo --enable_onesided --enable_flush=flush-local --write_comm_data --timings_file=subd
But the following doesn't:
mpirun -N 1 -np 8 ./benchmarking/bench_ras --executor=reference --num_iters=2000 --explicit_laplacian --set_1d_laplacian_size=128 --set_tol=1e-6 --local_tol=1e-12 --partition=regular --local_solver=iterative-ginkgo --enable_onesided --enable_flush=flush-local --write_comm_data --timings_file=subd
I am reserving the appropriate number of PEs in the second case, i.e., num_pe_pernode * 8. I am using Open MPI and Boost compiled with gcc/8.3.0.