Hm, I am not really sure what the problem is.
Problem Size: 16384
Regular 1D partition
Local iterative solve with Ginkgo CG
residual norm nan
relative residual norm of solution nan
Time taken for solve 597.05
Did not converge in 2000 iterations.
Increasing the iteration count to 5000 did not help. A regular2d partition (with 16 PEs) doesn't converge either.
However, I noticed that changing the number of PEs from 8 to 4 made it converge, i.e., the following converged:
mpirun -N 1 -np 4 ./benchmarking/bench_ras --executor=reference --num_iters=2000 --explicit_laplacian --set_1d_laplacian_size=128 --set_tol=1e-6 --local_tol=1e-12 --partition=regular --local_solver=iterative-ginkgo --enable_onesided --enable_flush=flush-local --write_comm_data --timings_file=subd
Were you able to reproduce the problem?
The -N flag and the -np flag may be conflicting. How many nodes have you reserved? On my local system, this is what mpirun -h gives me:
-N <arg0> Launch n processes per node on all allocated nodes
(synonym for npernode)
-n|--n <arg0> Number of processes to run
So, I think if you reserve 4 nodes through your batch submission system and put -N1 without the -np flag, that may be sufficient?
No, I don't have a system with qsub. With Slurm I can just select the number of nodes and the tasks per node and submit it through srun. I don't have to use mpirun, so I don't have this problem.
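For example (the node and task counts here are only illustrative), an equivalent Slurm launch would be something like:
srun -N 4 --ntasks-per-node=1 ./benchmarking/bench_ras <same options as above>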
This is the description of mpirun on my system:
-N <arg0> Launch n processes per node on all allocated nodes
(synonym for 'map-by node')
-n|--n <arg0> Number of processes to run
I removed the -np flag and it seems to converge now for the event-based branch with a threshold of 0. For non-zero thresholds, the problem persists! I am running more tests and will update what happens.
If you want to check whether your job distribution is as you intended, for example 1 process per node on 4 nodes and not 4 processes on one single node, it might be easier to write a small standalone test with MPI, using MPI_Comm_split_type to get the ranks local to a node, as such:
MPI_Comm local_comm;
// Group the ranks of MPI_COMM_WORLD that share a node (shared memory)
// into one sub-communicator per node.
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                    MPI_INFO_NULL, &local_comm);
// The rank within local_comm is the node-local rank of this process.
int local_rank;
MPI_Comm_rank(local_comm, &local_rank);
This way you can be sure of the rank placement. Here is the documentation for MPI_Comm_split_type.
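If it helps, a minimal standalone test along those lines could look like this (a sketch of my own, not code from the repository), printing the world rank and the node-local rank of every process:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Ranks that share a node end up in the same local communicator.
    MPI_Comm local_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &local_comm);
    int local_rank;
    MPI_Comm_rank(local_comm, &local_rank);

    printf("world rank %d has node-local rank %d\n", world_rank, local_rank);

    MPI_Comm_free(&local_comm);
    MPI_Finalize();
    return 0;
}

With 1 process per node, every rank should report node-local rank 0; with several processes packed onto one node, they will report 0, 1, 2, and so on.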
If you want finer control and have OpenMPI, this is a good introduction. There should be something similar for other MPI implementations as well.
I checked the job distribution by printing the hostname, and every process has a different hostname with -N, which confirms that they are assigned to different nodes. I ran more tests but the problem is still there!
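A hostname check like that can be done with MPI_Get_processor_name, for example (a sketch assuming MPI is already initialized; variable names are my own):

char node_name[MPI_MAX_PROCESSOR_NAME];
int name_len, rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Get_processor_name(node_name, &name_len);
// With mpirun -N 1, every rank should report a different hostname.
printf("rank %d runs on %s\n", rank, node_name);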
Can you run the following tests? Modify cpu_batch and run_script to suit your system, keeping the parameters for constant and gamma as below:
1. Set constant=0 and gamma=0 and run it, keeping the other parameters in run_script fixed. See if it converges (it does for me).
2. Set constant=1e-10 and gamma=0.95 and run it, keeping the other parameters in run_script fixed (this threshold is so low that communication should be triggered in every iteration, making it similar to the previous case). See if it converges (it doesn't for me!).
I found that the residual values (and possibly solution values) are becoming nan on some processes. Is there some possibility of a divide-by-zero occurring?
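One way to narrow that down would be to scan the local residual vector for non-finite entries before the norm is taken; a small sketch (values and n are placeholder names, not from the benchmark):

#include <cmath>
#include <cstdio>

// Report the first nan/inf entry in a local vector, if any.
void check_finite(const double* values, int n, int rank)
{
    for (int i = 0; i < n; ++i) {
        if (!std::isfinite(values[i])) {
            printf("rank %d: entry %d is non-finite (%g)\n", rank, i, values[i]);
            return;
        }
    }
}

In CG specifically, a zero denominator in the step-size computation (e.g. p^T A p, or the previous residual dot product used for beta) would produce an inf or nan that then propagates into the residual norm, so that is one plausible place a divide-by-zero could occur.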
With mpirun -N 1 and 4 nodes, so 4 subdomains, both cases (constant=1e-10 with gamma=0.95, and constant=0 with gamma=0) converge for me.
With mpirun -np 8 and 1 node, both converge as well. With mpirun -N 1 and 8 nodes, both converge as well. With mpirun -N 4 and 2 nodes, both converge as well.
So, I don't seem to have a problem with convergence. I am not sure if the process distribution is proper on this system, but I will check tomorrow on Summit again.
After checking again on Summit, with non-zero constant and non-zero gamma, nothing converges for me. For zero constant and zero gamma, every configuration converges.
The residual norm does not become nan, though; nevertheless, I think there is some bug in the event-based communication.
Oh! With the same code converging on one system and not converging on another, it will be hard to debug!
Did you run the develop branch on Summit as well? What does mpirun -N 1 on 8 nodes give there? For me, the develop branch (from Feb 25) also does not converge; I am not sure about the latest develop branch.
This issue has been resolved. The problem was probably fixed by two things:
The following run converges:
mpirun -np 8 ./benchmarking/bench_ras --executor=reference --num_iters=2000 --explicit_laplacian --set_1d_laplacian_size=128 --set_tol=1e-6 --local_tol=1e-12 --partition=regular --local_solver=iterative-ginkgo --enable_onesided --enable_flush=flush-local --write_comm_data --timings_file=subd
But the following doesn't:
mpirun -N 1 -np 8 ./benchmarking/bench_ras --executor=reference --num_iters=2000 --explicit_laplacian --set_1d_laplacian_size=128 --set_tol=1e-6 --local_tol=1e-12 --partition=regular --local_solver=iterative-ginkgo --enable_onesided --enable_flush=flush-local --write_comm_data --timings_file=subd
I am reserving the appropriate number of PEs in the second case, i.e., num_pe_pernode * 8. I am using Open MPI and Boost compiled with gcc/8.3.0.