hemelb-codes / hemelb

A high performance parallel lattice-Boltzmann code for large scale fluid flow in complex geometries
GNU Lesser General Public License v3.0

Unweighted Graphs Error with RBCs on ARCHER2 #815

Open c-denham opened 6 months ago

c-denham commented 6 months ago

Hello,

I have run a test case with RBCs on ARCHER2 and have copied the slurm.out below.

![0.0s]Reading configuration from /work/e283/e283/cd3nham/config_files/rbc_tests/test_clipped.xml
![0.0s]RBC insertion random seed: 0x17b81088669b7379
![0.0s]Krueger format meshes are deprecated, move to VTK when you can.
![0.0s]Beginning Initialisation.
![0.0s]Loading and decomposing geometry file /work/e283/e283/cd3nham/config_files/rbc_tests/test_clipped.gmy.
![0.0s]Opened config file /work/e283/e283/cd3nham/config_files/rbc_tests/test_clipped.gmy
![0.1s]Creating block-level octree
![0.1s]Beginning initial decomposition
![0.9s]Optimising the domain decomposition.
![4.1s]Initialising domain.
![4.1s]Processing sites assigned to each MPI process
![4.3s]Assigning local indices to sites and associated data
![4.3s]Initialising neighbour lookups
![5.0s]Initialising field data.
![5.0s]Initialising neighbouring data manager.
![5.0s]Initialising LBM.
![5.0s]Initialising RBCs.
![5.0s]Krueger format meshes are deprecated, move to VTK when you can.
![5.0s]Computing which ranks are within a cell's size
![160.2s]Checking the neighbourhoods are self-consistent
![316.0s]Create the graph communicator
![316.0s]Creating coordinate to rank map
![317.2s]Beginning to run simulation.
[Rank 0000000, 317.2s, mem: 0052484]: Only support unweighted graphs
MPICH ERROR [Rank 0] [job id 5754290.0] [Wed Feb 28 15:35:53 2024] [nid003862] - Abort(-1) (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
[Rank 0000001, 317.2s, mem: 0058032]: Only support unweighted graphs
MPICH ERROR [Rank 1] [job id 5754290.0] [Wed Feb 28 15:35:53 2024] [nid003862] - Abort(-1) (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
MPICH ERROR [Rank 2] [job id 5754290.0] [Wed Feb 28 15:35:53 2024] [nid003862] - Abort(-1) (rank 2 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 2aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 2
[Rank 0000003, 2.4s, mem: 0031436]: ParMetis cut 15884 edges.
[Rank 0000003, 317.2s, mem: 0050612]: Only support unweighted graphs
MPICH ERROR [Rank 3] [job id 5754290.0] [Wed Feb 28 15:35:53 2024] [nid003862] - Abort(-1) (rank 3 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3
[Rank 0000002, 317.2s, mem: 0061888]: Only support unweighted graphs
srun: error: nid003862: tasks 0-3: Exited with exit code 255
srun: launch/slurm: _step_signal: Terminating StepId=5754290.0

I do not have a copy of the output from when we compiled it last week, @rupertnash, but I recall the unweighted graphs error appearing during compilation but not during the fluid-only test case. Many thanks in advance for your advice.

mobernabeu commented 6 months ago

It looks as though MpiCommunicator::DistGraphAdjacent is creating an unweighted graph, but when the graph is later queried in MpiCommunicator::GetNeighborsCount it appears to be weighted.

This may be an MPI implementation issue, or we may not be using the creation interface correctly. It needs further investigation.
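For concreteness, the interplay described above presumably reduces to the two MPI calls sketched below. This is only an illustration: the free functions and names here are hypothetical stand-ins for whatever MpiCommunicator::DistGraphAdjacent and MpiCommunicator::GetNeighborsCount do internally, not code taken from HemeLB.

```cpp
#include <mpi.h>

#include <vector>

// Sketch: create a distributed graph communicator with no edge weights, then
// query it. Per the MPI standard, "weighted" must come back false because
// MPI_UNWEIGHTED was supplied at creation; the "Only support unweighted
// graphs" abort fires when it unexpectedly comes back true.
MPI_Comm CreateDistGraph(MPI_Comm parent, const std::vector<int>& neighbours) {
    const int n = static_cast<int>(neighbours.size());
    MPI_Comm graphComm;
    MPI_Dist_graph_create_adjacent(parent,
                                   n, neighbours.data(), MPI_UNWEIGHTED,
                                   n, neighbours.data(), MPI_UNWEIGHTED,
                                   MPI_INFO_NULL, 0 /* no reorder */, &graphComm);
    return graphComm;
}

bool ReportsWeighted(MPI_Comm graphComm) {
    int indegree = 0, outdegree = 0, weighted = 0;
    MPI_Dist_graph_neighbors_count(graphComm, &indegree, &outdegree, &weighted);
    return weighted != 0;  // true here is what triggers the abort
}
```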

mobernabeu commented 6 months ago

A bit more digging shows that this (otherwise very sensible) check for weighted/unweighted only appeared when we moved from graph to distributed graph: https://github.com/hemelb-codes/hemelb/commit/9689943a8e780c2f2b47860dc43677abe54d0ae5

I don't see what we could be doing wrong in the distributed graph creation, so my suggestion would be to refine the logic of the check to assert that all weights are equal when the graph wrongly believes itself to be weighted (an implementation issue?). Any thoughts, @rupertnash?
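A sketch of what that relaxed check could look like, assuming nothing about the surrounding HemeLB code beyond having the communicator in hand: query the actual edge weights and treat the graph as unweighted when they are all identical, even if the weighted flag is (wrongly) set.

```cpp
#include <mpi.h>

#include <algorithm>
#include <functional>
#include <vector>

// Treat a dist-graph communicator as effectively unweighted if the weighted
// flag is false, or if every reported edge weight is identical (which is what
// a buggy MPI might report for a graph created with MPI_UNWEIGHTED).
bool EffectivelyUnweighted(MPI_Comm graphComm) {
    int indegree = 0, outdegree = 0, weighted = 0;
    MPI_Dist_graph_neighbors_count(graphComm, &indegree, &outdegree, &weighted);
    if (!weighted)
        return true;

    std::vector<int> sources(indegree), sourceWeights(indegree);
    std::vector<int> dests(outdegree), destWeights(outdegree);
    MPI_Dist_graph_neighbors(graphComm,
                             indegree, sources.data(), sourceWeights.data(),
                             outdegree, dests.data(), destWeights.data());

    auto allEqual = [](const std::vector<int>& w) {
        return std::adjacent_find(w.begin(), w.end(),
                                  std::not_equal_to<int>()) == w.end();
    };
    return allEqual(sourceWeights) && allEqual(destWeights);
}
```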

rupertnash commented 6 months ago

Can you put the case in the shared folder (/work/e283/e283/shared) so I can reproduce?

I'd like to run under a debugger to investigate, as this may be a bug in the MPI library (the standard is clear on what should happen: "false if MPI_UNWEIGHTED was supplied during creation, true otherwise").
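A hypothetical standalone reproducer for exactly that clause of the standard (nothing here is taken from HemeLB, and the build line assumes ARCHER2's Cray compiler wrapper) might look like the following; every rank should print weighted=0, so a 1 anywhere would point at the library rather than the application.

```cpp
// repro_distgraph.cpp -- build e.g. with: CC repro_distgraph.cpp -o repro
// run e.g. with: srun -n 4 ./repro
#include <mpi.h>

#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Simple ring neighbourhood, created explicitly with MPI_UNWEIGHTED.
    const int neighbours[2] = {(rank + size - 1) % size, (rank + 1) % size};
    MPI_Comm graph;
    MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                                   2, neighbours, MPI_UNWEIGHTED,
                                   2, neighbours, MPI_UNWEIGHTED,
                                   MPI_INFO_NULL, 0, &graph);

    int indegree = 0, outdegree = 0, weighted = 0;
    MPI_Dist_graph_neighbors_count(graph, &indegree, &outdegree, &weighted);
    std::printf("rank %d: indegree=%d outdegree=%d weighted=%d\n",
                rank, indegree, outdegree, weighted);

    MPI_Comm_free(&graph);
    MPI_Finalize();
    return 0;
}
```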

c-denham commented 6 months ago

I have copied the config_files folder over to the shared space; that should include everything you need to reproduce, @rupertnash.

mobernabeu commented 5 months ago

Hi @rupertnash, I was trying to help @c-denham make a bit of progress on this issue by investigating whether we can use an alternative MPI implementation available on ARCHER2 (e.g. Open MPI). Looking through module avail doesn't show anything obvious. Do you know if there's one?

I also tried swapping the default programming environment from gnu to PrgEnv-cray or PrgEnv-aocc in the hope that other versions of MPICH might be built there, but that seems broken at the system level currently.

rupertnash commented 5 months ago

So I have investigated and think this is likely to be a bug in the MPI library. I've reported it to the Helpdesk, who have passed it on to HPE's MPICH team. They have reproduced the behaviour in HemeLB and are trying to understand the problem.

I did not trigger the bug when running on a larger number of processors, however, so maybe try that? Disabling the check might be OK, although if the communicators have been corrupted somehow (whether internally or by hemelb) then things may go wrong later...

mobernabeu commented 5 months ago

Thanks for investigating further, @rupertnash. We are going to try running with a larger core count. How many did you go for?