jchelly closed this issue 4 years ago.
Given that you were running on 48 nodes, do you expect 144 instances of the "now building ..." message? I.e., are there 144 MPI processes?
Sorry, that was a mistake. I'm using 48 MPI processes on 24 nodes, so 144 instances of the message are expected if velociraptor has run three times (48 ranks × 3 invocations = 144).
Can you point to the log file (on cosma, I presume)? The hang is presumably because an MPI send/recv didn't complete.
The log is in /cosma7/data/dp004/jch/EAGLE-XL/DMONLY/Cosma7/L0150N2256/tests/default/logs/L0150N2256.1879257.out.
I've updated velociraptor and resubmitted the job in case any of your recent changes help with this.
I've been able to reproduce this with the latest velociraptor master running in ddt. At the time of the hang, all of the MPI ranks are in MPIBuildParticleExportListUsingMesh(): some are waiting at MPI_Sendrecv calls and some at MPI_Recv. Unfortunately, I had a typo in my CMake configuration, so I don't have debugging symbols in this run. I'll have to restart it to get more information.
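For what it's worth, the standard CMake way to get the symbols back for the rerun (generic CMake usage, nothing VELOCIraptor-specific) is to choose a build type that includes them:

```sh
# RelWithDebInfo keeps optimisation but adds -g, so ddt can map the hung
# stacks back to source lines; Debug would also work.
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo /path/to/velociraptor
make -j
```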
I think the problem here is that different ranks disagree about whether communications need to be split up. Adding
MPI_Allreduce(MPI_IN_PLACE, &bufferFlag, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
just after bufferFlag is calculated might help.
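To make the failure mode concrete, here is a minimal sketch of the pattern being suggested (the variable names, chunk limit, and export counts are invented for illustration; this is not VELOCIraptor's actual code). A purely local decision about splitting the exchange can leave ranks performing different numbers of communication rounds; taking the MPI_MAX across ranks makes the decision collective:

```cpp
// Illustrative sketch only: invented names, not VELOCIraptor internals.
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Pretend each rank has a rank-dependent amount of data to export.
    const long maxChunk = 1024;                   // assumed per-message limit
    long nExport = (rank % 2 == 0) ? 500 : 3000;  // differs across ranks on purpose

    // Purely local decision: does this rank think the exchange must be
    // split into multiple rounds?
    int bufferFlag = (nExport > maxChunk) ? 1 : 0;

    // Without the line below, even ranks would post a single exchange while
    // odd ranks loop over several, so some ranks end up blocked in MPI_Recv
    // waiting for a message that is never sent -- the symptom seen in ddt.
    MPI_Allreduce(MPI_IN_PLACE, &bufferFlag, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);

    if (bufferFlag) {
        // ...chunked exchange: every rank now loops over the same number of rounds
    } else {
        // ...single exchange per pair of ranks
    }
    printf("rank %d: bufferFlag = %d\n", rank, bufferFlag);

    MPI_Finalize();
    return 0;
}
```

The key point is that the branch on bufferFlag determines how many send/receive rounds a rank performs, so every rank must take the same branch.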
Seems to be fixed by #75.
I'm trying to run a 2256^3 dark matter only simulation with Swift, using velociraptor on the fly. I'm finding that the code sometimes hangs in the function MPIBuildParticleExportListUsingMesh() when velociraptor is called. In my most recent run, the first two velociraptor calls completed, but the third got as far as reporting "now building exported particle list for FOF search" and then produced no further output for about 10 hours.
From the log, I think all of the processes had entered MPIBuildParticleExportListUsingMesh() before they got stuck: I'm running on 48 nodes, and there were 144 instances of the "now building exported particle" message in the log.
I'm running commit 8da1f94dae758a86d9ec0589c4cb2426c3dd23ec from the master branch, and it's configured with:
The config file is the same as examples/sample_swiftdm_3dfof_subhalo.cfg from the Velociraptor repository, except that I set MPI_number_of_tasks_per_write to a large value.
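For concreteness, that override is a single key=value line in the .cfg file; a sketch with a placeholder value (the actual value used isn't quoted in this thread):

```
# Placeholder: the report only says this was set "to a large value".
MPI_number_of_tasks_per_write=10000
```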