Alternative, simpler setup that can reproduce the issue on a smaller scale:

Run both options A and B from the following config, e.g. with

```bash
OMP_NUM_THREADS mpirun -n 16 --oversubscribe src/MarDyn config.xml | tee output_OptX.out
```

Then compare the two outputs via:

```bash
grep -Erni -A 5 'Decomposition took' output*.out
```
The output from `GeneralDomainDecomposition` should show a longer Decomposition time and, in particular, a significantly higher Communication time. This value is measured by the timer `SIMULATION_MPI_OMP_COMMUNICATION`, which only measures the call to `_domainDecomposition->balanceAndExchangeInitNonBlocking()` in `Simulation.cpp`.
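Schematically, the measurement scopes only that single call. A minimal sketch (not the literal code in `Simulation.cpp`; the timer API follows the `getTimer(...)`/start/stop pattern described under "Adding timers" below):

```cpp
// Sketch only: the SIMULATION_MPI_OMP_COMMUNICATION timer brackets just this call,
// so everything inside balanceAndExchangeInitNonBlocking() shows up as Communication time.
// The exact surrounding code and call arguments in Simulation.cpp may differ.
global_simulation->timers()->getTimer("SIMULATION_MPI_OMP_COMMUNICATION")->start();
_domainDecomposition->balanceAndExchangeInitNonBlocking(/* ... */);
global_simulation->timers()->getTimer("SIMULATION_MPI_OMP_COMMUNICATION")->stop();
```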
The difference from the perspective of the decompositions is that while both call `DomainDecompMPIBase::exchangeMoleculesMPI()`, they use different modes:

- `DomainDecomposition`: `LEAVING_AND_HALO_COPIES`
- `GeneralDomainDecomposition`: `LEAVING_ONLY` and `HALO_COPIES`

This, however, shouldn't really make a difference, because the two calls from GDD should do exactly the same as the one call from DD. The latter also invokes `moleculeContainer->deleteOuterParticles()`, but GDD also does this between its two calls.
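For clarity, a rough sketch of the two call sequences being compared (based on the calls named above and in the diff further below; not verbatim source):

```cpp
// DomainDecomposition: one combined exchange.
DomainDecompMPIBase::exchangeMoleculesMPI(moleculeContainer, domain, LEAVING_AND_HALO_COPIES);

// GeneralDomainDecomposition, rebalance path: the exchange is split in two,
// with the halo cleanup and the rebalancing steps in between.
DomainDecompMPIBase::exchangeMoleculesMPI(moleculeContainer, domain, LEAVING_ONLY);
moleculeContainer->deleteOuterParticles();
// ... rebalance, latchToGridSize, migrateParticles, initCommPartners ...
DomainDecompMPIBase::exchangeMoleculesMPI(moleculeContainer, domain, HALO_COPIES);
```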
The non-rebalance case in `GeneralDomainDecomposition` is exactly the same as in `DomainDecomposition` and thus equally fast (confirmed by tests!).
Throwing counters around every single method call in the rebalance part of `GeneralDomainDecomposition::balanceAndExchange` indicates that `GeneralDomainDecomposition::initCommPartners` takes by far the most time.

Rough sketch of how this was done:
```cpp
// Per-rank counters: iteration counts and accumulated nanoseconds.
// Requires <chrono>, <map>, <string>.
std::map<std::string, long> counters{
	{"Rebalancing - numIterations", 0},
	{"Rebalancing - exchangeMoleculesMPI 1", 0},
	{"Rebalancing - deleteOuterParticles", 0},
	{"Rebalancing - rebalance", 0},
	{"Rebalancing - latchToGridSize", 0},
	{"Rebalancing - migrateParticles", 0},
	{"Rebalancing - initCommPartners", 0},
	{"Rebalancing - exchangeMoleculesMPI 2", 0},
	{"SendLeavingWithCopies - numIterations", 0},
	{"SendLeavingWithCopies - exchangeMoleculesMPI", 0},
};

// Each call of interest is wrapped like this:
auto start = std::chrono::high_resolution_clock::now();
DomainDecompMPIBase::exchangeMoleculesMPI(moleculeContainer, domain, LEAVING_ONLY);
counters["Rebalancing - exchangeMoleculesMPI 1"] +=
	std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
```
Dump everything in the destructor into one file per rank:
```cpp
GeneralDomainDecomposition::~GeneralDomainDecomposition() {
	int myRank{};
	MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
	// One counter file per rank: counters_r0, counters_r1, ...
	std::ofstream logFile{"counters_r" + std::to_string(myRank)};
	for (const auto &[name, value] : counters) {
		logFile << name << " = " << value << "\n";
	}
}
```
```bash
# Sort each rank's counter file by the accumulated value (3rd field, descending),
# then show the top entries per rank.
for f in counters_r*
do
	cat $f \
		| column -t -s '-=' -o '|' \
		| sort -t '|' -k3,3gr -o $f
done
head -n 3 counters_r*
```
Extensive timers were added to

- `GeneralDomainDecomposition::balanceAndExchange()`
- `DirectNeighbourCommunicationScheme::initCommunicationPartners()`
- `NeighborAcquirer::acquireNeighbors()`

as sketched in the following diff:
```diff
diff --git a/src/parallel/GeneralDomainDecomposition.cpp b/src/parallel/GeneralDomainDecomposition.cpp
index efd6aa750..e323e0c07 100644
--- a/src/parallel/GeneralDomainDecomposition.cpp
+++ b/src/parallel/GeneralDomainDecomposition.cpp
@@ -69,7 +69,16 @@ void GeneralDomainDecomposition::initializeALL() {
double GeneralDomainDecomposition::getBoundingBoxMin(int dimension, Domain* /*domain*/) { return _boxMin[dimension]; }
-GeneralDomainDecomposition::~GeneralDomainDecomposition() = default;
+GeneralDomainDecomposition::~GeneralDomainDecomposition() {
+
+ int myRank{};
+ MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
+
+ std::ofstream logFile{"counters_r" + std::to_string(myRank)};
+ for (const auto &[name, value] : counters) {
+ logFile << name << " = " << value << std::endl;
+ }
+}
double GeneralDomainDecomposition::getBoundingBoxMax(int dimension, Domain* /*domain*/) { return _boxMax[dimension]; }
@@ -80,9 +89,31 @@ bool GeneralDomainDecomposition::queryRebalancing(size_t step, size_t updateFreq
void GeneralDomainDecomposition::balanceAndExchange(double lastTraversalTime, bool forceRebalancing,
ParticleContainer* moleculeContainer, Domain* domain) {
+
+ if (counters.empty()) {
+ counters["Rebalancing - numIterations"] = 0;
+ counters["Rebalancing - exchangeMoleculesMPI 1"] = 0;
+ counters["Rebalancing - deleteOuterParticles"] = 0;
+ counters["Rebalancing - rebalance"] = 0;
+ counters["Rebalancing - latchToGridSize"] = 0;
+ counters["Rebalancing - migrateParticles"] = 0;
+ counters["Rebalancing - initCommPartners"] = 0;
+ counters["Rebalancing - exchangeMoleculesMPI 2"] = 0;
+ counters["SendLeavingWithCopies - numIterations"] = 0;
+ counters["SendLeavingWithCopies - exchangeMoleculesMPI"] = 0;
+ }
+
const bool rebalance =
queryRebalancing(_steps, _rebuildFrequency, _initPhase, _initFrequency, lastTraversalTime) or forceRebalancing;
+ const auto moep = [&](const auto &msg) {
+ int myRank{0};
+ MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
+ if (myRank == 0 ) {
+ std::cout << msg << "\n";
+ }
+ };
if (_steps == 0) {
+ moep("_steps == 0");
// ensure that there are no outer particles
moleculeContainer->deleteOuterParticles();
// init communication partners
@@ -90,11 +121,16 @@ void GeneralDomainDecomposition::balanceAndExchange(double lastTraversalTime, bo
DomainDecompMPIBase::exchangeMoleculesMPI(moleculeContainer, domain, HALO_COPIES);
} else {
if (rebalance) {
+ ++counters["Rebalancing - numIterations"];
// first transfer leaving particles
+ auto start = std::chrono::high_resolution_clock::now();
DomainDecompMPIBase::exchangeMoleculesMPI(moleculeContainer, domain, LEAVING_ONLY);
+ counters["Rebalancing - exchangeMoleculesMPI 1"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
```
The consistently highest count is in `GeneralDomainDecomposition::balanceAndExchange()`, and within it the call to `DomainDecompMPIBase::exchangeMoleculesMPI()` in the non-rebalance path (`sendLeavingWithCopies() == true`). This, however, is expected, since it is called in every non-rebalance iteration.
The second highest is quite consistently the call to `initCommPartners` in the `rebalance` branch. When divided by the number of invocations, the call to `initCommPartners` is several times more expensive per call.
The suspected offender is the repeated all-to-all communication in `acquireNeighbors()` (especially `MPI_Allgather(&num_bytes_send, 1, MPI_INT, num_bytes_receive_vec.data(), 1, MPI_INT, comm);`), which is not called when using `DomainDecomposition`.
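To illustrate why this is suspicious: the call is a collective over the full communicator, so every rebalance involves all ranks, not just spatial neighbours. A standalone sketch of that pattern (not ls1-MarDyn code; variable names taken from the quoted call):

```cpp
// Standalone sketch of the suspected pattern; not ls1-MarDyn code.
// Every rank announces its send size; MPI_Allgather is a collective over the
// whole communicator, so its cost grows with the number of ranks and it is
// repeated on every rebalance with GeneralDomainDecomposition.
#include <mpi.h>

#include <iostream>
#include <vector>

int main(int argc, char** argv) {
	MPI_Init(&argc, &argv);
	MPI_Comm comm = MPI_COMM_WORLD;

	int numRanks = 0;
	int myRank = 0;
	MPI_Comm_size(comm, &numRanks);
	MPI_Comm_rank(comm, &myRank);

	// In acquireNeighbors() this would be the serialized size of this rank's data.
	int num_bytes_send = 100 + myRank;  // dummy payload size
	std::vector<int> num_bytes_receive_vec(numRanks);

	// The quoted call: every rank learns the send size of every other rank.
	MPI_Allgather(&num_bytes_send, 1, MPI_INT, num_bytes_receive_vec.data(), 1, MPI_INT, comm);

	if (myRank == 0) {
		std::cout << "rank 0 received " << num_bytes_receive_vec.size() << " size announcements\n";
	}

	MPI_Finalize();
	return 0;
}
```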
Using load balancing, `GeneralDomainDecomposition` is slower than `DomainDecomposition`, opposite to what is reported.

### Used version

ls1-MarDyn version 1.2.0_master_98d77b7d3, commit 98d77b7d31073a12f0166f6f7315767b5f294616, compiled with gcc 12
### Main Problem

The problem was tested using the droplet coalescence example. The issue was found with the last config to be run, `examples/DropletCoalescene/vle/config_5_droplet.xml`, on 4 nodes with 72 tasks per node on HSuper. The process should run faster if the `parallelisation` node is set to `GeneralDomainDecomposition` and the `loadBalancer` node is set to `all`; this should be faster than setting `parallelisation` to `DomainDecomposition`. The configurations used are shown in the simulation output (not in consecutive lines) as:
- `DomainDecomposition`
- `GeneralDomainDecomposition`
### Parallel Issues

Both configurations can be directly compared by checking the output line `Decomposition took`. For the previously described case there are two different measured times:

```
Decomposition took: 32.665 sec
Decomposition took: 9.26896 sec
```

These times were obtained with the settings given above.
### NeighbourCommunicationScheme

Starting at the `DomainDecompMPIBase` class, the issue can be traced to the routine `exchangeMoleculesMPI`, which calls the class member `_neighbourCommunicationsScheme` and its member function `exchangeMoleculesMPI`. Using the `Timer` class inside that function (found in `parallel/NeighbourCommunicationScheme.cpp`) for the `DirectNeighbourCommunicationScheme` derived class shows that most of the time is spent inside an if statement which checks whether `msgType == LEAVING_AND_HALO_COPIES`. Further investigation is required.

### Adding timers
If someone wants to add timers to check this issue, proceed as follows:

1. In `io/TimerProfiler.cpp`, modify the function `readInitialTimersFromFile` by adding your own category (optional) to `timerClasses` and your own timer to `timerAttrs` (mandatory).
2. In `Simulation.cpp`, modify `preSimLoopSteps` by adding the output strings for your respective timers. The names have to match in both files.
3. Access the timer registered with the `TimerProfiler` class in step 1 by using `global_simulation->timers()->getTimer("NAME")`, then use start and stop around the code of interest (see the sketch below).
4. If the timers were not added to a category in the `TimerProfiler`, they should be at the end.
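A minimal sketch of step 3, assuming a timer registered under the placeholder name `MY_ISSUE_TIMER` in steps 1 and 2 (the name and the surrounding code are illustrative; `getTimer` is assumed to return a pointer to a `Timer` exposing `start()`/`stop()`):

```cpp
// Sketch of step 3. MY_ISSUE_TIMER is a placeholder name that is assumed to have been
// registered in TimerProfiler (step 1) and wired up for output (step 2).
auto* timer = global_simulation->timers()->getTimer("MY_ISSUE_TIMER");
timer->start();
// ... code section to be measured, e.g. the msgType == LEAVING_AND_HALO_COPIES branch ...
timer->stop();
```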