ls1mardyn / ls1-mardyn

ls1-MarDyn is a massively parallel Molecular Dynamics (MD) code for large systems. Its main target is the simulation of thermodynamics and nanofluidics. ls1-MarDyn is designed with a focus on performance and easy extensibility.
http://www.ls1-mardyn.de

Load Balancing is slower #321

Open jose-pinzon0407 opened 5 months ago

jose-pinzon0407 commented 5 months ago

Using the load-balancing GeneralDomainDecomposition is slower than DomainDecomposition, the opposite of what is reported.

Used version: ls1-MarDyn 1.2.0_master_98d77b7d3, commit 98d77b7d31073a12f0166f6f7315767b5f294616, compiled with GCC 12

Main Problem

The problem was tested using the droplet coalescence example. The issue was found with the last script to be run in examples, DropletCoalescene/vle/config_5_droplet.xml, on 4 nodes with 72 tasks per node on HSuper. The run should be faster when the parallelisation node is set to GeneralDomainDecomposition and the loadBalancer node is set to all than when parallelisation is set to DomainDecomposition.

The configurations used are reported in the simulation output (not on consecutive lines) as:

DomainDecomposition

```
Parallelisation type: DomainDecomposition
DomainDecompMPIBase: Using IndirectCommunicationScheme
```

GeneralDomainDecomposition

```
Parallelisation type: GeneralDomainDecomposition
DomainDecompMPIBase: Using DirectCommunicationScheme with push-pull neighbors
Chosen Load Balancer: all
```

Parallel Issues

Both configurations can be compared directly by checking the output line Decomposition took. For the case described above, the two measured times differ; both were obtained with the settings given above.

NeighbourCommunicationScheme

Starting at the DomainDecompMPIBase class, the issue can be traced to the routine exchangeMoleculesMPI, which calls the member function exchangeMoleculesMPI of the class member _neighbourCommunicationsScheme. Timing that function (found in parallel/NeighbourCommunicationScheme.cpp) with the Timer class for the DirectNeighbourCommunicationScheme derived class shows that most of the time is spent inside the if statement that checks whether msgType == LEAVING_AND_HALO_COPIES. Further investigation is required.

Adding timers

If someone wants to add timers to investigate this issue, proceed as follows:

  1. In io/TimerProfiler.cpp, modify the function readInitialTimersFromFile by adding your own category (optional) to timerClasses and your own timer to timerAttrs (mandatory).
  2. In Simulation.cpp, modify preSimLoopSteps by adding the output strings for your respective timers. The names have to match in both files.
  3. In the file you want to measure, create a pointer to your timer using the name registered in TimerProfiler in step 1 via global_simulation->timers()->getTimer("NAME"), then use start and stop (see the sketch after this list).
  4. Your measured time should then be visible in the output; if you added your own category at the end of the list in TimerProfiler, the timings will appear at the end.
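A minimal sketch of step 3. The timer name "MY_ISSUE_TIMER" and the surrounding function are placeholders for whatever was registered in steps 1 and 2; it assumes global_simulation is accessible (declared in Simulation.h):

```cpp
// Sketch only: assumes a timer named "MY_ISSUE_TIMER" was registered in
// TimerProfiler (step 1) and given an output string in Simulation.cpp (step 2).
#include "Simulation.h"  // provides the global_simulation pointer

void functionToMeasure() {
    // fetch the timer by the name chosen in TimerProfiler
    auto* timer = global_simulation->timers()->getTimer("MY_ISSUE_TIMER");
    timer->start();
    // ... code to be measured ...
    timer->stop();
}
```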
FG-TUM commented 2 months ago

Alternative, simpler setup that reproduces the issue on a smaller scale:

Run both alternatives, option A and option B, from the following config.

config.xml

```xml
0.1 1 1 0.00182367 0.0 100 110 100 100 100 0.0 0.0 0.0 1 1 1 true 1 0 0 0 1 0 0 0 1 0 0.5 0.5 0.5 1 0.000501 0 0 0 100 100 100 SIMULATION_FORCE_CALCULATION SIMULATION_FORCE_CALCULATION 5 1 2.5 2.5 1.0e+10
```

Run with, e.g.:

```bash
OMP_NUM_THREADS mpirun -n 16 --oversubscribe src/MarDyn config.xml | tee output_OptX.out
```

Then compare the two outputs via:

```bash
grep -Erni -A 5 'Decomposition took' output*.out
```

The output from GeneralDomainDecomposition should show a longer Decomposition time and, in particular, a significantly higher Communication time. The latter is measured by the timer SIMULATION_MPI_OMP_COMMUNICATION, which only covers the call to _domainDecomposition->balanceAndExchangeInitNonBlocking() in Simulation.cpp.

FG-TUM commented 2 months ago

The difference from the perspective of the decompositions is that, while both call DomainDecompMPIBase::exchangeMoleculesMPI(), they use different modes (see the sketch below).

This, however, shouldn't really make a difference, because the two calls from GDD should do exactly the same as the one call from DD. The latter also invokes moleculeContainer->deleteOuterParticles(), but GDD also does this between its two calls.
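A rough, excerpt-style sketch of the two call sequences, reconstructed from the comments here and the diff further below (not self-contained code):

```cpp
// DomainDecomposition (DD): one combined exchange per step
// (DD also invokes moleculeContainer->deleteOuterParticles(), per the comment above)
DomainDecompMPIBase::exchangeMoleculesMPI(moleculeContainer, domain, LEAVING_AND_HALO_COPIES);

// GeneralDomainDecomposition (GDD), rebalance step: two separate exchanges
DomainDecompMPIBase::exchangeMoleculesMPI(moleculeContainer, domain, LEAVING_ONLY);
moleculeContainer->deleteOuterParticles();
// ... rebalance(), migrateParticles(), initCommPartners() ...
DomainDecompMPIBase::exchangeMoleculesMPI(moleculeContainer, domain, HALO_COPIES);
```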

FG-TUM commented 2 months ago

The non-rebalance case in GeneralDomainDecomposition is exactly the same as in DomainDecomposition and thus also equally fast (confirmed by tests!).

Throwing counters around every single method call in the rebalance part of GeneralDomainDecomposition::balanceAndExchange indicates that GeneralDomainDecomposition::initCommPartners takes the most time by a lot.

Rough sketch of how this was done:

  1. Add a class member and initialize it:
    std::map<std::string, long> counters{
       {"Rebalancing - numIterations", 0},
       {"Rebalancing - exchangeMoleculesMPI 1", 0},
       {"Rebalancing - deleteOuterParticles", 0},
       {"Rebalancing - rebalance", 0},
       {"Rebalancing - latchToGridSize", 0},
       {"Rebalancing - migrateParticles", 0},
       {"Rebalancing - initCommPartners", 0},
       {"Rebalancing - exchangeMoleculesMPI 2", 0},
       {"SendLeavingWithCopies - numIterations", 0},
       {"SendLeavingWithCopies - exchangeMoleculesMPI", 0},
    };
  2. Add measurements around every method call e.g.:
    auto start = std::chrono::high_resolution_clock::now();
    DomainDecompMPIBase::exchangeMoleculesMPI(moleculeContainer, domain, LEAVING_ONLY);
    counters["Rebalancing - exchangeMoleculesMPI 1"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
  3. Dump everything in the destructor into one file per rank:

    GeneralDomainDecomposition::~GeneralDomainDecomposition() {
        // needs #include <fstream> in GeneralDomainDecomposition.cpp
        int myRank{};
        MPI_Comm_rank(MPI_COMM_WORLD, &myRank);

        // one counter dump file per rank, e.g. counters_r0, counters_r1, ...
        std::ofstream logFile{"counters_r" + std::to_string(myRank)};
        for (const auto &[name, value] : counters) {
            logFile << name << " = " << value << "\n";
        }
    }
  4. Pretty-print and investigate the top 3 offenders per Rank:
    for f in counters_r* 
    do
     cat $f \
       | column -t -s '-=' -o '|' \
       | sort -t '|' -k3,3gr -o $f
    done
    head -n 3 counters_r*
FG-TUM commented 2 months ago

After adding extensive timers (see the diff below), the following was observed.

diff file to add the counters

```diff diff --git a/src/parallel/GeneralDomainDecomposition.cpp b/src/parallel/GeneralDomainDecomposition.cpp index efd6aa750..e323e0c07 100644 --- a/src/parallel/GeneralDomainDecomposition.cpp +++ b/src/parallel/GeneralDomainDecomposition.cpp @@ -69,7 +69,16 @@ void GeneralDomainDecomposition::initializeALL() { double GeneralDomainDecomposition::getBoundingBoxMin(int dimension, Domain* /*domain*/) { return _boxMin[dimension]; } -GeneralDomainDecomposition::~GeneralDomainDecomposition() = default; +GeneralDomainDecomposition::~GeneralDomainDecomposition() { + + int myRank{}; + MPI_Comm_rank(MPI_COMM_WORLD, &myRank); + + std::ofstream logFile{"counters_r" + std::to_string(myRank)}; + for (const auto &[name, value] : counters) { + logFile << name << " = " << value << std::endl; + } +} double GeneralDomainDecomposition::getBoundingBoxMax(int dimension, Domain* /*domain*/) { return _boxMax[dimension]; } @@ -80,9 +89,31 @@ bool GeneralDomainDecomposition::queryRebalancing(size_t step, size_t updateFreq void GeneralDomainDecomposition::balanceAndExchange(double lastTraversalTime, bool forceRebalancing, ParticleContainer* moleculeContainer, Domain* domain) { + + if (counters.empty()) { + counters["Rebalancing - numIterations"] = 0; + counters["Rebalancing - exchangeMoleculesMPI 1"] = 0; + counters["Rebalancing - deleteOuterParticles"] = 0; + counters["Rebalancing - rebalance"] = 0; + counters["Rebalancing - latchToGridSize"] = 0; + counters["Rebalancing - migrateParticles"] = 0; + counters["Rebalancing - initCommPartners"] = 0; + counters["Rebalancing - exchangeMoleculesMPI 2"] = 0; + counters["SendLeavingWithCopies - numIterations"] = 0; + counters["SendLeavingWithCopies - exchangeMoleculesMPI"] = 0; + } + const bool rebalance = queryRebalancing(_steps, _rebuildFrequency, _initPhase, _initFrequency, lastTraversalTime) or forceRebalancing; + const auto moep = [&](const auto &msg) { + int myRank{0}; + MPI_Comm_rank(MPI_COMM_WORLD, &myRank); + if (myRank == 0 ) { + std::cout << msg << "\n"; + } + }; if (_steps == 0) { + moep("_steps == 0"); // ensure that there are no outer particles moleculeContainer->deleteOuterParticles(); // init communication partners @@ -90,11 +121,16 @@ void GeneralDomainDecomposition::balanceAndExchange(double lastTraversalTime, bo DomainDecompMPIBase::exchangeMoleculesMPI(moleculeContainer, domain, HALO_COPIES); } else { if (rebalance) { + ++counters["Rebalancing - numIterations"]; // first transfer leaving particles + auto start = std::chrono::high_resolution_clock::now(); DomainDecompMPIBase::exchangeMoleculesMPI(moleculeContainer, domain, LEAVING_ONLY); + counters["Rebalancing - exchangeMoleculesMPI 1"] += std::chrono::duration_cast(std::chrono::high_resolution_clock::now() - start).count(); // ensure that there are no outer particles + start = std::chrono::high_resolution_clock::now(); moleculeContainer->deleteOuterParticles(); + counters["Rebalancing - deleteOuterParticles"] += std::chrono::duration_cast(std::chrono::high_resolution_clock::now() - start).count(); // rebalance Log::global_log->debug() << "rebalancing..." 
<< std::endl; @@ -102,13 +138,21 @@ void GeneralDomainDecomposition::balanceAndExchange(double lastTraversalTime, bo Log::global_log->set_mpi_output_all(); Log::global_log->debug() << "work:" << lastTraversalTime << std::endl; Log::global_log->set_mpi_output_root(0); + start = std::chrono::high_resolution_clock::now(); auto [newBoxMin, newBoxMax] = _loadBalancer->rebalance(lastTraversalTime); + counters["Rebalancing - rebalance"] += std::chrono::duration_cast(std::chrono::high_resolution_clock::now() - start).count(); + if (_latchGridSize.has_value()) { + start = std::chrono::high_resolution_clock::now(); std::tie(newBoxMin, newBoxMax) = latchToGridSize(newBoxMin, newBoxMax); + counters["Rebalancing - latchToGridSize"] += std::chrono::duration_cast(std::chrono::high_resolution_clock::now() - start).count(); + } // migrate the particles, this will rebuild the moleculeContainer! Log::global_log->debug() << "migrating particles" << std::endl; + start = std::chrono::high_resolution_clock::now(); migrateParticles(domain, moleculeContainer, newBoxMin, newBoxMax); + counters["Rebalancing - migrateParticles"] += std::chrono::duration_cast(std::chrono::high_resolution_clock::now() - start).count(); #ifndef MARDYN_AUTOPAS // The linked cells container needs this (I think just to set the cells to valid...) @@ -121,13 +165,24 @@ void GeneralDomainDecomposition::balanceAndExchange(double lastTraversalTime, bo // init communication partners Log::global_log->debug() << "updating communication partners" << std::endl; + start = std::chrono::high_resolution_clock::now(); initCommPartners(moleculeContainer, domain); + counters["Rebalancing - initCommPartners"] += std::chrono::duration_cast(std::chrono::high_resolution_clock::now() - start).count(); + Log::global_log->debug() << "rebalancing finished" << std::endl; + start = std::chrono::high_resolution_clock::now(); DomainDecompMPIBase::exchangeMoleculesMPI(moleculeContainer, domain, HALO_COPIES); + counters["Rebalancing - exchangeMoleculesMPI 2"] += std::chrono::duration_cast(std::chrono::high_resolution_clock::now() - start).count(); + } else { if (sendLeavingWithCopies()) { + ++counters["SendLeavingWithCopies - numIterations"]; + const auto start = std::chrono::high_resolution_clock::now(); DomainDecompMPIBase::exchangeMoleculesMPI(moleculeContainer, domain, LEAVING_AND_HALO_COPIES); + counters["SendLeavingWithCopies - exchangeMoleculesMPI"] += std::chrono::duration_cast(std::chrono::high_resolution_clock::now() - start).count(); + } else { + moep("not sendLeavingWithCopies"); DomainDecompMPIBase::exchangeMoleculesMPI(moleculeContainer, domain, LEAVING_ONLY); moleculeContainer->deleteOuterParticles(); DomainDecompMPIBase::exchangeMoleculesMPI(moleculeContainer, domain, HALO_COPIES); diff --git a/src/parallel/GeneralDomainDecomposition.h b/src/parallel/GeneralDomainDecomposition.h index 12dd37f02..253610c38 100644 --- a/src/parallel/GeneralDomainDecomposition.h +++ b/src/parallel/GeneralDomainDecomposition.h @@ -13,6 +13,7 @@ #include "particleContainer/ParticleContainer.h" #include "DomainDecompMPIBase.h" #include "LoadBalancer.h" +#include "map" /** * This decomposition is meant to be able to call arbitrary load balancers. 
@@ -94,6 +95,8 @@ public: throw std::runtime_error("GeneralDomainDecomposition::getNeighboursFromHaloRegion() not yet implemented"); } + std::map counters{}; + private: /** * Method that initializes the ALLLoadBalancer diff --git a/src/parallel/NeighborAcquirer.cpp b/src/parallel/NeighborAcquirer.cpp index 7f1d00784..6b74ba891 100644 --- a/src/parallel/NeighborAcquirer.cpp +++ b/src/parallel/NeighborAcquirer.cpp @@ -9,6 +9,30 @@ #include "Domain.h" #include "HaloRegion.h" +std::map NeighborAcquirer::counters{ + {"acquireNeighbors - init", 0}, + {"acquireNeighbors - fill send buffer", 0}, + {"acquireNeighbors - count global receive bytes (Allgather)", 0}, + {"acquireNeighbors - build stride", 0}, + {"acquireNeighbors - gather regions (Allgatherv)", 0}, + {"acquireNeighbors - parse received regions", 0}, + {"acquireNeighbors - merge char arrays", 0}, + {"acquireNeighbors - count regions to send (Allreduce)", 0}, + {"acquireNeighbors - send regions to send (Isend)", 0}, + {"acquireNeighbors - receive regions and deserialize (Get_count, Recv)", 0}, + {"acquireNeighbors - synchronize (Wait)", 0}, + {"acquireNeighbors - barrier (Barrier)", 0}, +}; +void NeighborAcquirer::dumpCounters() { + int myRank{}; + MPI_Comm_rank(MPI_COMM_WORLD, &myRank); + + std::ofstream logFile{"countersNeighborAcqu_r" + std::to_string(myRank)}; + for (const auto &[name, value] : counters) { + logFile << name << " = " << value << '\n'; + } +} + /* * 1. Initial Exchange of all desired regions. * 2. Each process checks whether he owns parts of the desired regions and will save those regions in partners02. @@ -21,13 +45,16 @@ std::tuple, std::vector> const std::array &globalDomainLength, HaloRegion *ownRegion, const std::vector &desiredRegions, const MPI_Comm &comm, bool excludeOwnRank) { + auto start = std::chrono::high_resolution_clock::now(); int my_rank{}; // my rank MPI_Comm_rank(comm, &my_rank); int num_processes{}; // the number of processes in comm MPI_Comm_size(comm, &num_processes); const auto num_regions = desiredRegions.size(); // the number of regions I would like to acquire from other processes + counters["acquireNeighbors - init"] += std::chrono::duration_cast(std::chrono::high_resolution_clock::now() - start).count(); + start = std::chrono::high_resolution_clock::now(); // tell the other processes how much you are going to send // how many bytes am I going to send to all the other processes const int num_bytes_send = @@ -54,24 +81,31 @@ std::tuple, std::vector> memcpy(outgoingDesiredRegionsVector.data() + bufferPosition, ®ion.width, sizeof(double)); bufferPosition += sizeof(double); } + counters["acquireNeighbors - fill send buffer"] += std::chrono::duration_cast(std::chrono::high_resolution_clock::now() - start).count(); + start = std::chrono::high_resolution_clock::now(); // set up structure information data for the Allgatherv operation // vector of number of bytes I am going to receive std::vector num_bytes_receive_vec(num_processes, 0); MPI_Allgather(&num_bytes_send, 1, MPI_INT, num_bytes_receive_vec.data(), 1, MPI_INT, comm); + counters["acquireNeighbors - count global receive bytes (Allgather)"] += std::chrono::duration_cast(std::chrono::high_resolution_clock::now() - start).count(); // vector of offsets (=displacement in MPI) in the receive buffer + start = std::chrono::high_resolution_clock::now(); std::vector num_bytes_displacements(num_processes, 0); int num_bytes_receive = 0; for (int j = 0; j < num_processes; j++) { num_bytes_displacements[j] = num_bytes_receive; num_bytes_receive += 
num_bytes_receive_vec[j]; } + counters["acquireNeighbors - build stride"] += std::chrono::duration_cast(std::chrono::high_resolution_clock::now() - start).count(); + start = std::chrono::high_resolution_clock::now(); std::vector incomingDesiredRegionsVector(num_bytes_receive); // the incoming byte buffer // send your regions MPI_Allgatherv(outgoingDesiredRegionsVector.data(), num_bytes_send, MPI_BYTE, incomingDesiredRegionsVector.data(), num_bytes_receive_vec.data(), num_bytes_displacements.data(), MPI_BYTE, comm); + counters["acquireNeighbors - gather regions (Allgatherv)"] += std::chrono::duration_cast(std::chrono::high_resolution_clock::now() - start).count(); std::vector numberOfRegionsToSendToRank(num_processes, 0); // outgoing row @@ -80,8 +114,8 @@ std::tuple, std::vector> sizeof(double) * 3 + sizeof(double) * 3 + sizeof(int) * 3 + sizeof(double) + sizeof(double) * 3; // the regions I own and want to send: ranks> std::vector>> sendingList(num_processes); - std::vector comm_partners02{}; + start = std::chrono::high_resolution_clock::now(); bufferPosition = 0; while (bufferPosition < num_bytes_receive /*== buffer length*/) { @@ -164,7 +198,9 @@ std::tuple, std::vector> } } } + counters["acquireNeighbors - parse received regions"] += std::chrono::duration_cast(std::chrono::high_resolution_clock::now() - start).count(); + start = std::chrono::high_resolution_clock::now(); std::vector> merged(num_processes); // Merge each list of char arrays into one char array for (int j = 0; j < num_processes; j++) { if (numberOfRegionsToSendToRank[j] > 0) { @@ -177,6 +213,7 @@ std::tuple, std::vector> merged[j] = std::move(mergedRegions); } } + counters["acquireNeighbors - merge char arrays"] += std::chrono::duration_cast(std::chrono::high_resolution_clock::now() - start).count(); // We cannot know how many regions we are going to receive from each process a-priori. // So we need to figure this out. @@ -202,11 +239,13 @@ std::tuple, std::vector> * In this case, process 0 will send 3 regions to process 2 and process 1 will send 2 regions to process 3. * After the Allreduce step every process has the information how many regions it will receive. */ + start = std::chrono::high_resolution_clock::now(); std::vector numberOfRegionsToReceive(num_processes, 0); // how many bytes does each process expect? 
MPI_Allreduce(numberOfRegionsToSendToRank.data(), numberOfRegionsToReceive.data(), num_processes, MPI_INT, MPI_SUM, comm); + counters["acquireNeighbors - count regions to send (Allreduce)"] += std::chrono::duration_cast(std::chrono::high_resolution_clock::now() - start).count(); // all the information for the final information exchange has been collected -> final exchange - + start = std::chrono::high_resolution_clock::now(); std::vector requests(num_processes, MPI_REQUEST_NULL); MPI_Status probe_status; MPI_Status rec_status; @@ -218,7 +257,9 @@ std::tuple, std::vector> &requests[j]); // tag is one } } + counters["acquireNeighbors - send regions to send (Isend)"] += std::chrono::duration_cast(std::chrono::high_resolution_clock::now() - start).count(); + start = std::chrono::high_resolution_clock::now(); std::vector comm_partners01; // the communication partners // receive data (blocking) @@ -265,14 +306,18 @@ std::tuple, std::vector> region.offset, enlarged); } } + counters["acquireNeighbors - receive regions and deserialize (Get_count, Recv)"] += std::chrono::duration_cast(std::chrono::high_resolution_clock::now() - start).count(); + start = std::chrono::high_resolution_clock::now(); // ensure that all sends have been finished. for (int j = 0; j < num_processes; j++) { if (numberOfRegionsToSendToRank[j] > 0) MPI_Wait(&requests[j], MPI_STATUS_IGNORE); } + counters["acquireNeighbors - synchronize (Wait)"] += std::chrono::duration_cast(std::chrono::high_resolution_clock::now() - start).count(); - // barrier for safety. + start = std::chrono::high_resolution_clock::now(); MPI_Barrier(comm); + counters["acquireNeighbors - barrier (Barrier)"] += std::chrono::duration_cast(std::chrono::high_resolution_clock::now() - start).count(); return std::make_tuple(squeezePartners(comm_partners01), squeezePartners(comm_partners02)); } diff --git a/src/parallel/NeighborAcquirer.h b/src/parallel/NeighborAcquirer.h index 0f92e35ce..2f4f6056a 100644 --- a/src/parallel/NeighborAcquirer.h +++ b/src/parallel/NeighborAcquirer.h @@ -34,6 +34,8 @@ public: static std::vector squeezePartners(const std::vector& partners); + static void dumpCounters(); + private: static bool isIncluded(HaloRegion* myRegion, HaloRegion* inQuestion); @@ -51,5 +53,7 @@ private: static std::pair, std::vector>> getPotentiallyShiftedRegions( const std::array& domainLength, const HaloRegion& region); + static std::map counters; + friend class NeighborAcquirerTest; }; diff --git a/src/parallel/NeighbourCommunicationScheme.cpp b/src/parallel/NeighbourCommunicationScheme.cpp index ab4190d75..6b7b49b53 100644 --- a/src/parallel/NeighbourCommunicationScheme.cpp +++ b/src/parallel/NeighbourCommunicationScheme.cpp @@ -52,6 +52,14 @@ NeighbourCommunicationScheme::NeighbourCommunicationScheme( } NeighbourCommunicationScheme::~NeighbourCommunicationScheme() { + int myRank{}; + MPI_Comm_rank(MPI_COMM_WORLD, &myRank); + + std::ofstream logFile{"countersNeighbourCom_r" + std::to_string(myRank)}; + for (const auto &[name, value] : counters) { + logFile << name << " = " << value << std::endl; + } + NeighborAcquirer::dumpCounters(); if (_pushPull) { delete _haloExportForceImportNeighbours; delete _haloImportForceExportNeighbours; @@ -433,8 +441,20 @@ void NeighbourCommunicationScheme::selectNeighbours(MessageType msgType, bool im void DirectNeighbourCommunicationScheme::initCommunicationPartners(double cutoffRadius, Domain * domain, DomainDecompMPIBase* domainDecomp, ParticleContainer* moleculeContainer) { + if (counters.empty()) { + counters["initArrays"] = 
0; + counters["clearVectors"] = 0; + counters["ownHalo"] = 0; + counters["getHaloSize"] = 0; + counters["getHaloImportForceExportRegions"] = 0; + counters["getLeavingExportRegions"] = 0; + counters["globalDomainLength"] = 0; + counters["acquireNeighborsHalo"] = 0; + counters["acquireNeighborsLeaving"] = 0; + } // corners of the process-specific domain static_assert(DIMgeom == 3); // The initialization here assumes 3 dimensions! + auto start = std::chrono::high_resolution_clock::now(); const std::array localLowerCorner{ domainDecomp->getBoundingBoxMin(0, domain), domainDecomp->getBoundingBoxMin(1, domain), @@ -447,37 +467,59 @@ void DirectNeighbourCommunicationScheme::initCommunicationPartners(double cutoff }; if (_pushPull) { + start = std::chrono::high_resolution_clock::now(); for (unsigned int d = 0; d < _commDimms; d++) { // why free? (*_haloExportForceImportNeighbours)[d].clear(); (*_haloImportForceExportNeighbours)[d].clear(); (*_leavingExportNeighbours)[d].clear(); (*_leavingImportNeighbours)[d].clear(); } + counters["initArrays"] += std::chrono::duration_cast(std::chrono::high_resolution_clock::now() - start).count(); + } else { for (unsigned int d = 0; d < _commDimms; d++) { // why free? (*_neighbours)[d].clear(); } } + start = std::chrono::high_resolution_clock::now(); HaloRegion ownRegion = {localLowerCorner[0], localLowerCorner[1], localLowerCorner[2], localUpperCorner[0], localUpperCorner[1], localUpperCorner[2], 0, 0, 0, cutoffRadius}; + counters["ownRegion"] += std::chrono::duration_cast(std::chrono::high_resolution_clock::now() - start).count(); if (_pushPull) { + start = std::chrono::high_resolution_clock::now(); double* const cellLength = moleculeContainer->getHaloSize(); + counters["getHaloSize"] += std::chrono::duration_cast(std::chrono::high_resolution_clock::now() - start).count(); + // halo/force regions + start = std::chrono::high_resolution_clock::now(); std::vector haloOrForceRegions = _zonalMethod->getHaloImportForceExportRegions(ownRegion, cutoffRadius, _coversWholeDomain, cellLength); + counters["getHaloImportForceExportRegions"] += std::chrono::duration_cast(std::chrono::high_resolution_clock::now() - start).count(); + + start = std::chrono::high_resolution_clock::now(); std::vector leavingRegions = _zonalMethod->getLeavingExportRegions(ownRegion, cutoffRadius, _coversWholeDomain); + counters["getLeavingExportRegions"] += std::chrono::duration_cast(std::chrono::high_resolution_clock::now() - start).count(); + start = std::chrono::high_resolution_clock::now(); const std::array globalDomainLength{domain->getGlobalLength(0), domain->getGlobalLength(1), domain->getGlobalLength(2)}; + counters["globalDomainLength"] += std::chrono::duration_cast(std::chrono::high_resolution_clock::now() - start).count(); + // assuming p1 sends regions to p2 + start = std::chrono::high_resolution_clock::now(); std::tie((*_haloImportForceExportNeighbours)[0], (*_haloExportForceImportNeighbours)[0]) = NeighborAcquirer::acquireNeighbors(globalDomainLength, &ownRegion, haloOrForceRegions, domainDecomp->getCommunicator(), _useSequentialFallback); + counters["acquireNeighborsHalo"] += std::chrono::duration_cast(std::chrono::high_resolution_clock::now() - start).count(); + // p1 notes reply, p2 notes owned as haloExportForceImport + start = std::chrono::high_resolution_clock::now(); std::tie((*_leavingExportNeighbours)[0], (*_leavingImportNeighbours)[0]) = NeighborAcquirer::acquireNeighbors( globalDomainLength, &ownRegion, leavingRegions, domainDecomp->getCommunicator(), 
_useSequentialFallback); + counters["acquireNeighborsLeaving"] += std::chrono::duration_cast(std::chrono::high_resolution_clock::now() - start).count(); + // p1 notes reply, p2 notes owned as leaving import } else { diff --git a/src/parallel/NeighbourCommunicationScheme.h b/src/parallel/NeighbourCommunicationScheme.h index a9c3d5ebc..b32171a3f 100644 --- a/src/parallel/NeighbourCommunicationScheme.h +++ b/src/parallel/NeighbourCommunicationScheme.h @@ -123,6 +123,10 @@ protected: bool _pushPull; bool _useSequentialFallback{true}; + +public: + std::map counters{}; + }; class DirectNeighbourCommunicationScheme: public NeighbourCommunicationScheme { ```

The consistently highest count is GeneralDomainDecomposition::balanceAndExchange(), and within it the call to DomainDecompMPIBase::exchangeMoleculesMPI() on the path not rebalance -> sendLeavingWithCopies() == true. This, however, is expected, since it is called in every non-rebalance iteration. The second highest is quite consistently the call to initCommPartners in the rebalance branch. When divided by the number of invocations, the call to initCommPartners is several times more expensive per call.

The suspected offender is the repeated all-to-all communication in acquireNeighbors() (especially MPI_Allgather(&num_bytes_send, 1, MPI_INT, num_bytes_receive_vec.data(), 1, MPI_INT, comm);), which is not called when using DomainDecomposition.
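To gauge how expensive that collective becomes on a given rank count, a small standalone microbenchmark (not part of ls1-mardyn; file name, payload, and repetition count are arbitrary) could time just the Allgather pattern from acquireNeighbors():

```cpp
// Standalone sketch: estimates the per-call cost of the MPI_Allgather pattern
// used in NeighborAcquirer::acquireNeighbors(), which runs on every rebalance.
// Build with e.g.: mpicxx -O2 -o allgather_bench allgather_bench.cpp
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, numProcs = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);

    constexpr int numRepetitions = 1000;  // mimic many rebalance steps
    int numBytesSend = 42;                // dummy payload: one int per rank, as in acquireNeighbors()
    std::vector<int> numBytesReceive(numProcs, 0);

    MPI_Barrier(MPI_COMM_WORLD);
    const double start = MPI_Wtime();
    for (int i = 0; i < numRepetitions; ++i) {
        MPI_Allgather(&numBytesSend, 1, MPI_INT, numBytesReceive.data(), 1, MPI_INT, MPI_COMM_WORLD);
    }
    const double elapsed = MPI_Wtime() - start;

    if (rank == 0) {
        std::printf("avg MPI_Allgather time over %d reps on %d ranks: %g s\n",
                    numRepetitions, numProcs, elapsed / numRepetitions);
    }
    MPI_Finalize();
    return 0;
}
```

Run with the same node/task layout as the simulation (e.g. mpirun -n 16 ./allgather_bench) to see how the cost grows with the communicator size.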