plumed / plumed2

Development version of plumed 2
https://www.plumed.org
GNU Lesser General Public License v3.0

Possible slow-down in htt (setting values in MPI code) #1025

Closed GiovanniBussi closed 4 months ago

GiovanniBussi commented 4 months ago

@gtribello this is another possible slow-down. For this I don't have timings, but I can produce them.

I am confused by the following code in DomainDecomposition:

  dd.Allgatherv(&dd.indexToBeSent[0],count,&dd.indexToBeReceived[0],&counts[0],&displ[0]);
  dd.Allgatherv(&dd.positionsToBeSent[0],ndata*count,&dd.positionsToBeReceived[0],&counts5[0],&displ5[0]);
  int tot=displ[n-1]+counts[n-1];
  for(int i=0; i<tot; i++) {
    int dpoint=0;
    for(unsigned j=0; j<values_to_get.size(); ++j) {
      values_to_get[j]->set( dd.indexToBeReceived[i], dd.positionsToBeReceived[ndata*i+dpoint] );
      dpoint++;
    }
  }

It looks like there is a large loop (over all shared atoms) in which we call a quite expensive function, void set(const std::size_t& n, const double& v), which is not inlined and contains quite a few branches:

void Value::set(const std::size_t& n, const double& v ) {
  value_set=true;
  if( getRank()==0 ) { plumed_assert( n==0 ); data[n]=v; applyPeriodicity(n); }
  else if( !hasDeriv ) { plumed_dbg_massert( n<data.size(), "failing in " + getName() ); data[n]=v; applyPeriodicity(n); }
  else { data[n*(1+action->getNumberOfDerivatives())] = v; }
} 

I think this is suboptimal (and might affect the MPI timings, which we stopped measuring at some point). Is there a way to avoid this sequence of calls?
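
For what it's worth, here is a minimal, self-contained illustration of why the per-element calls are costly: the branch taken inside set() depends only on the Value, not on the loop index, so it is re-evaluated for every atom and every component. The types and function names below are toys for illustration, not the PLUMED classes:

#include <cstddef>
#include <vector>

// Toy type only: the branch chosen by set() depends on the Value, not on the
// loop index, so the same checks are repeated tot*ndata times in the receive
// loop quoted above.
struct ToyValue {
  std::vector<double> data;
  bool hasDeriv=false;
  void set(std::size_t n, double v) {   // per-element setter, branch on every call
    if(!hasDeriv) data[n]=v;
    else data[n*2]=v;                   // placeholder stride for the derivative layout
  }
};

// Pattern as in the quoted loop: one non-inlined call per element.
void fill_per_element(ToyValue& val, const std::vector<int>& idx,
                      const std::vector<double>& buf, std::size_t ndata, std::size_t j) {
  for(std::size_t i=0; i<idx.size(); i++) val.set(idx[i], buf[ndata*i+j]);
}

// Same work with the per-Value decision hoisted out: a plain loop the compiler
// can inline and vectorize.
void fill_hoisted(ToyValue& val, const std::vector<int>& idx,
                  const std::vector<double>& buf, std::size_t ndata, std::size_t j) {
  const std::size_t stride = val.hasDeriv ? 2 : 1;   // decided once
  double* out = val.data.data();
  for(std::size_t i=0; i<idx.size(); i++) out[idx[i]*stride] = buf[ndata*i+j];
}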

I also do not understand where the equivalent call is in the serial version. For comparison, in the old code we were doing something like this:

        positions[dd.indexToBeReceived[i]][0]=dd.positionsToBeReceived[ndata*i+0];
        positions[dd.indexToBeReceived[i]][1]=dd.positionsToBeReceived[ndata*i+1];
        positions[dd.indexToBeReceived[i]][2]=dd.positionsToBeReceived[ndata*i+2];

which is just accessing the elements of an array.

GiovanniBussi commented 4 months ago

@gtribello I found where this is done in the serial code. Basically, in the serial code we set the positions using a standard C++ loop in (ip->mydata)->share_data.

So, if I understand correctly, the serial path writes the positions directly with a plain loop, while the MPI path goes through one Value::set call per atom per component.

I would suggest fixing this, perhaps by adding functionality in Value that allows setting multiple (all?) elements of the vector with a single call. Then it would not even be necessary to inline it.
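
A rough sketch of what such an interface could look like, assuming a hypothetical bulk setter (here called setMany) that writes a strided block of received data in one call; the names and layout below are assumptions, not the existing Value API:

#include <cstddef>
#include <vector>

// Hypothetical sketch of a Value-like class with a bulk setter, so the MPI
// receive path makes one call per component instead of one call per element.
class ToyValue {
  std::vector<double> data;
public:
  explicit ToyValue(std::size_t n) : data(n) {}
  // Set data[index[i]] = buffer[stride*i+offset] for i in [0,count).
  // Periodicity/derivative handling of the real Value::set would be decided
  // once per call rather than once per element.
  void setMany(std::size_t count, const int* index, const double* buffer,
               std::size_t stride, std::size_t offset) {
    for(std::size_t i=0; i<count; i++) data[index[i]] = buffer[stride*i+offset];
  }
};

// The receive loop would then collapse to one call per value, e.g. (pseudo-code):
//   for(unsigned j=0; j<values_to_get.size(); ++j)
//     values_to_get[j]->setMany(tot, &dd.indexToBeReceived[0], &dd.positionsToBeReceived[0], ndata, j);

Because the element loop then lives inside a single member function, it can be optimized there (inlined, vectorized) without having to inline the setter at every call site.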

I will try to produce benchmarks with MPI, but to me it is quite evident that the current implementation is slowing things down.

gtribello commented 4 months ago

@GiovanniBussi I am looking into this issue and the other one you raised at the moment. I think I might have a solution to this one, and I hope to have something to share by the end of the morning. I am telling you this so that we are not working on the same thing at the same time.

GiovanniBussi commented 4 months ago

Closed by #1027