libMesh / libmesh

libMesh github repository
http://libmesh.github.io
GNU Lesser General Public License v2.1
642 stars 285 forks source link

Allow for alternate communicators in parallel objects #71

Closed benkirk closed 11 years ago

benkirk commented 11 years ago

OK, the time has come. In the subcell integration stuff I'm working I'd like to be able to do

  SerialMesh my_local_mesh(MPI_COMM_SELF);

This is a necessary step for being able to split an MPI communicator and execute physics A on one portion of the parallel resource and physics B on another.

My plan of attack is to begin with t he Mesh classes, since that is what I need now, and then propagate to the EquationSystems later.

Basically, the Mesh will have an optional constructor argument that takes a reference to a Parallel::Communicator, defaulting to CommWorld.

Once the mesh is done the EquationSystems will inherit its communicator from the Mesh, so that will be easy, just tedious.

@roystgnr, any other pitfalls I'm not thinking about?

In order to keep things like parallel_only() simple I may need to impose a standard naming convention, like have class_parallel_only() expand to

   #define class_parallel_only() do { \
    libmesh_assert(this->communicator().verify(std::string(__FILE__).size())); \
    libmesh_assert(this->communicator().verify(std::string(__FILE__))); \
    libmesh_assert(this->communicator().verify(__LINE__)); } while (0)

So here I assume that any class with a communicator object has a method communicator() that returns that object...

friedmud commented 11 years ago

lol - oh god!

I just got all of this working in my own code... the thought of having to change all of that.... sigh. ;-)

Hopefully my hacks for swapping out the libMesh::CommWorld communicator will continue to work while you do this work ;-)

benkirk commented 11 years ago

Thanks for reminding me - I think it can if that is a design goal.

Basically each class will need a pointer to its own communicator rather than a reference if it is to be switched behind the scenes... Hmm...

-Ben

On Apr 2, 2013, at 11:20 AM, "Derek Gaston" notifications@github.com<mailto:notifications@github.com> wrote:

lol - oh god!

I just got all of this working in my own code... the thought of having to change all of that.... sigh. ;-)

Hopefully my hacks for swapping out the libMesh::CommWorld communicator will continue to work while you do this work ;-)

— Reply to this email directly or view it on GitHubhttps://github.com/libMesh/libmesh/issues/71#issuecomment-15785715.

roystgnr commented 11 years ago

Is it a design goal to allow each class to switch its communicator behind the scenes? Derek needed to swap CommWorld in and out as a workaround to "fake" the effect of having different per-object communicators; once we actually have real per-object communicators I can't see under what conditions we'd want to swap them in and out.

benkirk commented 11 years ago

@roystgnr, what are your thoughts about Parallel::Communicator references vs. Communicator::duplicate() everywhere?

For example, ultimately should the EquationSystems store a reference to it's Mesh's communicator, or should it store its own that is simply setup via Communicator::duplicate()?

I can see benefits to duplicating - the integer space for message tags is then disjoint, but I don't have a feel for the cost any other downsides...

roystgnr commented 11 years ago

I don't even see benefits to duplicating unless we're writing dangrously asynchronous code - if we have any synchronized points at all in an algorithm we ought to be able to use them to get_unique_tag() rather than tagging with magic numbers in the first place.

On the other hand, I don't see any downsides to duplicating either. AFAIK MPI communicators aren't supposed to have much overhead, and we don't add much more ourselves.

roystgnr commented 11 years ago

My vote would be to store references rather than to duplicate. I'm happy either way if others disagree.

friedmud commented 11 years ago

Well, Jed Brown posted something about this on Google+ a while back:

"V1R2M0 makes MPI_Comm_dup use O(P) memory on each rank. Apparently the regression was caused by fixing something else in PAMI for POWER7... This is yet another reason for responsible management of MPI communicators."

Personally, I would say just have the EquationSystems object reference the MPI Comm from the Mesh. That way calling communicator() will return exactly the same object for all objects associated with that Mesh/ES.

benkirk commented 11 years ago

I'll go with references as a start then, easy enough to refactor later if needed.

benkirk commented 11 years ago

I've started down the implementation. This is useful for scoping the problem:

grep 'CommWorld\|libMesh::n_processors\|libMesh::processor_id' src/*/*.C include/*/*.h

I'm implementing a ParallelObject base class, which is going to make this very clean:

  class ParallelObject
  {
  public:

    /**
     * Constructor. Requires a reference to the communicator
     * that defines the object's parallel decomposition.
     */
    ParallelObject (const Parallel::Communicator &comm) :
      _communicator(comm)
    {}

    /**
     * Copy Constructor.
     */
    ParallelObject (const ParallelObject &other) :
      _communicator(other._communicator)
    {}

    /**
     * Destructor.  Virtual because we are a base class.
     */
    virtual ~ParallelObject () {};

    /**
     * @returns a reference to the \p Parallel::Communicator object
     * used by this mesh.
     */
    const Parallel::Communicator & communicator () const
    { return _communicator; }

    /**
     * @returns the number of processors used in the
     * current simulation.
     */
    processor_id_type n_processors () const
    { return libmesh_cast_int<processor_id_type>(_communicator.size()); }

    /**
     * @returns the subdomain id for this processor.
     */
    processor_id_type processor_id () const
    { return libmesh_cast_int<processor_id_type>(_communicator.rank()); }

  protected:

    const Parallel::Communicator &_communicator;
  };

The trick then is Mesh, DofMap, EquationSystems, etc... are all ParallelObjects and share the same communicator by construction.

So you can construct a mesh with its own communicator, and that gets propagated all the way down.

Then mostly a bunch of replacing:

  CommWorld                     --> this->communicator()
  libMesh::n_processors() --> this->n_processors()
  libMesh::processor_id()  --> this->processor_id()
roystgnr commented 11 years ago

This looks good provisionally - we're not going to get any diamond inheritance patterns out of this plan, right?

benkirk commented 11 years ago

Doesn't look like it.

benkirk commented 11 years ago

I'm working on this at https://github.com/benkirk/libmesh/tree/communicators if anyone is interested. Pretty tedious but straightforwad.

I've done the mesh and equation system stuff, next major hurtle will be linear algebra data structures.

friedmud commented 11 years ago

Ok - I'm going to jump in here. I'll see what I can get done tonight...

Derek

benkirk commented 11 years ago

Thanks, but no biggie if you want to hold off, I can probably wrap it up tomorrow...

On Apr 3, 2013, at 7:00 PM, "Derek Gaston" notifications@github.com<mailto:notifications@github.com> wrote:

Ok - I'm going to jump in here. I'll see what I can get done tonight...

Derek

— Reply to this email directly or view it on GitHubhttps://github.com/libMesh/libmesh/issues/71#issuecomment-15872966.

friedmud commented 11 years ago

As long as you're not working on it tonight (so we're not doubling up) - I'll continue the work tonight and then give you a pull request before I go to sleep.

benkirk commented 11 years ago

Awesome - I haven't touched any of the parallel linear algebra stuff, so in particular NumericVector and SparseMatrix being derived from ParallelObject would be great.

On Apr 3, 2013, at 7:05 PM, "Derek Gaston" notifications@github.com<mailto:notifications@github.com> wrote:

As long as you're not working on it tonight (so we're not doubling up) - I'll continue the work tonight and then give you a pull request before I go to sleep.

— Reply to this email directly or view it on GitHubhttps://github.com/libMesh/libmesh/issues/71#issuecomment-15873112.

benkirk commented 11 years ago

So it may look a little weird, but when NumericVector is derived from ParallelObject it's constructor will need either a parent ParallelObject or a Communicator.

That's where you provide a reference to the containing class (e.g. System) to make sure they share the same underlying communicator.

Thanks!

-Ben

On Apr 3, 2013, at 7:05 PM, "Derek Gaston" notifications@github.com<mailto:notifications@github.com> wrote:

As long as you're not working on it tonight (so we're not doubling up) - I'll continue the work tonight and then give you a pull request before I go to sleep.

— Reply to this email directly or view it on GitHubhttps://github.com/libMesh/libmesh/issues/71#issuecomment-15873112.

friedmud commented 11 years ago

Yep - I got it figured out. I put in a Pull Request for you (on your Fork).

I only got NumericVectors worked out... damn it's fiddly. It's tough to ferret out all the places where these objects get built and make sure a communicator is getting passed in!

One thing that's not great with my patches is the way I deal with PetscVector::PetscVector(Vec) and EpetraVector::EpetraVector(Epetra_Vector)... those constructors that take native objects and create libMesh objects out of them.

You can technically get the MPI_Comm from both Petsc Vecs and Epetra_Vectors... but I couldn't see a great clean way to get those out and create a Communicator from them on the fly in the constructor (to pass down to NumericVector and ultimately to ParallelObject). I'm sure it can be done (you could do it with a function call for instance... but I didn't know about how the memory would be handled)... but for now I just made them take a Communicator just like the other constructors. Which means you have to be careful when you're creating a libMesh NumericVector from a native object....

Speaking of that, there are two places where we do just that that I couldn't fix up: callbacks for ShellMatrix and Preconditioners. Those aren't done yet because those objects need to be turned into ParallelObjects so we can get the Communicator from them to pass into the NumericVector (see the above paragraph ;-). I did put commented out stubs in there to remind us that that needs to be done.

Anyway... hopefully I didn't hork it up too much. Sorry it wasn't more... it took a bit longer than I thought ;-)

I can do Matrices tomorrow morning if you want. Just let me know.

friedmud commented 11 years ago

One more thing... what do you think about being able to do:

./configure --disable-default-communicator

??

It would definitely be useful in tracking down all the places in an existing code where we're accidentally passing in CommWorld and we don't mean to.... not too mention all the places in libMesh itself! ;-)

benkirk commented 11 years ago

I was thinking about that... My only worry is all the places we are using it as a default constructor argument - we'll have to change those lines of code conditionally.

Would you then propose something like

#define LIBMESH_DEFAULT_COMM_OBJECT =libMesh::CommWorld

...

explicit
  NumericVector (const ParallelType ptype = AUTOMATIC,
                 const Parallel::Communicator &comm LIBMESH_DEFAULT_COMM_OBJECT);

Should be doable. The other bigge will be fixing parallel_only() to take a communicator.

roystgnr commented 11 years ago

Let's add a new libmesh_parallel_only() which takes a communicator? That'll help avoid name clashes in the future while still preserving backwards compatibility for the original macro name for a while.

For testing purposes, maybe we could optionally initialize CommWorld to be a Communicator subclass that can be duplicated properly but that

screams and dies if anyone tries to use it for anything else??

Roy

benkirk commented 11 years ago

On Apr 4, 2013, at 8:58 AM, roystgnr notifications@github.com wrote:

Let's add a new libmesh_parallel_only() which takes a communicator? That'll help avoid name clashes in the future while still preserving backwards compatibility for the original macro name for a while.

Sounds good. Also, anything that is a ParallelObject should have

parallel_object_only();

which relies on this->communicator() working, but I haven't started the replacement yet.

For testing purposes, maybe we could optionally initialize CommWorld to be a Communicator subclass that can be duplicated properly but that screams and dies if anyone tries to use it for anything else??

Hmm… something like RemoteElem? Good idea.

In that case ParallelObject would need virtual functions, which is fine but I'd say since it is so short we should make two implementations depending on this option, lest this->communicator() and this->processor_id() (which are everywhere) become virtual function calls all the time.

-Ben

friedmud commented 11 years ago

I was thinking something simpler for --disable-default-comm

Just make libMesh::defaultComm(). It normally will return libMesh::CommWorld but if you use --disable-default-comm then it just throws an error.

That way all of the constructors will look like:

Object(Stuff stuff, const Parallel::Communicator &comm = libMesh::defaultComm())

That's not too fiddly. Sure, it doesn't give us compile time errors - but in general we won't be able to do that anyway (because we are adding "comm" as the last argument to a lot of functions that already have default arguments... so we can't just omit the default for comm).

At least this way we won't ever silently be getting a CommWorld slipped in when we're trying to work on a sub-communicator.

benkirk commented 11 years ago

Ahh, that makes a lot of sense. I was trying to figure out the best way to comment out libMesh::CommWorld for testing purposes!

BTW, I'm well in to the matrix stuff ATM, and thanks for last night!

(uhh....)

friedmud commented 11 years ago

lol - anytime ;-)

roystgnr commented 11 years ago

Do we really need ParallelObject to wrap n_processors() etc.? Why not just make the member reference a bit terser? "this->comm->n_processors()" isn't too much worse than "libMesh::n_processors()", and we're going to need to be doing "this->comm->send()" etc. anyway so we might as well give that idiom as much brevity as the stupid C++ dependent-name-lookup rules allow.

friedmud commented 11 years ago

I actually agree Roy. In MOOSE I believe I'm going to name the member _comm... so that our users can do _comm->send(), etc.

this->comm->n_processors() is not bad at all. But for some reason this->communicator()->send() feels like it's too long ;-)

roystgnr commented 11 years ago

I've come up with about three different alternatives now to Derek's "libMesh::defaultComm()" idea, none of which are as good as his, which is evidence that he's found the right way to proceed. But there's still something that bothers me: if we don't do something to make "CommWorld.whatever()" unsafe to use, then our --disable-default-comm option will only be debugging constructions with missing communicator arguments, and won't also be debugging inappropriate CommWorld uses. Granted, we can hunt for inappropriate CommWorld uses with "grep" too, but it would be nice to have something that's more natural to toss onto our list of regression test configurations and that automatically checks application-level code too.

friedmud commented 11 years ago

If we implement libMesh::default_comm() then when we use --disable-default-comm we can comment out libMesh::CommWorld all together.

We should never be calling libMesh::CommWorld anywhere in the library anymore... everything should use libMesh::default_comm().

benkirk commented 11 years ago

No preference.

But ATM it would have to be comm().rank() and comm.size()

Personally I like the clarity of n_processors() and processor_id(), but that's just a preference.

-Ben

On Apr 4, 2013, at 10:22 AM, "Derek Gaston" notifications@github.com<mailto:notifications@github.com> wrote:

I actually agree Roy. In MOOSE I believe I'm going to name the member _comm... so that our users can do _comm->send(), etc.

this->comm->n_processors() is not bad at all. But for some reason this->communicator()->send() feels like it's too long ;-)

— Reply to this email directly or view it on GitHubhttps://github.com/libMesh/libmesh/issues/71#issuecomment-15903840.

friedmud commented 11 years ago

BTW - I'm not against leaving n_processors() and processor_id() where they are in ParallelObject...

pbauman commented 11 years ago

On Thu, Apr 4, 2013 at 10:36 AM, Benjamin S. Kirk notifications@github.comwrote:

Personally I like the clarity of n_processors() and processor_id(), but that's just a preference.

FWIW, +1.

roystgnr commented 11 years ago

I think "communicator methods all use the communicator" is clearer than "some communicator methods use the communicator, others don't". But that's not a strong preference, and anyway it sounds like I'm outvoted.

roystgnr commented 11 years ago

If everything should use libMesh::default_comm(), then we almost might as well rename "default_comm()" to "CommWorld", no? If we make its --disable-default-comm implementation be some class with an implicit conversion-to-Communicator operator (which throws an error) then we don't even need parentheses or virtual functions to use it as a default argument to methods expecting a Communicator&.

friedmud commented 11 years ago

Sounds fine to me!

friedmud commented 11 years ago

Actually - no it doesn't. Because you wouldn't get the error until you try to use it. I want the error to happen right when we try to assign it.

roystgnr commented 11 years ago

The one catch is that we then need some way for standard only-playing-with-the-world-communicator applications to initialize their stuff; "Mesh(CommWorld);" would break. Is LibMeshInit a ParallelObject yet? If so then we could just use the same API to grab its communicator.

roystgnr commented 11 years ago

The implicit conversion-to-Communicator operator would be called on assignment; it would throw a run-time error when that happened. You'd get a compile-time error for any other attempt to use CommWorld as a communicator.

friedmud commented 11 years ago

Hmmm.... as long as the stack trace is clean and shows where the problem is... I'm ok with that.

friedmud commented 11 years ago

Why would Mesh(CommWorld) fail? It would only fail in the case that you use --disable-default-comm right? That's what we want, right?

jwpeterson commented 11 years ago

Testing to see if "at-tags" will help issue emails make it through to @libmesh-devel

roystgnr commented 11 years ago

I'll put together a "FakeCommunicator" class implementation sometime today or tonight, then, and we can try it out. The stack trace should work just fine.

We do want Mesh(CommWorld) to fail in the --disable-default-comm case, and we also want Mesh() to fail... but we want some way to create a mesh without failing. My proposal was "Mesh(libmeshinit.communicator());" Expresses what we'd want to do when creating a mesh in main(), but because LibMeshInit objects usually aren't in scope anywhere else it'd be hard for app codes to "abuse" and thereby inadvertently make themselves less extensible to the multi-communicator case.

roystgnr commented 11 years ago

What's the proper git procedure for this? Create a roystgnr/libmesh/fake_comm branch, regularly "pull --rebase" from benkirk/libmesh/communicators, then set up a pull request back to benkirk/libmesh/communicators when I'm done?

libmesh-devel commented 11 years ago

On Apr 4, 2013, at 11:07 AM, roystgnr notifications@github.com wrote:

What's the proper git procedure for this? Create a roystgnr/libmesh/fake_comm branch, regularly "pull --rebase" from benkirk/libmesh/communicators, then set up a pull request back to benkirk/libmesh/communicators when I'm done?

Yes, I think so. Pretty sure Derek forked from my branch and then sent me a pull request when it was done?

friedmud commented 11 years ago

I didn't fork your repo (in the Github sense).

In my local repo that is cloned from my libMesh fork I added Ben's libMesh fork as a remote:

git remote add ben git@github.com:benkirk/libmesh.git

Then fetched to get Ben's branches:

git fetch ben

Then created a local tracking branch of his communicators branch:

git co -tb communicators ben/communicators

Then you can push that branch to your own libMesh fork repo on Github by doing:

git push origin communicators

(Note: That assumes that your clone was made from your own libMesh fork... so "origin" here is referring to your own libMesh fork on Github)

Then issue a pull request to Ben's repo and branch here:

https://github.com/libMesh/libmesh/pull/new/benkirk:communicators...master

The only weirdness there is that you have to go to the main libMesh/libmesh repo page and hit "Pull Request" to do a pull request to Ben's fork of the libmesh repo. There is no "Pull Request" button on Ben's repo because you never technically forked Ben's repo. But, the pull request page for the main repo allows you to select that you want to do a pull request to any of the forked repos (including Ben's).

One last thing.... I never issue pull requests for branches I'm actively working in. So what I actually did, is when I was ready to give Ben back my changes I made another branch from my local tracking branch and actually pushed that to GitHub instead. Like so:

/// Currently in my "communicators" branch that I made like above git co -b new_stuff_for_ben git push origin new_stuff_for_ben

Then issue the pull request from new_stuff_for_ben to Ben's communicators branch.

friedmud commented 11 years ago

@roystgnr I was actually thinking that anyone working with --disable-default-comm would actually create their own Communicator object from MPI_COMM_WORLD (or whatever they want to use) and pass that in to Mesh. ie they would literally do that in main()

I don't think we want to do anything for you if you use --disable-default-comm.... that should be a very pedantic mode.

BTW - I'm planning to actually use that mode as the default for MOOSE. MOOSE now relies on the ability to run on sub-communicators.... so we never want to accidentally pick up COMM_WORLD.

benkirk commented 11 years ago

On Apr 4, 2013, at 11:50 AM, Derek Gaston notifications@github.com wrote:

I didn't fork your repo (in the Github sense).

In my local repo that is cloned from my libMesh fork I added Ben's libMesh fork as a remote:

If y'all would prefer I can instead push my branch to libMesh instead of my own space so we can all write to it, but that seems pretty svn'ish of me...

friedmud commented 11 years ago

Nah - this is really the right way to work with it. This is the whole point of "distributed" VC...

benkirk commented 11 years ago

On Apr 4, 2013, at 12:38 PM, Derek Gaston notifications@github.com wrote:

Nah - this is really the right way to work with it. This is the whole point of "distributed" VC...

Back to the whole DefaultCommunicator business, though -

If we do it right, shouldn't we be able to remove the default constructor argument from everywhere except Mesh, and it should all work by design, right? The constant reference inside each ParallelObject is forced to be initialized by the containing object, and that unwinds all the way to the top mesh.

So the derived meshes (Serial, Parallel) are the only place where we refer to any sort of default communicator, and the rest falls through. Ok, and the e.g.

NumericVector<>::build() functions as well, those are the only places we should need to provide a default communicator, and only for backwards compatibility.

Basically I'm just verbalizing that right now there are way more instances of

(const libMesh::Parallel::Communicator& = libMesh::CommWorld )

than are actually required, thereby increasing the likelihood for mistakes.

-Ben

roystgnr commented 11 years ago

Email reply on this ticket just broke for me... is John still futzing with things?

Pasting to the web form:

Ben, your claim intrigues and confuses me.

To be able to use multiple communicators, we need to be able to pass non-default communicators in to ParallelObjects, except for ParallelObjects whose constructors already all require another ParallelObject subclass (e.g. EquationSystems always gets built with a Mesh) or whose parallel_only() methods already all require a ParallelObject subclass or something with a handle to one (e.g. EstimateError subclasses always get handed a System to work with)

To keep this from breaking old code, this argument has to have a default value.

What am I missing?

Am I just agreeing with you unintentionally, because it turns out that most of our classes do get passed other ParallelObjects before they do any parallel work?