Closed benkirk closed 11 years ago
lol - oh god!
I just got all of this working in my own code... the thought of having to change all of that.... sigh. ;-)
Hopefully my hacks for swapping out the libMesh::CommWorld communicator will continue to work while you do this work ;-)
Thanks for reminding me - I think it can if that is a design goal.
Basically each class will need a pointer to its own communicator rather than a reference if it is to be switched behind the scenes... Hmm...
-Ben
On Apr 2, 2013, at 11:20 AM, "Derek Gaston" notifications@github.com<mailto:notifications@github.com> wrote:
lol - oh god!
I just got all of this working in my own code... the thought of having to change all of that.... sigh. ;-)
Hopefully my hacks for swapping out the libMesh::CommWorld communicator will continue to work while you do this work ;-)
— Reply to this email directly or view it on GitHubhttps://github.com/libMesh/libmesh/issues/71#issuecomment-15785715.
Is it a design goal to allow each class to switch its communicator behind the scenes? Derek needed to swap CommWorld in and out as a workaround to "fake" the effect of having different per-object communicators; once we actually have real per-object communicators I can't see under what conditions we'd want to swap them in and out.
@roystgnr, what are your thoughts about Parallel::Communicator references vs. Communicator::duplicate() everywhere?
For example, ultimately should the EquationSystems store a reference to it's Mesh's communicator, or should it store its own that is simply setup via Communicator::duplicate()?
I can see benefits to duplicating - the integer space for message tags is then disjoint, but I don't have a feel for the cost any other downsides...
I don't even see benefits to duplicating unless we're writing dangrously asynchronous code - if we have any synchronized points at all in an algorithm we ought to be able to use them to get_unique_tag() rather than tagging with magic numbers in the first place.
On the other hand, I don't see any downsides to duplicating either. AFAIK MPI communicators aren't supposed to have much overhead, and we don't add much more ourselves.
My vote would be to store references rather than to duplicate. I'm happy either way if others disagree.
Well, Jed Brown posted something about this on Google+ a while back:
"V1R2M0 makes MPI_Comm_dup use O(P) memory on each rank. Apparently the regression was caused by fixing something else in PAMI for POWER7... This is yet another reason for responsible management of MPI communicators."
Personally, I would say just have the EquationSystems object reference the MPI Comm from the Mesh. That way calling communicator() will return exactly the same object for all objects associated with that Mesh/ES.
I'll go with references as a start then, easy enough to refactor later if needed.
I've started down the implementation. This is useful for scoping the problem:
grep 'CommWorld\|libMesh::n_processors\|libMesh::processor_id' src/*/*.C include/*/*.h
I'm implementing a ParallelObject base class, which is going to make this very clean:
class ParallelObject
{
public:
/**
* Constructor. Requires a reference to the communicator
* that defines the object's parallel decomposition.
*/
ParallelObject (const Parallel::Communicator &comm) :
_communicator(comm)
{}
/**
* Copy Constructor.
*/
ParallelObject (const ParallelObject &other) :
_communicator(other._communicator)
{}
/**
* Destructor. Virtual because we are a base class.
*/
virtual ~ParallelObject () {};
/**
* @returns a reference to the \p Parallel::Communicator object
* used by this mesh.
*/
const Parallel::Communicator & communicator () const
{ return _communicator; }
/**
* @returns the number of processors used in the
* current simulation.
*/
processor_id_type n_processors () const
{ return libmesh_cast_int<processor_id_type>(_communicator.size()); }
/**
* @returns the subdomain id for this processor.
*/
processor_id_type processor_id () const
{ return libmesh_cast_int<processor_id_type>(_communicator.rank()); }
protected:
const Parallel::Communicator &_communicator;
};
The trick then is Mesh, DofMap, EquationSystems, etc... are all ParallelObjects and share the same communicator by construction.
So you can construct a mesh with its own communicator, and that gets propagated all the way down.
Then mostly a bunch of replacing:
CommWorld --> this->communicator()
libMesh::n_processors() --> this->n_processors()
libMesh::processor_id() --> this->processor_id()
This looks good provisionally - we're not going to get any diamond inheritance patterns out of this plan, right?
Doesn't look like it.
I'm working on this at https://github.com/benkirk/libmesh/tree/communicators if anyone is interested. Pretty tedious but straightforwad.
I've done the mesh and equation system stuff, next major hurtle will be linear algebra data structures.
Ok - I'm going to jump in here. I'll see what I can get done tonight...
Derek
Thanks, but no biggie if you want to hold off, I can probably wrap it up tomorrow...
On Apr 3, 2013, at 7:00 PM, "Derek Gaston" notifications@github.com<mailto:notifications@github.com> wrote:
Ok - I'm going to jump in here. I'll see what I can get done tonight...
Derek
— Reply to this email directly or view it on GitHubhttps://github.com/libMesh/libmesh/issues/71#issuecomment-15872966.
As long as you're not working on it tonight (so we're not doubling up) - I'll continue the work tonight and then give you a pull request before I go to sleep.
Awesome - I haven't touched any of the parallel linear algebra stuff, so in particular NumericVector and SparseMatrix being derived from ParallelObject would be great.
On Apr 3, 2013, at 7:05 PM, "Derek Gaston" notifications@github.com<mailto:notifications@github.com> wrote:
As long as you're not working on it tonight (so we're not doubling up) - I'll continue the work tonight and then give you a pull request before I go to sleep.
— Reply to this email directly or view it on GitHubhttps://github.com/libMesh/libmesh/issues/71#issuecomment-15873112.
So it may look a little weird, but when NumericVector is derived from ParallelObject it's constructor will need either a parent ParallelObject or a Communicator.
That's where you provide a reference to the containing class (e.g. System) to make sure they share the same underlying communicator.
Thanks!
-Ben
On Apr 3, 2013, at 7:05 PM, "Derek Gaston" notifications@github.com<mailto:notifications@github.com> wrote:
As long as you're not working on it tonight (so we're not doubling up) - I'll continue the work tonight and then give you a pull request before I go to sleep.
— Reply to this email directly or view it on GitHubhttps://github.com/libMesh/libmesh/issues/71#issuecomment-15873112.
Yep - I got it figured out. I put in a Pull Request for you (on your Fork).
I only got NumericVectors worked out... damn it's fiddly. It's tough to ferret out all the places where these objects get built and make sure a communicator is getting passed in!
One thing that's not great with my patches is the way I deal with PetscVector::PetscVector(Vec) and EpetraVector::EpetraVector(Epetra_Vector)... those constructors that take native objects and create libMesh objects out of them.
You can technically get the MPI_Comm from both Petsc Vecs and Epetra_Vectors... but I couldn't see a great clean way to get those out and create a Communicator from them on the fly in the constructor (to pass down to NumericVector and ultimately to ParallelObject). I'm sure it can be done (you could do it with a function call for instance... but I didn't know about how the memory would be handled)... but for now I just made them take a Communicator just like the other constructors. Which means you have to be careful when you're creating a libMesh NumericVector from a native object....
Speaking of that, there are two places where we do just that that I couldn't fix up: callbacks for ShellMatrix and Preconditioners. Those aren't done yet because those objects need to be turned into ParallelObjects so we can get the Communicator from them to pass into the NumericVector (see the above paragraph ;-). I did put commented out stubs in there to remind us that that needs to be done.
Anyway... hopefully I didn't hork it up too much. Sorry it wasn't more... it took a bit longer than I thought ;-)
I can do Matrices tomorrow morning if you want. Just let me know.
One more thing... what do you think about being able to do:
./configure --disable-default-communicator
??
It would definitely be useful in tracking down all the places in an existing code where we're accidentally passing in CommWorld and we don't mean to.... not too mention all the places in libMesh itself! ;-)
I was thinking about that... My only worry is all the places we are using it as a default constructor argument - we'll have to change those lines of code conditionally.
Would you then propose something like
#define LIBMESH_DEFAULT_COMM_OBJECT =libMesh::CommWorld
...
explicit
NumericVector (const ParallelType ptype = AUTOMATIC,
const Parallel::Communicator &comm LIBMESH_DEFAULT_COMM_OBJECT);
Should be doable. The other bigge will be fixing parallel_only() to take a communicator.
Let's add a new libmesh_parallel_only() which takes a communicator? That'll help avoid name clashes in the future while still preserving backwards compatibility for the original macro name for a while.
For testing purposes, maybe we could optionally initialize CommWorld to be a Communicator subclass that can be duplicated properly but that
Roy
On Apr 4, 2013, at 8:58 AM, roystgnr notifications@github.com wrote:
Let's add a new libmesh_parallel_only() which takes a communicator? That'll help avoid name clashes in the future while still preserving backwards compatibility for the original macro name for a while.
Sounds good. Also, anything that is a ParallelObject should have
parallel_object_only();
which relies on this->communicator() working, but I haven't started the replacement yet.
For testing purposes, maybe we could optionally initialize CommWorld to be a Communicator subclass that can be duplicated properly but that screams and dies if anyone tries to use it for anything else??
Hmm… something like RemoteElem? Good idea.
In that case ParallelObject would need virtual functions, which is fine but I'd say since it is so short we should make two implementations depending on this option, lest this->communicator() and this->processor_id() (which are everywhere) become virtual function calls all the time.
-Ben
I was thinking something simpler for --disable-default-comm
Just make libMesh::defaultComm(). It normally will return libMesh::CommWorld but if you use --disable-default-comm then it just throws an error.
That way all of the constructors will look like:
Object(Stuff stuff, const Parallel::Communicator &comm = libMesh::defaultComm())
That's not too fiddly. Sure, it doesn't give us compile time errors - but in general we won't be able to do that anyway (because we are adding "comm" as the last argument to a lot of functions that already have default arguments... so we can't just omit the default for comm).
At least this way we won't ever silently be getting a CommWorld slipped in when we're trying to work on a sub-communicator.
Ahh, that makes a lot of sense. I was trying to figure out the best way to comment out libMesh::CommWorld for testing purposes!
BTW, I'm well in to the matrix stuff ATM, and thanks for last night!
(uhh....)
lol - anytime ;-)
Do we really need ParallelObject to wrap n_processors() etc.? Why not just make the member reference a bit terser? "this->comm->n_processors()" isn't too much worse than "libMesh::n_processors()", and we're going to need to be doing "this->comm->send()" etc. anyway so we might as well give that idiom as much brevity as the stupid C++ dependent-name-lookup rules allow.
I actually agree Roy. In MOOSE I believe I'm going to name the member _comm... so that our users can do _comm->send(), etc.
this->comm->n_processors() is not bad at all. But for some reason this->communicator()->send() feels like it's too long ;-)
I've come up with about three different alternatives now to Derek's "libMesh::defaultComm()" idea, none of which are as good as his, which is evidence that he's found the right way to proceed. But there's still something that bothers me: if we don't do something to make "CommWorld.whatever()" unsafe to use, then our --disable-default-comm option will only be debugging constructions with missing communicator arguments, and won't also be debugging inappropriate CommWorld uses. Granted, we can hunt for inappropriate CommWorld uses with "grep" too, but it would be nice to have something that's more natural to toss onto our list of regression test configurations and that automatically checks application-level code too.
If we implement libMesh::default_comm() then when we use --disable-default-comm we can comment out libMesh::CommWorld all together.
We should never be calling libMesh::CommWorld anywhere in the library anymore... everything should use libMesh::default_comm().
No preference.
But ATM it would have to be comm().rank() and comm.size()
Personally I like the clarity of n_processors() and processor_id(), but that's just a preference.
-Ben
On Apr 4, 2013, at 10:22 AM, "Derek Gaston" notifications@github.com<mailto:notifications@github.com> wrote:
I actually agree Roy. In MOOSE I believe I'm going to name the member _comm... so that our users can do _comm->send(), etc.
this->comm->n_processors() is not bad at all. But for some reason this->communicator()->send() feels like it's too long ;-)
— Reply to this email directly or view it on GitHubhttps://github.com/libMesh/libmesh/issues/71#issuecomment-15903840.
BTW - I'm not against leaving n_processors() and processor_id() where they are in ParallelObject...
On Thu, Apr 4, 2013 at 10:36 AM, Benjamin S. Kirk notifications@github.comwrote:
Personally I like the clarity of n_processors() and processor_id(), but that's just a preference.
FWIW, +1.
I think "communicator methods all use the communicator" is clearer than "some communicator methods use the communicator, others don't". But that's not a strong preference, and anyway it sounds like I'm outvoted.
If everything should use libMesh::default_comm(), then we almost might as well rename "default_comm()" to "CommWorld", no? If we make its --disable-default-comm implementation be some class with an implicit conversion-to-Communicator operator (which throws an error) then we don't even need parentheses or virtual functions to use it as a default argument to methods expecting a Communicator&.
Sounds fine to me!
Actually - no it doesn't. Because you wouldn't get the error until you try to use it. I want the error to happen right when we try to assign it.
The one catch is that we then need some way for standard only-playing-with-the-world-communicator applications to initialize their stuff; "Mesh(CommWorld);" would break. Is LibMeshInit a ParallelObject yet? If so then we could just use the same API to grab its communicator.
The implicit conversion-to-Communicator operator would be called on assignment; it would throw a run-time error when that happened. You'd get a compile-time error for any other attempt to use CommWorld as a communicator.
Hmmm.... as long as the stack trace is clean and shows where the problem is... I'm ok with that.
Why would Mesh(CommWorld) fail? It would only fail in the case that you use --disable-default-comm right? That's what we want, right?
Testing to see if "at-tags" will help issue emails make it through to @libmesh-devel
I'll put together a "FakeCommunicator" class implementation sometime today or tonight, then, and we can try it out. The stack trace should work just fine.
We do want Mesh(CommWorld) to fail in the --disable-default-comm case, and we also want Mesh() to fail... but we want some way to create a mesh without failing. My proposal was "Mesh(libmeshinit.communicator());" Expresses what we'd want to do when creating a mesh in main(), but because LibMeshInit objects usually aren't in scope anywhere else it'd be hard for app codes to "abuse" and thereby inadvertently make themselves less extensible to the multi-communicator case.
What's the proper git procedure for this? Create a roystgnr/libmesh/fake_comm branch, regularly "pull --rebase" from benkirk/libmesh/communicators, then set up a pull request back to benkirk/libmesh/communicators when I'm done?
On Apr 4, 2013, at 11:07 AM, roystgnr notifications@github.com wrote:
What's the proper git procedure for this? Create a roystgnr/libmesh/fake_comm branch, regularly "pull --rebase" from benkirk/libmesh/communicators, then set up a pull request back to benkirk/libmesh/communicators when I'm done?
Yes, I think so. Pretty sure Derek forked from my branch and then sent me a pull request when it was done?
I didn't fork your repo (in the Github sense).
In my local repo that is cloned from my libMesh fork I added Ben's libMesh fork as a remote:
git remote add ben git@github.com:benkirk/libmesh.git
Then fetched to get Ben's branches:
git fetch ben
Then created a local tracking branch of his communicators branch:
git co -tb communicators ben/communicators
Then you can push that branch to your own libMesh fork repo on Github by doing:
git push origin communicators
(Note: That assumes that your clone was made from your own libMesh fork... so "origin" here is referring to your own libMesh fork on Github)
Then issue a pull request to Ben's repo and branch here:
https://github.com/libMesh/libmesh/pull/new/benkirk:communicators...master
The only weirdness there is that you have to go to the main libMesh/libmesh repo page and hit "Pull Request" to do a pull request to Ben's fork of the libmesh repo. There is no "Pull Request" button on Ben's repo because you never technically forked Ben's repo. But, the pull request page for the main repo allows you to select that you want to do a pull request to any of the forked repos (including Ben's).
One last thing.... I never issue pull requests for branches I'm actively working in. So what I actually did, is when I was ready to give Ben back my changes I made another branch from my local tracking branch and actually pushed that to GitHub instead. Like so:
/// Currently in my "communicators" branch that I made like above git co -b new_stuff_for_ben git push origin new_stuff_for_ben
Then issue the pull request from new_stuff_for_ben to Ben's communicators branch.
@roystgnr I was actually thinking that anyone working with --disable-default-comm would actually create their own Communicator object from MPI_COMM_WORLD (or whatever they want to use) and pass that in to Mesh. ie they would literally do that in main()
I don't think we want to do anything for you if you use --disable-default-comm.... that should be a very pedantic mode.
BTW - I'm planning to actually use that mode as the default for MOOSE. MOOSE now relies on the ability to run on sub-communicators.... so we never want to accidentally pick up COMM_WORLD.
On Apr 4, 2013, at 11:50 AM, Derek Gaston notifications@github.com wrote:
I didn't fork your repo (in the Github sense).
In my local repo that is cloned from my libMesh fork I added Ben's libMesh fork as a remote:
If y'all would prefer I can instead push my branch to libMesh instead of my own space so we can all write to it, but that seems pretty svn'ish of me...
Nah - this is really the right way to work with it. This is the whole point of "distributed" VC...
On Apr 4, 2013, at 12:38 PM, Derek Gaston notifications@github.com wrote:
Nah - this is really the right way to work with it. This is the whole point of "distributed" VC...
Back to the whole DefaultCommunicator business, though -
If we do it right, shouldn't we be able to remove the default constructor argument from everywhere except Mesh, and it should all work by design, right? The constant reference inside each ParallelObject is forced to be initialized by the containing object, and that unwinds all the way to the top mesh.
So the derived meshes (Serial, Parallel) are the only place where we refer to any sort of default communicator, and the rest falls through. Ok, and the e.g.
NumericVector<>::build() functions as well, those are the only places we should need to provide a default communicator, and only for backwards compatibility.
Basically I'm just verbalizing that right now there are way more instances of
(const libMesh::Parallel::Communicator& = libMesh::CommWorld )
than are actually required, thereby increasing the likelihood for mistakes.
-Ben
Email reply on this ticket just broke for me... is John still futzing with things?
Pasting to the web form:
Ben, your claim intrigues and confuses me.
To be able to use multiple communicators, we need to be able to pass non-default communicators in to ParallelObjects, except for ParallelObjects whose constructors already all require another ParallelObject subclass (e.g. EquationSystems always gets built with a Mesh) or whose parallel_only() methods already all require a ParallelObject subclass or something with a handle to one (e.g. EstimateError subclasses always get handed a System to work with)
To keep this from breaking old code, this argument has to have a default value.
What am I missing?
Am I just agreeing with you unintentionally, because it turns out that most of our classes do get passed other ParallelObjects before they do any parallel work?
OK, the time has come. In the subcell integration stuff I'm working I'd like to be able to do
This is a necessary step for being able to split an MPI communicator and execute physics A on one portion of the parallel resource and physics B on another.
My plan of attack is to begin with t he Mesh classes, since that is what I need now, and then propagate to the EquationSystems later.
Basically, the Mesh will have an optional constructor argument that takes a reference to a Parallel::Communicator, defaulting to CommWorld.
Once the mesh is done the EquationSystems will inherit its communicator from the Mesh, so that will be easy, just tedious.
@roystgnr, any other pitfalls I'm not thinking about?
In order to keep things like parallel_only() simple I may need to impose a standard naming convention, like have class_parallel_only() expand to
So here I assume that any class with a communicator object has a method communicator() that returns that object...