jsquyres opened this issue 7 years ago
@rhc54 A git bisect shows that 691237801bc1d99ec3f7abebfad6a5c5072eee1e is the first bad commit (i.e., to introduce the segv). The commit is from Wed, Sep 13, and had the commit message "Update to track PMIx master".
Can you have a look?
I notice the following:

```
$ mpirun --mca btl tcp,vader,self --mca oob tcp ./MPI_Bcast_c
```

It seems like `proc_hostname` was simply never initialized (!), and I fixed that in open-mpi/ompi@b3558f261b53587ec03bd430322927b8614bf22f. I will PR this to all the branches now; FWIW, I am very surprised this did not bite us earlier!
Thanks @ggouaillardet!
It looks like 552c9ca5a08 was in July of 2014. How is it possible that this lack of initialization didn't bite us before now? Does this fix require an immediate release on all branches?
Does this have something to do with `cutoff`? IIRC, it is enabled by default from v3.0, and if unused, I suspect `proc_hostname` always gets initialized later, but not too late.
I can point you to places where the issue might arise, but I have to give a talk and then run for a plane this morning - so it will be tonight. Short story: it is possible that we changed something in the latest PMIx that results in the hostname not being a copy of the stored info, but just a pointer to it. I don't know if that is true, but it should be checked.
@rhc54 Given the hash that git bisect found, I was wondering the same thing. I'll try to take a quick look this morning before the webex.
@jsquyres FWIW: the referenced commit was done just last night - so I assume you meant to cite the commit that the referenced commit fixed?
From an off-issue email: C99 doesn't protect us from uninitialized struct members in this case, because its zero-initialization guarantees only cover objects with static storage duration and members omitted from an explicit initializer -- not memory obtained from `malloc()`. This particular instance is an `OBJ_NEW`, which allocates on the heap, not the stack.

This gives rise to a different-but-related question: in https://github.com/open-mpi/ompi/blob/master/opal/class/opal_object.h#L478-L487, we only `calloc()` in `OBJ_NEW` when memchecker support is enabled. Otherwise, we `malloc()`. Is there a reason that we don't `calloc()` all the time? Sure, `calloc()` is likely slower than `malloc()`, but `OBJ` instances tend to be small, and we already aggressively freelist items when performance is critical. So -- should we change the back-end of `OBJ_NEW` to always `calloc()`?
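For illustration, here is a hedged sketch of what "always `calloc()`" could look like in the allocation helper behind `OBJ_NEW`; this is paraphrased from memory of `opal_object.h`, not the actual code, and the names may not match exactly:

```c
/* Hypothetical sketch: always zero new objects instead of only when
 * memchecker support is enabled.  Not the actual opal_object.h code. */
static inline opal_object_t *opal_obj_new(opal_class_t *cls)
{
    opal_object_t *object;

    /* calloc() is marginally slower than malloc(), but guarantees that
     * any member a constructor forgets to set starts out as zero/NULL. */
    object = (opal_object_t *) calloc(1, cls->cls_sizeof);
    if (NULL != object) {
        object->obj_class = cls;
        object->obj_reference_count = 1;
    }
    return object;
}
```

Since hot-path objects are typically freelisted anyway, the extra zeroing cost should rarely land on the critical path.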
@rhc54 691237801bc1d99ec3f7abebfad6a5c5072eee1e appears to be from approximately 2 weeks ago. That's the commit I was referring to (from https://github.com/open-mpi/ompi/issues/4264#issuecomment-332059192).
@jsquyres I was talking about your comment that https://github.com/open-mpi/ompi/commit/b3558f261b53587ec03bd430322927b8614bf22f was from July of 2014 😄
Just to be clear:

- `proc->proc_hostname` was never initialized.
- `proc_hostname` is (now) initialized to `NULL` in b3558f2, from last night.

The supposition is that perhaps this bug has been present for 3 years, but was only exposed due to recent changes in PMIx. I took a quick look through 6912378, but I don't see anything obvious that would expose this behavior. It will take a better PMIx expert than me to look at that commit and determine the risk level on all the release branches.
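For reference, the fix is essentially a one-line constructor change; the sketch below is paraphrased (not the exact diff from b3558f2):

```c
/* Paraphrased sketch of the b3558f2 fix: the opal_proc_t constructor
 * explicitly NULLs proc_hostname, since OBJ_NEW malloc()s the struct
 * and therefore does not zero it. */
static void opal_proc_construct(opal_proc_t *proc)
{
    proc->proc_hostname = NULL;    /* previously left uninitialized */
    /* ... initialization of the other members is unchanged ... */
}
```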
I dug into this a bit, and it seems this bites us only on `master`.

Short story: on rank 1, vpid 5 is `ompi_group_dense_lookup`'ed, and ultimately `ompi_proc_complete_init_single()` sets `proc_hostname` (on both `v3.0.x` and `master`). So far so good: `proc_hostname` is retrieved with `OPAL_MODEX_RECV_VALUE_OPTIONAL()`, so it is fine if `master` sets it to `NULL`. So even if the `opal_proc_construct()` constructor does not initialize `proc_hostname`, it ends up initialized shortly after.

Now on rank 0, vpid 5 is "retrieved" from vpid 1 via `ompi_comm_get_rprocs()`, and things get ugly:

- on `v3.0.x`, `proc_hostname` (the real hostname) is packed and unpacked and then set;
- but on `master`, `proc_hostname` (`NULL` here) is packed and unpacked, and then dropped, so on rank 0 the `proc_hostname` of vpid 5 is left uninitialized.

The crash involves both recent PMIx and inter-communicators, so at this stage I'd rather consider that the release branches have a bug that does not bite us, so there is no emergency to fix them. (And `master` has already been fixed, so we're fine.)
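To make the "optional" retrieval concrete, here is a hedged sketch of the kind of lookup done in `ompi_proc_complete_init_single()`; the key name and macro arguments are recalled from memory and may not match the source exactly:

```c
/* Sketch: the hostname is fetched as an *optional* modex value. */
int rc;
char *hostname = NULL;

OPAL_MODEX_RECV_VALUE_OPTIONAL(rc, OPAL_PMIX_HOSTNAME,
                               &proc->super.proc_name,
                               (char **) &hostname, OPAL_STRING);
if (OPAL_SUCCESS == rc) {
    proc->super.proc_hostname = hostname;
}
/* else: the hostname was not available, and proc_hostname keeps its
 * constructor value (NULL) rather than garbage. */
```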
Just to be clear: you're saying that it looks like the real error is that PMIx is packing/sending the hostname, but then after receiving/unpacking it at the receiver, it is dropped before it is assigned to `proc->proc_hostname`?
Also, you're saying that master is "fixed" -- but you mean "master is not seg faulting", right? ...or is PMIx intentionally dropping the hostname? That would seem weird; if we didn't want the hostname, we shouldn't be sending it (or have PMIx send it) in the first place.
Not really; I was just detailing which path leads to a crash, explaining why only `master` is affected. Initializing `proc_hostname` in the constructor is way simpler and fixes the current crash (and maybe other current and/or future scenarios).

Note that all of this is happening at the `ompi` level, and we can leave PMIx out of the picture: *the root cause is that `proc_hostname` was used uninitialized, and it happens that before the PMIx-related commit you mentioned earlier, we were lucky and `malloc()` returned zeroed memory*.

But you have a point: since `proc_hostname` seems optional at this stage, maybe `ompi` could ignore it in `ompi_{pack,unpack}_proc()`. Also, the buffer size can easily be optimized by sending each namespace once (instead of sending one namespace per proc). I will make a proof of concept for this.

On a broader scope, `proc_hostname` could also be a pointer to a pool of unique hostnames (currently, when set, `proc_hostname` is `strdup`'ed once per proc). If some benefit is expected, I can also make a proof of concept.
@rhc54 and I talked about this on the phone this morning.
I have an alternate proposal: the point of not initializing `proc->proc_hostname` is to save space (especially in very large jobs) because we'll potentially `strdup()` the same hostname many times -- it's a big waste of space.

How about making a ref-counted string? Perhaps `opal_string_t` (or `pmix_string_t`)? It can be a super-simple class -- just a `(char*)`. Its mission in life is solely to `strdup()` the string at the beginning of time to internal storage, and then to `free()` it in the destructor.

If `opal_proc_t` then adds an `(opal_string_t*)` to store the hostname (and we change the `opal_proc_t` destructor to `OBJ_RELEASE` the `opal_string_t` instead of `free()`ing the string itself), process X will only have one copy of that hostname, even if N of process X's peers share the same hostname. The `opal_proc_t` could even assign the internal `(char*)` from the `opal_string_t` to `opal_proc_t.proc_hostname` for backwards compatibility (i.e., so that we don't have to update a zillion places across the code base).

This scheme doesn't fully address the scaling issues (e.g., for jobs that span >= thousands of nodes) because each process will still have its own copy of all of the hostnames, but it may allow us to adjust the `cutoff` value higher...?
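A minimal sketch of what the proposed ref-counted string class could look like, using OPAL's object system; this is purely illustrative (the class does not exist yet), and the exact names and layout are assumptions:

```c
#include <stdlib.h>
#include <string.h>
#include "opal/class/opal_object.h"

/* Proposed (illustrative) ref-counted string: strdup() once at creation,
 * free() exactly once when the last reference is released. */
typedef struct {
    opal_object_t super;
    char *string;
} opal_string_t;

static void opal_string_construct(opal_string_t *s)
{
    s->string = NULL;
}

static void opal_string_destruct(opal_string_t *s)
{
    free(s->string);
    s->string = NULL;
}

OBJ_CLASS_INSTANCE(opal_string_t, opal_object_t,
                   opal_string_construct, opal_string_destruct);

/* Convenience constructor (hypothetical helper). */
static opal_string_t *opal_string_new(const char *value)
{
    opal_string_t *s = OBJ_NEW(opal_string_t);
    if (NULL != s && NULL != value) {
        s->string = strdup(value);
    }
    return s;
}
```

Each `opal_proc_t` sharing a hostname would then `OBJ_RETAIN()` the same `opal_string_t` and `OBJ_RELEASE()` it in its destructor; `proc_hostname` itself could keep pointing at `s->string` for backwards compatibility, as described above.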
Another idea: perhaps the hostnames can just stay in shared memory, i.e., there would be one set of all the hostnames per host rather than per process. Indeed, `opal_proc_t->proc_hostname` can just point to the hostname in shared memory, but since `opal_proc_t` wouldn't "own" that memory, it would not free it in the destructor. Instead, the `orted` (or whoever owns the shared memory) would ultimately free it at the end of time.

This would be an even further memory reduction compared to the above proposal -- instead of having one copy of all the hostnames in each process, there would be one copy of the hostnames per host. (Perhaps ORTE / PMIx already does this kind of thing in non-direct-launch scenarios...?)
Just some thoughts...
FWIW: PMIx uses a shared memory region to pass info to clients (direct launched or not); it contains a regular expression reporting the hostname for every proc. We no longer store those individually due to the memory footprint at large scale. We then have an API that lets you request the info - that code gets the regex (from the shmem) and parses it to locally return the answer. However, it means that you are always carrying at least one copy of the hostname in each process.
This represents the minimum memory footprint (at least, for the PMIx server), but does introduce a time cost to retrieve the hostname and may not be a global minimum as we still have copies in the client procs. If that isn't acceptable, we could store the hostname once (50k * 30 char/name isn't too horrible) in the server, and then let clients just have a pointer to that shmem location. If every client is asking for hostnames, this might represent a more global minimum solution.
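For illustration, this is roughly what the client-side lookup looks like through the public PMIx API; a hedged sketch with error handling abbreviated:

```c
/* Hedged sketch: ask PMIx for a peer's hostname through the public API.
 * The regex/shared-memory handling described above is entirely hidden
 * behind PMIx_Get(); the client just sees a string value come back. */
#include <string.h>
#include <pmix.h>

static char *hostname_of(const pmix_proc_t *peer)
{
    pmix_value_t *val = NULL;
    char *hostname = NULL;

    if (PMIX_SUCCESS == PMIx_Get(peer, PMIX_HOSTNAME, NULL, 0, &val) &&
        NULL != val && PMIX_STRING == val->type) {
        hostname = strdup(val->data.string);  /* caller owns this copy */
    }
    if (NULL != val) {
        PMIX_VALUE_RELEASE(val);
    }
    return hostname;
}
```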
> the point of not initializing proc->proc_hostname is to save space

The initial issue was that we do not initialize `proc_hostname` to `NULL`. I guess you meant instead:

> the point of not always retrieving proc->proc_hostname is to save space

I can definitely implement your first proposal as a proof of concept. The second one is even better from a memory consumption point of view; PMIx does something similar for the dstore, but I am not sure Open MPI can directly use this. @rhc54 any thoughts?
I would gravitate to the second proposal except that we don't have PMIx available everywhere (e.g., Cray still hasn't adopted it). The first proposal has the benefit of working everywhere, though it isn't as optimal a solution.
Where PMIx is available, there is no problem with PMIx returning a pointer to the string instead of a copy of it - it's just another info-key passed to "get" and easy to add to the OPAL macros.
I'd prefer proposal 1 as it is simpler and can be implemented entirely within ompi/opal.
I'm confused about the usage of `ompi_comm_get_rprocs`. This function is only supposed to be used for inter-communicators (and it is), so I fail to see how this can impact the examples that started this discussion.
IIRC, the Intel test suite always does some communicator creation at the start - we've been burned by that before.
`ompi_comm_get_rprocs` is only used from `MPI_Intercomm_create`. As far as I can see, MPI_Bcast_c does not call `MPI_Intercomm_create`. Other communicator creation functions take a different path, and are always supposed to complete with `ompi_proc_complete_init_single`.
@bosilca the Intel test suite has a "framework" to test a given subroutine with various communicators and datatypes. `MPI_Intercomm_create()` is invoked under the hood, and it might require a minimum number of MPI tasks; with 8 MPI tasks, it is definitely invoked.
I finally figured out that the reason `ompi_proc_{pack,unpack}` packs/unpacks `proc_hostname` is to flag the local nodes. Since `proc_hostname` might be `NULL`, I am not sure this works in all cases. FWIW, I have not (yet) built a test program that triggers a bug here.
@jsquyres @hppritcha @rhc54 I ended up creating a new framework; I will complete it by tomorrow and issue a PR. At that time, there should be two components: one of them is based on an `opal_string_t` class, so there will be a unique, ref-counted string per hostname vs. one (or zero) string per MPI task. A third component based on shared memory (between `orted` and the MPI tasks) will hopefully integrate easily within this framework.

I was not inspired when naming this framework, so I simply called it `pool`; feel free to suggest a more descriptive name (or your best Star Wars character).
In Cisco MTT runs (and manually), I'm getting a segv in MPI_Finalize with many of the Intel tests. I'm honestly not sure where the error is actually occurring -- it looks like a double free of a proc in MPI_Finalize.

Here's an example from `MPI_Bcast_c` -- I see it when running on two servers with as low as NP=8/PPN=4:

Somehow it seems like the `proc_hostname` memory is getting freed too early...?