jsquyres opened this issue 7 years ago
@rhc54 A git bisect shows that 691237801bc1d99ec3f7abebfad6a5c5072eee1e is the first bad commit (i.e., to introduce the segv). The commit is from Wed, Sep 13, and had the commit message "Update to track PMIx master".
Can you have a look?
I notice the following:

```
$ mpirun --mca btl tcp,vader,self --mca oob tcp ./MPI_Bcast_c
```

It seems like `proc_hostname` was simply never initialized (!), and I fixed that in open-mpi/ompi@b3558f261b53587ec03bd430322927b8614bf22f. I will PR this to all the branches now; FWIW, I am very surprised this did not bite us earlier!
Thanks @ggouaillardet!
It looks like 552c9ca5a08 was in July of 2014. How is it possible that this lack of initialization didn't bite us before now? Does this fix require an immediate release on all branches?
Does this have something to do with `cutoff`? IIRC, it is enabled by default from v3.0, and if unused, I suspect `proc_hostname` always gets initialized later, but not too late.
I can point you to places where the issue might arise, but I have to give a talk and then run for a plane this morning - so it will be tonight. Short story: it is possible that we changed something in the latest PMIx that results in the hostname not being a copy of the stored info, but just a pointer to it. I don't know if that is true, but it should be checked.
@rhc54 Given the hash that git bisect found, I was wondering the same thing. I'll try to take a quick look this morning before the webex.
@jsquyres FWIW: the referenced commit was done just last night - so I assume you meant to cite the commit that the referenced commit fixed?
From an off-issue email: C99 doesn't protect us from uninitialized struct members in this case, because its zero-initialization guarantees only cover objects with static storage duration and members omitted from an explicit initializer -- not memory obtained from `malloc()`. This particular instance is an `OBJ_NEW`, which allocates on the heap, not the stack.

This gives rise to a different-but-related question: in https://github.com/open-mpi/ompi/blob/master/opal/class/opal_object.h#L478-L487, we only `calloc()` in `OBJ_NEW` when memchecker support is enabled. Otherwise, we `malloc()`. Is there a reason that we don't `calloc()` all the time? Sure, `calloc()` is likely slower than `malloc()`, but `OBJ` instances tend to be small, and we already aggressively freelist items when performance is critical. So -- should we change the back-end of `OBJ_NEW` to always `calloc()`?
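For illustration, here is a hedged sketch of what "always `calloc()`" could look like in the allocation helper behind `OBJ_NEW`; this is paraphrased from memory of `opal_object.h`, not the actual code, and the names may not match exactly:

```c
/* Hypothetical sketch: always zero new objects instead of only when
 * memchecker support is enabled.  Not the actual opal_object.h code. */
static inline opal_object_t *opal_obj_new(opal_class_t *cls)
{
    opal_object_t *object;

    /* calloc() is marginally slower than malloc(), but guarantees that
     * any member a constructor forgets to set starts out as zero/NULL. */
    object = (opal_object_t *) calloc(1, cls->cls_sizeof);
    if (NULL != object) {
        object->obj_class = cls;
        object->obj_reference_count = 1;
    }
    return object;
}
```

Since hot-path objects are typically freelisted anyway, the extra zeroing cost should rarely land on the critical path.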
@rhc54 691237801bc1d99ec3f7abebfad6a5c5072eee1e appears to be from approximately 2 weeks ago. That's the commit I was referring to (from https://github.com/open-mpi/ompi/issues/4264#issuecomment-332059192).
@jsquyres I was talking about your comment that https://github.com/open-mpi/ompi/commit/b3558f261b53587ec03bd430322927b8614bf22f was from July of 2014 😄
Just to be clear:

- `proc->proc_hostname` was never initialized.
- `proc_hostname` is (now) initialized to `NULL` in b3558f2, from last night.

The supposition is that perhaps this bug has been present for 3 years, but was only exposed due to recent changes in PMIx. I took a quick look through 6912378, but I don't see anything obvious that would expose this behavior. It will take a better PMIx expert than me to look at that commit and determine the risk level on all the release branches.
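For reference, the fix is essentially a one-line constructor change; the sketch below is paraphrased (not the exact diff from b3558f2):

```c
/* Paraphrased sketch of the b3558f2 fix: the opal_proc_t constructor
 * explicitly NULLs proc_hostname, since OBJ_NEW malloc()s the struct
 * and therefore does not zero it. */
static void opal_proc_construct(opal_proc_t *proc)
{
    proc->proc_hostname = NULL;    /* previously left uninitialized */
    /* ... initialization of the other members is unchanged ... */
}
```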
I dug into this a bit, and it seems this bites us only on `master`.

Short story: on rank 1, vpid 5 is `ompi_group_dense_lookup`'ed, and ultimately `ompi_proc_complete_init_single()` sets `proc_hostname` (on both `v3.0.x` and `master`). So far so good: `proc_hostname` is retrieved with `OPAL_MODEX_RECV_VALUE_OPTIONAL()`, so it is fine if `master` sets it to `NULL`. So even if the `opal_proc_construct()` constructor does not initialize `proc_hostname`, it ends up initialized shortly after.

Now on rank 0, vpid 5 is "retrieved" from vpid 1 via `ompi_comm_get_rprocs()`, and things get ugly:

- on `v3.0.x`, `proc_hostname` (the real hostname) is packed and unpacked and then set;
- but on `master`, `proc_hostname` (`NULL` here) is packed and unpacked, and then dropped, so on rank 0 the `proc_hostname` of vpid 5 is left uninitialized.

The crash involves both recent PMIx and inter-communicators, so at this stage I'd rather consider that the release branches have a bug that does not bite us, so there is no emergency to fix them. (And `master` has already been fixed, so we're fine.)
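To make the "optional" retrieval concrete, here is a hedged sketch of the kind of lookup done in `ompi_proc_complete_init_single()`; the key name and macro arguments are recalled from memory and may not match the source exactly:

```c
/* Sketch: the hostname is fetched as an *optional* modex value. */
int rc;
char *hostname = NULL;

OPAL_MODEX_RECV_VALUE_OPTIONAL(rc, OPAL_PMIX_HOSTNAME,
                               &proc->super.proc_name,
                               (char **) &hostname, OPAL_STRING);
if (OPAL_SUCCESS == rc) {
    proc->super.proc_hostname = hostname;
}
/* else: the hostname was not available, and proc_hostname keeps its
 * constructor value (NULL) rather than garbage. */
```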
Just to be clear: you're saying that it looks like the real error is that PMIx is packing/sending the hostname, but then after receiving/unpacking it at the receiver, it is dropped before it is assigned to `proc->proc_hostname`?
Also, you're saying that master is "fixed" -- but you mean "master is not seg faulting", right? ...or is PMIx intentionally dropping the hostname? That would seem weird; if we didn't want the hostname, we shouldn't be sending it (or have PMIx send it) in the first place.
Not really; I was just detailing which path leads to a crash, explaining why only `master` is affected. Initializing `proc_hostname` in the constructor is way simpler and fixes the current crash (and maybe other current and/or future scenarios).

Note that all of this is happening at the `ompi` level, and we can leave PMIx out of the picture: *the root cause is that `proc_hostname` was used uninitialized, and it happens that before the PMIx-related commit you mentioned earlier, we were lucky and `malloc()` returned zeroed memory*.

But you have a point: since `proc_hostname` seems optional at this stage, maybe `ompi` could ignore it in `ompi_{pack,unpack}_proc()`. Also, the buffer size can easily be optimized by sending each namespace once (instead of sending one namespace per proc). I will make a proof of concept for this.

On a broader scope, `proc_hostname` could also be a pointer to a pool of unique hostnames (currently, when set, `proc_hostname` is `strdup`'ed once per proc). If some benefit is expected, I can also make a proof of concept.
@rhc54 and I talked about this on the phone this morning.
I have an alternate proposal: the point of not initializing `proc->proc_hostname` is to save space (especially in very large jobs) because we'll potentially `strdup()` the same hostname many times -- it's a big waste of space.

How about making a ref-counted string? Perhaps `opal_string_t` (or `pmix_string_t`)? It can be a super-simple class -- just a `(char*)`. Its mission in life is solely to `strdup()` the string at the beginning of time to internal storage, and then to `free()` it in the destructor.

If `opal_proc_t` then adds an `(opal_string_t*)` to store the hostname (and we change the `opal_proc_t` destructor to `OBJ_RELEASE` the `opal_string_t` instead of `free()`ing the string itself), process X will only have one copy of that hostname, even if N of process X's peers share the same hostname. The `opal_proc_t` could even assign the internal `(char*)` from the `opal_string_t` to `opal_proc_t.proc_hostname` for backwards compatibility (i.e., so that we don't have to update a zillion places across the code base).

This scheme doesn't fully address the scaling issues (e.g., for jobs that span >= thousands of nodes) because each process will still have its own copy of all of the hostnames, but it may allow us to adjust the `cutoff` value higher...?
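A minimal sketch of what the proposed ref-counted string class could look like, using OPAL's object system; this is purely illustrative (the class does not exist yet), and the exact names and layout are assumptions:

```c
#include <stdlib.h>
#include <string.h>
#include "opal/class/opal_object.h"

/* Proposed (illustrative) ref-counted string: strdup() once at creation,
 * free() exactly once when the last reference is released. */
typedef struct {
    opal_object_t super;
    char *string;
} opal_string_t;

static void opal_string_construct(opal_string_t *s)
{
    s->string = NULL;
}

static void opal_string_destruct(opal_string_t *s)
{
    free(s->string);
    s->string = NULL;
}

OBJ_CLASS_INSTANCE(opal_string_t, opal_object_t,
                   opal_string_construct, opal_string_destruct);

/* Convenience constructor (hypothetical helper). */
static opal_string_t *opal_string_new(const char *value)
{
    opal_string_t *s = OBJ_NEW(opal_string_t);
    if (NULL != s && NULL != value) {
        s->string = strdup(value);
    }
    return s;
}
```

Each `opal_proc_t` sharing a hostname would then `OBJ_RETAIN()` the same `opal_string_t` and `OBJ_RELEASE()` it in its destructor; `proc_hostname` itself could keep pointing at `s->string` for backwards compatibility, as described above.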
Another idea: perhaps the hostnames can just stay in shared memory, i.e., there would be one set of all the hostnames per host rather than per process. Indeed, `opal_proc_t->proc_hostname` can just point to the hostname in shared memory, but since `opal_proc_t` wouldn't "own" that memory, it would not free it in the destructor. Instead, the `orted` (or whoever owns the shared memory) would ultimately free it at the end of time.

This would be an even further memory reduction compared to the above proposal -- instead of having one copy of all the hostnames in each process, there would be one copy of the hostnames per host. (Perhaps ORTE / PMIx already does this kind of thing in non-direct-launch scenarios...?)
Just some thoughts...
FWIW: PMIx uses a shared memory region to pass info to clients (direct launched or not); it contains a regular expression reporting the hostname for every proc. We no longer store those individually due to the memory footprint at large scale. We then have an API that lets you request the info - that code gets the regex (from the shmem) and parses it to locally return the answer. However, it means that you are always carrying at least one copy of the hostname in each process.
This represents the minimum memory footprint (at least, for the PMIx server), but does introduce a time cost to retrieve the hostname and may not be a global minimum as we still have copies in the client procs. If that isn't acceptable, we could store the hostname once (50k * 30 char/name isn't too horrible) in the server, and then let clients just have a pointer to that shmem location. If every client is asking for hostnames, this might represent a more global minimum solution.
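For illustration, this is roughly what the client-side lookup looks like through the public PMIx API; a hedged sketch with error handling abbreviated:

```c
/* Hedged sketch: ask PMIx for a peer's hostname through the public API.
 * The regex/shared-memory handling described above is entirely hidden
 * behind PMIx_Get(); the client just sees a string value come back. */
#include <string.h>
#include <pmix.h>

static char *hostname_of(const pmix_proc_t *peer)
{
    pmix_value_t *val = NULL;
    char *hostname = NULL;

    if (PMIX_SUCCESS == PMIx_Get(peer, PMIX_HOSTNAME, NULL, 0, &val) &&
        NULL != val && PMIX_STRING == val->type) {
        hostname = strdup(val->data.string);  /* caller owns this copy */
    }
    if (NULL != val) {
        PMIX_VALUE_RELEASE(val);
    }
    return hostname;
}
```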
> the point of not initializing proc->proc_hostname is to save space

The initial issue was that we do not initialize `proc_hostname` to `NULL`. I guess you meant instead:

> the point of not always retrieving proc->proc_hostname is to save space

I can definitely implement your first proposal as a proof of concept. The second one is even better from a memory consumption point of view; PMIx does something similar for the dstore, but I am not sure Open MPI can directly use this. @rhc54 any thoughts?
I would gravitate to the second proposal except that we don't have PMIx available everywhere (e.g., Cray still hasn't adopted it). The first proposal has the benefit of working everywhere, though it isn't as optimal a solution.
Where PMIx is available, there is no problem with PMIx returning a pointer to the string instead of a copy of it - it's just another info-key passed to "get" and easy to add to the OPAL macros.
I'd prefer proposal 1 as it is simpler and can be implemented entirely within ompi/opal.
I'm confused about the usage of `ompi_comm_get_rprocs`. This function is only supposed to be used for inter-communicators (and it is), so I fail to see how this can impact the examples that started this discussion.
IIRC, the Intel test suite always does some communicator creation at the start - we've been burned by that before.
`ompi_comm_get_rprocs` is only used from `MPI_Intercomm_create`. As far as I can see, MPI_Bcast_c does not call `MPI_Intercomm_create`. Other communicator creation functions take a different path, and are always supposed to complete with `ompi_proc_complete_init_single`.
@bosilca the Intel test suite has a "framework" to test a given subroutine with various communicators and datatypes. `MPI_Intercomm_create()` is invoked under the hood, and it might require a minimum number of MPI tasks; with 8 MPI tasks, it is definitely invoked.
I finally figured out that the reason `ompi_proc_{pack,unpack}` packs/unpacks `proc_hostname` is to flag the local nodes. Since `proc_hostname` might be `NULL`, I am not sure this works in all cases. FWIW, I have not (yet) built a test program that triggers a bug here.
@jsquyres @hppritcha @rhc54 I ended up creating a new framework; I will complete it by tomorrow and issue a PR. At that time, there should be two components: one of them is based on an `opal_string_t` class, so there will be a unique, ref-counted string per hostname vs. one (or zero) string per MPI task. A third component based on shared memory (between `orted` and the MPI tasks) will hopefully integrate easily within this framework.

I was not inspired when naming this framework, so I simply called it `pool`; feel free to suggest a more descriptive name (or your best Star Wars character).
In Cisco MTT runs (and manually), I'm getting a segv in MPI_Finalize with many of the Intel tests. I'm honestly not sure where the error is actually occurring -- it looks like a double free of a proc in MPI_Finalize.

Here's an example from `MPI_Bcast_c` -- I see it when running on two servers with as low as NP=8/PPN=4:

Somehow it seems like the `proc_hostname` memory is getting freed too early...?