open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Segv in new monitoring coll #3769

Open jsquyres opened 7 years ago

jsquyres commented 7 years ago

@bosilca @clementFoyer Cisco's MTT is seeing segvs that appear to be related to the monitoring coll component. See https://mtt.open-mpi.org/index.php?do_redir=2454, for example.

This also raises another point: I didn't do anything to enable the monitoring components. Are the monitoring components supposed to be enabled by default?

And if not, should I add MTT runs with them explicitly enabled? If so, what's the Right way to enable them all?

clementFoyer commented 7 years ago

It shouldn't be enabled by default. At a minimum, you need to set --mca pml_monitoring_enable 1 at launch to start it.
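
For reference, a minimal sketch of what such a launch might look like. The process count and the application name `./my_mpi_app` are placeholders, not taken from the MTT runs. The second form assumes Open MPI's usual convention of reading MCA parameters from `OMPI_MCA_`-prefixed environment variables, and whether other monitoring components need additional parameters is not covered here.

```shell
# Option 1: pass the MCA parameter on the mpirun command line.
# "-np 2" and "./my_mpi_app" are hypothetical placeholders.
mpirun -np 2 --mca pml_monitoring_enable 1 ./my_mpi_app

# Option 2: set the same parameter through the environment
# (OMPI_MCA_<param_name> convention), then launch as usual.
export OMPI_MCA_pml_monitoring_enable=1
mpirun -np 2 ./my_mpi_app
```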

rhc54 commented 7 years ago

We think you may have a bug in your glue code that is failing even when monitoring is not enabled.

clementFoyer commented 7 years ago

Would it be possible to send me a sample so I can check this issue?

bosilca commented 7 years ago

It is not enabled in most of the Jenkins checks, and all tests were successful. I can't replicate it locally either.

jsquyres commented 7 years ago

Sure -- are you looking for more info than the stack traces that are on MTT? I.e., tell me what you need, and I'll see if I can get it (since it happened at Cisco/MTT).

bosilca commented 7 years ago

I can reproduce.

bosilca commented 7 years ago

f8ffec926ee should fix the issue.