open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org
Other
2.12k stars 856 forks source link

Huge number of leaked symbols when compiling with icc #4099

Open jsquyres opened 7 years ago

jsquyres commented 7 years ago

Are we not specifying some flag to icc properly?

When compiling with icc, 17.0 the symbol name checker test (in make check) finds a ginormous number of leaked symbols. The report is too long to include inline -- see attached:

make-check-symbol-checker-icc-fail.txt

This causes MTT fails with make check, such as https://mtt.open-mpi.org/index.php?do_redir=2468.

rhc54 commented 7 years ago

when you disable-dlopen, lots of symbols are suddenly going to be in our library as everything gets swept up into it. Most of these symbols are coming from libraries you linked against, not from OMPI itself.

jsquyres commented 7 years ago

Yes, but even if we disable dlopen, all those symbols should be private -- and they are with gcc.

Also, I don't know why the test would be showing symbols from dependent libraries (like libm).

@markalle Any ideas?

markalle commented 7 years ago

Well, about the --disable-dlopen I did fix up the nm test so when all those extra MCA symbols get pulled into the main library those are all merely identified as warnings. I think it would be nice to have those names cleaned up too, since that would still be name pollution for people who build --disable-dlopen. But since the main path is to have MCAs as .so files I don't care very much about all the warning symbols.

For the errors shown here I'd agree the nm test should ignore those. You must be right that "icc" is creating those, and in general anything the compiler piles into the executable should be considered legal. The purpose of the nm test is to make sure OMPI isn't doing name pollution that could conflict with a user's application, not to nitpick compiler internals.

I don't know how to be automate that better, but for now I think this just means I should grow the list a little bit of symbols that are considered legal. Want me to make a PR for that?

rhc54 commented 7 years ago

I think you're fighting an unwinnable battle. Every compiler and configure option is going to result in more symbols that have to be white-listed, and you'll have to update it with the release of every new version.

A better approach might be to search for symbols that contain things used in OMPI - e.g., "btl" - but that also has problems.

Bottom line is: while you have a noble goal, it is ultimately unachievable in an automated fashion. At the very least, please stop having this test cause a failure!

markalle commented 7 years ago

I don't think it'll be that hard to automate. If I look at .o instead of .so wouldn't that avoid extra compiler-generated symbols? In that case I'd still start by looking at the symbols in a .so, but also have a list that came from the .o files and filter the former list to only include symbols that also exist in the latter list.

If I only look at symbols with particular pre-selected names the test would miss its whole purpose, which is to catch accidental additions of symbols like create_list() that are guaranteed to conflict with applications.

I agree it's a nuisance when the test produces false failures though. If I changed it to never fail, how hard is it to pull up the output from everybody's MTT test to browse? What I'd do is disable the test until it has a longer history of not producing false failures.

rhc54 commented 7 years ago

I don't know how you make this work, but you are welcome to beat your head against it 😄

I think this test should not be part of "make check", but instead be in MTT and you can run it on your machines (and make it fail) - those of us who don't care are then not impacted by it.

markalle commented 7 years ago

Symbol name pollution isn't just a cosmetic issue though, it's a real bug when an application and the MPI library use the same global symbol. To the extent the test works its errors shouldn't be ignored. When it gives nuisance false positives though, that I totally agree nobody should be forced to deal with.

So I've made a PR to always exit(0) on the belief that it has too many false positives: https://github.com/open-mpi/ompi/pull/4119

This PR also does

  1. uses the automake TAP test system that lest the test print messages beyond pass/fail while still passing. This is something I hope to browse MTT tests for and eventually re-enable the test after it has a longer history of not producing false failures
  2. I fixed the test to only consider symbols that exist in .o files, not .so where the compiler-specific symbols were being added, so I think this shouldn't produce false failures anymore, but we'll give it some time to be sure
pmixbot commented 7 years ago

I think you are missing the intended purpose of "make check" in our repository. Those tests are intended to be a quick method by which someone installing the code can ensure that something isn't broken. The directory isn't intended to be used by developers as a place to test whether internal code is broken - that's why we have all those MTT tests.

This test has nothing to do with whether or not the resulting build is broken. While I agree it isn't a cosmetic issue, it is also true that we don't get complaints about symbol issues on our mailing lists, so this isn't a red hot issue out there.

The test therefore really belongs in the ompi-tests repo so we can run it nightly and see the results in the MTT summary. Once moved there, you can use a controlled configuration to create a "fingerprint" of the symbol report, store that in the repo, and then simply compare the result of each execution (using the exact same configuration) to the fingerprint. It becomes really trivial to detect changes, and then someone can look to see where they came from.

You cannot do that in "make check" because you have no control over the user's environment. So you've made your life 100x harder than it needs to be, and that's why you get all these false positives. The current method will never result in zero false positives because everyone's compiler and configuration differs, and that impacts the symbol results.

rhc54 commented 7 years ago

😆 didn't realize I was still signed in on the userid we were setting up for testing.

markalle commented 7 years ago

Thanks, that makes sense. I don't object to MTT. I've updated the pull request to disable it

rhc54 commented 7 years ago

Keep in mind that you can also drop all that "whitelist" stuff as all you care about is difference from the "gold" fingerprint. Given you now control the configuration/compiler combination, the only change should be coming from something we do.

jsquyres commented 7 years ago

Per Webex discussion 22 Aug 2017: we should remove this test from the critical path (i.e., remove it from make check altogether), fix it, and then put it back. I.e., everyone agrees that the intent of this test is good. But it's causing a bunch of false failures. So it needs to be taken out, evaluated / hardened up (if possible), and (if possible) returned to the critical path.