Closed: eli-schwartz closed this issue 3 months ago
This is a slightly different issue than, but directly related to #12674. See https://github.com/open-mpi/ompi/issues/12674#issuecomment-2243804406 for further details.
Thanks for the report!
@jsquyres in this case, the UB is far more egregious. Code such as
uint64_t key = *((uint64_t*)&tmp);
routinely breaks even on current versions of GCC and Clang, and I'd seriously consider fixing this.
In particular, type punning is legal to do, but you have to do it "properly"; e.g., memcpy is superior to casting and typically generates the same code as one wanted the cast to produce.
This is the kind of thing that routinely breaks because you did -O2, not because of -flto, but -flto simply causes it to break "even more often". As I noted in the OP:
strict-aliasing issues are always bad but LTO can make them even worse
and it still merits fixing regardless of LTO. Maybe I shouldn't have mentioned LTO in the first place.
@SoapGentoo Your example is incomplete; what exactly is `tmp`?
Please let's avoid having this discussion again here.
We use pointers to objects with a well-defined alignment attribute to hide additional information. In this particular instance, bit 0 signals whether this remote proc structure has been expanded or not: if it is unset (aka 0), the value is a properly aligned pointer (in which case it points to one of our internal objects); if it is set (aka 1), the content instead contains the unique identifier (uint64_t) of the target process, shifted. As I mentioned in the other discussion, we know exactly what we are doing, and we are doing it on purpose.
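For illustration, here is a rough sketch of this kind of bit-0 tagging. The names are hypothetical and this is not the actual OMPI code; it also assumes 64-bit pointers so the shifted identifier fits in a `uintptr_t`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of the bit-0 tagging scheme described above,
 * not the actual OMPI code. Assumes 64-bit pointers. */

/* Set bit 0: the slot holds a shifted process identifier, not a pointer. */
static inline void *pack_proc_id(uint64_t proc_id) {
    return (void *)(uintptr_t)((proc_id << 1) | (uintptr_t)1);
}

/* Bit 0 unset: the slot is a properly aligned pointer to an internal object. */
static inline int slot_is_pointer(const void *slot) {
    return ((uintptr_t)slot & 1) == 0;
}

/* Recover the identifier from a tagged slot. */
static inline uint64_t unpack_proc_id(const void *slot) {
    assert(!slot_is_pointer(slot));
    return (uint64_t)((uintptr_t)slot >> 1);
}
```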
Again, this is totally unrelated and misdiagnosed as being related. This has nothing to do with function signatures, nor with changing the function signatures of the API/ABI. LTO imposes additional unrelated requirements above and beyond being strict-aliasing clean -- whereas this is about miscompilation with -O2.
The C standard allows type punning. It just says you aren't allowed to do it by dereferencing a cast pointer, and memcpy is the recommended way to do your type punning instead.
Sometimes, a cast for something that cannot validly alias will produce garbage code due to UB. In the cases where it does what you actually wanted it to do, a memcpy compiled by a good compiler such as GCC produces the same machine code as the cast.
memcpy is guaranteed to avoid UB. GCC guarantees your memcpy is efficient.
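As a concrete illustration (a minimal sketch, not OMPI code), here are the two ways of inspecting the bits of a float:

```c
#include <stdint.h>
#include <string.h>

/* UB: reads a float object through a uint32_t lvalue (aliasing violation). */
uint32_t float_bits_cast(float f) {
    return *(uint32_t *)&f;
}

/* Well defined: memcpy performs the same type punning without violating the
 * aliasing rules; GCC and Clang typically compile this down to a single
 * register move or 32-bit load at -O2. */
uint32_t float_bits_memcpy(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```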
...
The code in question here is completely inside an internal routine, and changing the mechanism you use for type punning won't affect code outside that function.
Let me give you a simple example why your code is broken here:
#include <stdint.h>
typedef struct {
uint32_t jobid;
uint32_t vpid;
} opal_process_name_t;
uint64_t foo(uint64_t* x, opal_process_name_t* y) {
*x = 0;
y->jobid = 1;
y->vpid = 1;
return *x;
}
uint64_t bar(opal_process_name_t* y) {
return foo((uint64_t*)y, y);
}
In `bar()` we cast the `opal_process_name_t*` to a `uint64_t*`, which then gets dereferenced in `foo()`. This compiles to
foo:
mov rax, QWORD PTR .LC0[rip]
mov QWORD PTR [rdi], 0
mov QWORD PTR [rsi], rax
xor eax, eax
ret
bar:
mov rax, QWORD PTR .LC0[rip]
mov QWORD PTR [rdi], rax
xor eax, eax
ret
Notice how `bar` returns 0 (that is, `xor eax, eax`) even though `x` and `y` inside `foo()` point to the same memory address, and the last write to that memory address comes from `y->vpid = 1`. Hence `bar()` should logically return `0x100000001` but doesn't, since you've violated the strict aliasing rules.
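For comparison, here is a sketch of a well-defined alternative (not the actual OMPI fix): the 64-bit key is built by copying bytes, so no `uint64_t` lvalue ever aliases the struct and the compiler has nothing to mis-assume.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t jobid;
    uint32_t vpid;
} opal_process_name_t;

/* Sketch: build the 64-bit key by copying the bytes instead of
 * dereferencing a cast pointer. */
uint64_t name_to_key(const opal_process_name_t *y) {
    uint64_t key;
    memcpy(&key, y, sizeof key);   /* typically a single 64-bit load on GCC/Clang at -O2 */
    return key;
}

uint64_t bar_fixed(opal_process_name_t *y) {
    y->jobid = 1;
    y->vpid  = 1;
    return name_to_key(y);         /* yields 0x100000001 on little-endian targets */
}
```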
I believe much of the issue here has been due to miscommunication, at least for this poor mortal. I realize you folks spend many hours/days talking regularly amongst yourselves about these things and have therefore developed a clear understanding of the context surrounding these issues you are trying to raise, but I certainly haven't - and I suspect many of the rest of us are in a similar situation.
Simple example: it took me a few readings to finally figure out that your "UB" didn't stand for "Upper Bound", which is the most common meaning for that acronym. Likewise, I am still trying to fully grok your use of "OP" - which I think means "Other People" as opposed to "OutPatient" (as the medical community has adopted)? Not entirely sure. If you are going to use acronyms freely in your discussion, it helps to at least define them once early on to ensure others understand what you are saying - it's why we require you to do so in, for example, publications.
So after going back and re-reading all the exchanges between the several issues spanning two projects, here is the gist I've tried to assemble.
This entire discussion appears to have been about difficulties in optimizing code. Optimizers have problems correctly interpreting various historically-permitted C operations, and you are saying that the C standard has tried to evolve to make it easier for the optimizers to succeed. This I can grok - over the 50+ years I've been programming, optimizers have always had difficulty ensuring that the resulting optimized code actually resulted in the original behavior. Most commonly, cranking up the -O value could result in incorrect behavior - and sometimes cranking it up even more would cause it to get the right behavior. I believe this is what you mean by "undefined behavior", yes? In other words, the behavior of the written code is in fact well-defined, but the optimized behavior can vary across optimizers - and so perhaps "undefined optimized behavior" would be a little clearer.
So to summarize, the issue here isn't that some of the code you've referenced isn't "defined" - those code patterns have been allowed by the C standard since its inception and are widely used. What you are raising is that some (many?) optimizers have problems properly parsing those patterns into something that preserves the original result - in other words, they may result in faster code, but the resulting behavior isn't what the programmer intended. You at least imply that the C standard has been modified to perhaps not necessarily forbid them - after all, compilers continue to accept those code patterns, and I rather doubt that is going to change anytime soon as it would break a lot of existing code - but to at least recommend alternative approaches.
Hence, you are suggesting that we utilize different coding patterns that (a) produce the same resulting behavior, but (b) are easier for optimizers to parse while enhancing the likelihood of getting the same behavior out of the optimized code. Correct?
If so, then that is useful advice. I confess that the programming community used to take the position of insisting that the problem lies in the optimizer and not the original code, and so improving/fixing the optimizer is the correct response. However, optimizer developers may have simply "hit the wall" and convinced the C standard community to help by modifying the standard - not all of us watch every move of that body. Could well have missed something, especially if the compilers don't reflect those changes (which, if I'm understanding your comments, they clearly don't at this time since we are not seeing warnings or errors generated by those code patterns across a very wide range of compiler families and versions).
In the cases where it does what you actually wanted it to do, a memcpy compiled by a good compiler such as GCC produces the same machine code as the cast. memcpy is guaranteed to avoid UB. GCC guarantees your memcpy is efficient.
I personally find these kinds of statements disturbing. You imply that GCC is always a "good" compiler, therefore implying that other compilers are not - but that is an opinion not necessarily universally shared, especially across all versions of GCC. Basing an entire project on what someone thinks about GCC vs clang and other compilers would be irresponsible. I think this argument would be more convincing if one could provide a guarantee that "memcpy produces the same machine code as the cast across all optimizers and is at least as efficient as the cast". For people where nanoseconds matter and whose projects must support all compilers, these are concerns.
If all we are doing is making things easier for GCC and some subset of optimizers, then that isn't terribly convincing - especially given the magnitude of the proposed code disruption.
@rhc54 you're 90% there. Let's take the simplest example:
int* p = NULL;
printf("int = %d", *p);
This is probably the simplest case of undefined behavior. Once you encounter undefined behavior, all bets are off and compilers can do whatever they like. This has been the case since the first C standard (C89). Historically, this code would usually segfault (in a hosted environment), but in this day and age I have seen compilers optimize out the null-pointer dereference (and the `printf()` statement too). Simply said, dereferencing a null pointer is undefined behavior and anything can happen.
Back in 1989, compilers were a simpler thing, and only with stuff like SSA and deeper optimization passes did compilers manage to tease out more performance from code. Here's another example:
#include <limits.h>
#include <stdio.h>
int main() {
for (int i = INT_MAX - 2; i >= 0; ++i) {
printf("i = %d\n", i);
}
}
When compiling with `-O0`, you get the intuitive output:
$ ./a.out
i = 2147483645
i = 2147483646
i = 2147483647
Compiling with `-O2`, GCC even warns about the undefined behavior:
$ gcc -O2 ub.c
ub.c: In function ‘main’:
ub.c:5:43: warning: iteration 2 invokes undefined behavior [-Waggressive-loop-optimizations]
5 | for (int i = INT_MAX - 2; i >= 0; ++i) {
| ^~~
ub.c:5:37: note: within this loop
5 | for (int i = INT_MAX - 2; i >= 0; ++i) {
| ~~^~~~
$ ./a.out
i = 2147483645
i = 2147483646
i = 2147483647
<runs an infinite loop>
What's the problem with this code? Signed (but not unsigned!) overflow is undefined behavior, and has been undefined behavior since the very first C89 standard. You're starting at a very large positive value (`INT_MAX - 2`) and are incrementing it until it becomes negative. Unfortunately, since signed overflow is undefined behavior, GCC is allowed to assume it never happens, and therefore turns the loop into an infinite loop, since every set of logic steps to stop the loop requires invoking undefined behavior (Clang does this too, just FYI). Signed overflow undefined behavior has caused a lot of code to fail back when GCC 4 came out.
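To make the signed/unsigned distinction concrete, a small sketch: unsigned arithmetic wraps modulo 2^N by definition, while the signed equivalent is the kind of UB the optimizer may exploit.

```c
#include <limits.h>
#include <stdio.h>

int main(void) {
    /* Well defined: unsigned arithmetic wraps modulo 2^N, so this prints
     * "u + 1 = 0" on every conforming implementation. */
    unsigned int u = UINT_MAX;
    printf("u + 1 = %u\n", u + 1u);

    /* Undefined behavior: the compiler may assume INT_MAX + 1 never happens
     * and optimize accordingly, exactly as in the loop above. */
    /* int s = INT_MAX; printf("s + 1 = %d\n", s + 1); */
    return 0;
}
```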
The UB we're discussing here is around type punning. Historically, people have treated C as "portable assembly", which it never was. People assumed they could take a `float*` pointer, cast it to a `uint32_t*` pointer, and then dereference it. This has never been valid; even C89 calls this undefined behavior. The problem people are seeing now is that compilers are becoming more and more aggressive in optimizing, and exposing more undefined behavior that just "happened" to work a few years ago. This is why "this code has worked for 20 years" is such a weak argument, because undefined behavior is insidious and oftentimes manifests only with newer compilers.
Compilers implement the C standard to the word, and most C developers have never read the actual C standard. The code I have shown in https://github.com/open-mpi/ompi/issues/12675#issuecomment-2244618620 is an example of UB where type punning through pointer casts leads to a subtle logic bug in 10 lines of code. As distro maintainers, we are sensitive to these issues, since we have to debug this sort of code every time a new GCC/Clang major version comes out.
I would certainly agree that dereferencing a pointer that went from float to uint32_t is invalid - one can clearly see that is a mistake.
I'll set aside the "portable assembly" comment - I recall sitting through some discussions where that definitely was the focus at the time, but it's somewhat irrelevant here.
I think looking deeper into the code for issues is a fine thing, but we have to give people some means for doing so. I've complained about the large volume of warnings OMPI currently (and historically) generates for quite some time, to little avail as the community hasn't seen corresponding failures and/or performance issues (and people are motivated by failures and performance). In PMIx and PRRTE, I set `-Werror` by default for developers and that helps a bit - though it has caused some unhappiness over here as OMPI embeds copies of those codes.
So motivation and means will be issues that merit some thought.
If I correctly grok your comments, it would appear that you would like us to add optimized compiles to the standard testing to try and tease out more of these issues? Is that a correct statement? I know my automated testing does not include `-Ox` flags, and so perhaps it might miss some of these things you have raised (sounds like it would have caught examples like the above if `-Ox` is combined with `-Werror`).
Do you have any suggestions on how we might better surface the problems you raise? The code base is too large for manual review, and if the compilers don't flag the problems, then they won't be found. Is running the code thru `-Ox -Werror` sufficient? What `x` levels should be used, and are there differences across compiler families/versions?
1. `-Werror` in CI is fine up to a point. That is, you need to use flags that have a guarantee of a 0% false positive rate. `-Wall` and `-Wextra` in general are safe, but there's a bunch of flags that have (lots, in my experience) false positives, such as `-Wnull-dereference`, `-Wmaybe-uninitialized` and a bunch of others.
2. If you're going with `-Werror`, you'll have to fix the issue here too (i.e. casting an `opal_process_name_t*` to a `uint64_t*` and then dereferencing it). In general, most cases of `-Wstrict-aliasing` can be fixed somewhat easily, since they usually only exist in implementation code, not in interface/API/ABI code. The fixes are simple:
   * `memcpy()`, which GCC/Clang/MSVC/new-and-old-Intel all optimize down to efficient machine code
   * writing into an `opal_process_name_t` field of a union and reading from a `uint64_t` field (importantly, you have to shuttle it through the union, you aren't allowed to alias a pointer to the union); see the sketch after this list.
3. Increase "instrumentation" by adding UBSAN and ASAN to catch different kinds of undefined behavior during CI. The issue here likely won't be caught, since it's a violation of the type rules, whereas UBSAN in general only catches "arithmetic" type UB (division by zero, signed overflow, dereferencing null pointers, buffer overflows, etc).
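A sketch of the union approach mentioned in the second bullet (hypothetical names, not the actual OMPI code):

```c
#include <stdint.h>

typedef struct {
    uint32_t jobid;
    uint32_t vpid;
} opal_process_name_t;

/* Type punning through a union: write one member, read the other, and
 * always access the data through the union object itself (never through
 * an outside pointer that aliases one of its members). */
typedef union {
    opal_process_name_t name;
    uint64_t            key;
} name_key_u;

uint64_t name_to_key_union(opal_process_name_t n) {
    name_key_u u;
    u.name = n;      /* write through the opal_process_name_t member */
    return u.key;    /* read through the uint64_t member */
}
```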
Simple example: it took me a few readings to finally figure out that your "UB" didn't stand for "Upper Bound", which is the most common meaning for that acronym. Likewise, I am still trying to fully grok your use of "OP" - which I think means "Other People" as opposed to "OutPatient" (as the medical community has adopted)? Not entirely sure. If you are going to use acronyms freely in your discussion, it helps to at least define them once early on to ensure others understand what you are saying - it's why we require you to do so in, for example, publications.
My apologies.
OP is a somewhat common term in online forums (which GitHub issues could be considered to be) for "Original Post".
UB is an acronym for an ISO C standards document term, "Undefined Behavior".
This entire discussion appears to have been about difficulties in optimizing code. Optimizers have problems correctly interpreting various historically-permitted C operations, and your are saying that the C standard has tried to evolve to make it easier for the optimizers to succeed.
Not entirely, but similar. Optimizers have problems with "doing what you mean" rather than "doing what you said". The C standard hasn't evolved at all -- the compilers do exactly what the ISO C99 standard text permits them to do if you compile with -std=c99. What has changed is that users want compilers to "be better" and "generate faster programs", and compiler authors have obliged by redesigning their compilers to draw more and more inferences about what your code might have meant. Frequently this means pedantically reading the standard's text and taking advantage of various forms that the standard has always said are Undefined Behavior to outright delete code that the compiler thinks cannot ever be reached.
Which leads us to...
If so, then that is useful advice. I confess that the programming community used to take the position of insisting that the problem lies in the optimizer and not the original code, and so improving/fixing the optimizer is the correct response. However, optimizer developers may have simply "hit the wall" and convinced the C standard community to help by modifying the standard - not all of us watch every move of that body
Improving the optimizer here isn't an option. The optimizer developers are, correctly, stating that they are allowed by the original published text of the standard to do what they're doing today, and that the current situation is what "improving the optimizer" means, because they have improved it to generate faster code, all the time.
I personally find these kinds of statements disturbing. You imply that GCC is always a "good" compiler, therefore implying that other compilers are not - but that is an opinion not necessarily universally shared, especially across all versions of GCC.
Sorry, to clarify what I mean is that gcc is an example of a good compiler, and so is clang -- I don't happen to personally use clang so I didn't happen to mention it.
I have some skepticism that tinycc is very good, but on the other hand I don't think anyone alive uses tinycc because they think it's good -- rather because they think it is tiny and they like tiny things.
I'm also skeptical about Microsoft's MSVC; in fact, I recall someone complaining that MSVC generated several machine code instructions rather than one when using the memcpy approach. But fortunately openmpi doesn't care about that, because Windows support was dropped over a decade ago, and even if it hadn't been, using MSVC from autotools is kind of painful and using a mingw compilation environment with a choice of gcc or clang is pretty common for Windows.
`-Werror` in CI is fine up to a point. That is, you need to use flags that have a guarantee of a 0% false positive rate. `-Wall` and `-Wextra` in general are safe, but there's a bunch of flags that have (lots in my experience) false positives, such as `-Wnull-dereference`, `-Wmaybe-uninitialized` and a bunch of others.
I don't believe a 0% false positive rate is achievable - if the community decides to pursue this, then they will have to accept some need for triage of the results. The issue this creates is that you cannot use it in CI, else we will never pass another PR. The community will have to work out the logistics - perhaps a "non-required" CI would suffice, though that means someone has to look at its output before deciding "pass/fail". Extra burden, may not be acceptable.
2. If you're going with `-Werror`, you'll have to fix the issue here too (i.e. casting a `opal_process_name_t*` to a `uint64_t*` and then dereferencing it). In general, most cases of `-Wstrict-aliasing` can be fixed somewhat easily, since they usually only exist in implementation code, not in interface/API/ABI code. The fixes are simple:
   * `memcpy()` which GCC/Clang/MSVC/new-and-old-Intel all optimize down to efficient machine code
   * writing into a `opal_process_name_t` field of a union and reading from a `uint64_t` field (importantly, you have to shuttle it through the union, you aren't allowed to alias a pointer to the union).
I'll leave that for the OMPI community to debate - I can see some of their point too, but it is up to them.
3. Increase "instrumentation" by adding UBSAN and ASAN to catch different kinds of undefined behavior during CI. The issue here likely won't be caught, since it's a violation of the type rules, whereas UBSAN in general only catches "arithmetic" type UB (division by zero, signed overflow, dereferencing null pointers, buffer overflows, etc).
Worth at least giving it a smoke test to see (a) what it catches in the current code, and (b) what problems it creates. I've only had light experience with ASAN (none with UBSAN) and it has had mixed results. Still, worth taking a look at it.
Thanks!
Improving the optimizer here isn't an option. The optimizer developers are, correctly, stating that they are allowed by the original published text of the standard to do what they're doing today, and that the current situation is what "improving the optimizer" means, because they have improved it to generate faster code, all the time.
So having the optimizer output garbage is acceptable to the optimizer developers because "it is faster garbage"? I find it very hard to believe that the community would behave that way. If that truly is their attitude, then perhaps we should discourage people from turning them on 🤷♂️
Seems to me that a better approach would be for the compiler, optimizer, and user community work together to ensure that optimized code is in fact correctly reproducing the intended behavior.
A 0% false positive rate is absolutely achievable. `-Wstrict-aliasing`, for instance, only warns on pointer conversions where it knows with 100% certainty that `T*` cannot safely be converted to `U*` and then dereferenced. I have not seen a single instance of `-Wstrict-aliasing` that was a false positive.
It took us a year, but our corporate codebases have no ASAN and UBSAN hits. In fact, I don't ever recall a single ASAN or UBSAN false positive in the 10 years I've been using them (which is also a declared goal). That said, there are still a bunch of false negatives, and I have observed these.
I find it very hard to believe that the community would behave that way.
Have you read the GCC bug reports I linked to? The problem remains: most C programmers don't know C, but dangerously think they do.
So having the optimizer output garbage is acceptable to the optimizer developers because "it is faster garbage"? I find it very hard to believe that the community would behave that way. If that truly is their attitude, then perhaps we should discourage people from turning them on 🤷♂️
Their viewpoint is that "this is the Garbage In, Garbage Out principle in action -- simply write C code that is valid according to the original, unchanged published text of the standard, and it won't output garbage".
Seems to me that a better approach would be for the compiler, optimizer, and user community work together to ensure that optimized code is in fact correctly reproducing the intended behavior.
That would be a "Do What I Mean compiler". I heard that chatgpt is working to solve the age-old programming task of figuring out what "meaning" means.
Compilers don't currently include an AI component, and try to work with what you said, rather than what you meant. Considering the accuracy problems with AI today, I'm relieved.
...
That being said, there's only two communities here: the compiler (which is the optimizer), and the users. The user community has spent a decade asking the compiler developers to make a faster compiler...
A 0% false positive rate is absolutely achievable.
Depending upon the precise flags you set, and how willing you are to not check for certain problem types. Your example is focused on the one thing you are currently concerned about, but I am sure I can find others who would just as ardently point to some other issue they care about.
have you read the GCC bugreports I linked to? The problem remains: most C programmers don't know C, but dangerously think they do.
I fail to understand how this relates to what I said - I was talking about "the optimizer developer community" and them not caring about generating garbage code.
I think your comment about C programmers is rather prejudicial and somewhat derogatory, so let's perhaps leave such opinions unsaid in the future?
Okay, I've had enough - this has degenerated into something sad. I'll think about some of the good things in the thread and try to ignore where things have fallen off the track. Adios.
Hi, I just want to clarify this. In general, https://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html is a superb read on "why does the compiler produce bad output if it knows the input is bad", written by the creator of LLVM.
So having the optimizer output garbage is acceptable to the optimizer developers because "it is faster garbage"? I find it very hard to believe that the community would behave that way. If that truly is their attitude, then perhaps we should discourage people from turning them on 🤷♂️
The reason is that optimizers do not know that their input is garbage, nor that their output is garbage with respect to the intent of the programmer. Optimizers do not operate on the semantics of the programming language, but on their own internal representation (IR). Compiler IR has its own ruleset of legal behaviour that is often similar to C - alignment requirements, null pointers, type aliasing, etc.
A huge part of optimizers is eliminating dead branches and expressions. For these purposes, any branch that exhibits UB is inferred to not be taken. During lowering from an input language like C to compiler IR, the language frontend emits a lot of those branches - it's simply what the idiomatic representation of a C expression looks like in IR. This becomes especially relevant with inlining, where combining the body of two functions may reveal a plethora of branches to be invalid. This means that the optimizer cannot discern between "illegal branch because the input is UB" and "illegal branch because extra context from inlining and such revealed it to be so".
There have been attempts to propagate the required information from the frontend to the optimizer, but they have all ended up being woefully inefficient and incomplete (for example because you lose most high-level type and control flow information when lowering to IR), and semantic analysis on the input language as parsed by the frontend has proven to be much more effective. This is why the optimizer can't just know not to "output garbage" on UB input - it's simply a completely different domain from the language frontend. Instead, it's much easier for the frontend to detect UB before emitting invalid IR, as the frontend retains all the relevant context and specific knowledge about language semantics. This has been the case for well over two decades for all compilers that OpenMPI supports, as well as for the compilers of "safe" languages like Rust, C# or Java - it's how the vast majority of compilers operate.
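To illustrate the inlining point, a minimal sketch (not from the OMPI code base): once the body of `load()` is inlined into `use()`, the dereference comes before the null check, so the optimizer may legally infer `p != NULL` and delete the branch.

```c
#include <stdio.h>
#include <stdlib.h>

static int load(const int *p) {
    return *p;               /* dereferencing p lets the optimizer infer p != NULL */
}

int use(const int *p) {
    int v = load(p);         /* after inlining, the dereference precedes the check below */
    if (p == NULL) {         /* ...so an optimizer may delete this entire branch */
        fprintf(stderr, "p was NULL\n");
        abort();
    }
    return v;
}
```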
Seems to me that a better approach would be for the compiler, optimizer, and user community work together to ensure that optimized code is in fact correctly reproducing the intended behavior.
They do, which is why diagnostics like `-Werror=strict-aliasing`, and runtime tools like `asan` or `ubsan`, exist.
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
5.0.3 tarball
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Using Gentoo Portage (I am currently working to upgrade Gentoo from openmpi 4.1.6 to 5.0.3).
Details of the problem
I... had originally tried to build with the following *FLAGS to optimize the build:
-flto=4 -Werror=odr -Werror=lto-type-mismatch -Werror=strict-aliasing
The -Werror=* flags are important to detect cases where the compiler can try to optimize based on assuming UB cannot happen, and miscompile code that has UB in it. strict-aliasing issues are always bad but LTO can make them even worse.
Due to #12674 I cannot configure with LTO. I tried disabling LTO to see if there are any more errors that need to be fixed in parallel to the configure issue.
I got this error:
There are actually a lot more like this. Github won't let me post them all because "There was an error creating your Issue: body is too long, body is too long (maximum is 65536 characters)." Heh. ;)
The remaining errors can be found here: Full build log: build.log