isocpp / CppCoreGuidelines

The C++ Core Guidelines are a set of tried-and-true guidelines, rules, and best practices about coding in C++
http://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines

CP.8 is disturbing #1231

Closed imre-palik closed 6 years ago

imre-palik commented 6 years ago

Would it be possible to replace this rule with something like:

Use atomic loads and stores with memory_order_relaxed instead of volatile

Reason: This describes the intent better, and makes the code well defined instead of implementation-defined

cubbimew commented 6 years ago

Use atomic loads and stores with memory_order_relaxed instead of volatile

That would imply that all uses of the word volatile in other programming languages would map to relaxed atomics in C++, but a simple counter-example is Java, whose "volatile" is a barrier.

Besides, the guidelines generally discourage atomics other than seq_cst (which is what CP.8 uses)

makes the code well defined instead of implementation-defined

nothing implementation-defined about it: it's undefined behavior to race on a volatile int, just like on any other non-atomic int.

imre-palik commented 6 years ago

What does the meaning of volatile in other programming languages have to do with historic (and relatively common) uses of volatile in C and C++?

If you wonder why C matters here but other languages don't, then please have a look at 10.1.7.1.6 here: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/n4659.pdf

At the same place, 10.1.7.5 claims the semantics of volatile access is implementation defined, not undefined.

cubbimew commented 6 years ago

What does the meaning of volatile in other programming languages has to do with

the first sentence of CP.8 begins "In C++, unlike some other languages, volatile..."

historic (and relatively common) uses of volatile in C and C++?

Historic use of volatile in C, since the day ANSI C committee invented that keyword, has been to access MMIO. CP.8 is not about how to write a device driver (that's CP.200), CP.8 warns specifically about erroneously applying a concept used by other programming languages to C++ (or to C for that matter). The bad example shown in CP.8 has undefined behavior, not because it's a volatile access, but because it's a data race.

imre-palik commented 6 years ago

Historic use of volatile in the context of concurrency: https://elixir.bootlin.com/linux/v4.9/source/include/linux/compiler.h#L226

I saw similar use in various codebases, written in C++.

As for other languages, I still don't see the relevance here. I thought this guide is about writing code in C++, not about mapping constructs from other languages to C++. So the usage of a keyword in existing, legacy C++ should matter more than how that keyword is used in other languages.

hsutter commented 6 years ago

CP.8 is correct, and it's an essential guideline exactly because some people don't know it yet. :) volatile was never a good idea for concurrency (inter-thread synchronization), even if some compilers/systems tried to make it work back in the days before we had atomic which is the right tool for that job. Note that even in C, volatile should not be used for concurrency, that's why C11 added _Atomic.

For full details about this issue, please see (and please tell the world to read) my 2009 article "volatile vs. volatile".

hsutter commented 6 years ago

Also, I was just made aware of this true gem: http://isvolatileusefulwiththreads.com

Thanks, Jonathan!

imre-palik commented 6 years ago

I was not arguing that volatile should be used for concurrency. But in light of the existing pre-C11/C++11 codebase that successfully uses volatile for concurrency, I find the Reason part of the rule disturbing.

jwakely commented 6 years ago

For some value of "successfully"

imre-palik commented 6 years ago

I would be really happy if all concurrent code would work as well as the core parts of the Linux kernel.

jwakely commented 6 years ago

Unless I'm mistaken, the kernel code you showed isn't using volatile for concurrency, it's using it to prevent the compiler from reordering reads and writes.

 * Their two major use cases are: (1) Mediating communication between
 * process-level code and irq/NMI handlers, all running on the same CPU,
 * and (2) Ensuring that the compiler does not  fold, spindle, or otherwise
 * mutilate accesses that either do not require ordering or that interact
 * with an explicit memory barrier or atomic instruction that provides the
 * required ordering.

N.B. "all running on the same CPU" i.e. not concurrent, and "do not require ordering" or interact with accesses that are ordered using other (correct) techniques.

The volatile is there to ensure reads and writes happen exactly as intended, not to be correct for concurrency.

MikeGitb commented 6 years ago

We all know that volatile variables are not thread-safe according to the ISO C++ standard, but can we please not pretend that our code is compiled by a document, or that people didn't write multithreaded C++ programs prior to C++11 just because the ISO C++ standard didn't have a proper memory model back then?

On MSVC, volatile actually does synchronization by default (you can turn that behavior off), and IIRC writes and reads to (properly aligned) volatile int and similar are equivalent to relaxed atomic loads and stores on gcc -- at least for x86 (in theory atomic accesses can be optimized better, but in practice that wasn't done last time I checked).

That is no reason to use volatile if you have atomics, but why do we need to pretend that any use of volatile in the context of threading was automatically an error?

jwakely commented 6 years ago

Sure, some MSVC codebases used volatile for concurrency, but the comments above refer to "pre C*11 codebases", not MSVC codebases. The Linux kernel isn't built with MSVC!

https://www.kernel.org/doc/html/latest/process/volatile-considered-harmful.html

CaseyCarter commented 6 years ago

The Linux kernel isn't built with MSVC!

Hey, we don't do type based alias analysis - I bet Linus would love MSVC!

imre-palik commented 6 years ago

This is the relevant paper you might want to read: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0124r5.html The code I linked above is part of the implementation of READ_ONCE().

While reading the above mentioned paper, I had to realise a few things:

1) volatile is made to work with mmio registers. But this implies atomicity at least in case of the sizes and alignments supported by mmio registers on the given platform. This assumption is what legacy code uses for multithreading.

2) According to Section 4.7 of the standard, using volatile non-atomic access without any other synchronisation is undefined behaviour. I hope that 10.1.7.5 trumps this, and I am relatively sure that most compilers won't do anything crazy here, because they try not to break the legacy code base.

3) If I am reading Section 4.7 properly, then you cannot replace legacy volatile loads with memory_order_relaxed atomic loads; you would need memory_order_relaxed volatile atomic loads to be on the safe side of the standard. (Sorry, I was wrong there.)

So that's why we cannot have nice things.

Accidentally this also means that CP.8 is at most 90% correct, and the reasoning given there is disturbing.

I would recommend rephrasing the rule to something like this:

Try to avoid using volatiles for synchronisation

Reason: When used for synchronisation, its meaning is very subtle, and well beyond the understanding of mere mortals.

Note: If you absolutely have to use volatiles for synchronisation, then wrap them in a layer of abstraction, so that mere mortals won't get confused by them.

hsutter commented 6 years ago

Please, see my 2009 article "volatile vs. volatile". I cover all of this, including that volatile accesses have never been guaranteed to be atomic.

This is not correct:

volatile is made to work with mmio registers. But this implies atomicity at least in case of the sizes and alignments supported by mmio registers on the given platform.

No, volatile never has anything to do with atomicity, only with ordering: it disables all compiler inventing/eliding/reordering of volatile reads and writes. Unrelatedly, on your hardware the minimum size of an mmio read/write happens to be at least the size of an mmio register and so is atomic, and that's fine, but that atomicity would hold even if you had omitted the volatile.

Try to avoid using volatiles for synchronisation Reason: When used for synchronisation, its meaning is very subtle, and well beyond the understanding of mere mortals.

No. volatile cannot be used for inter-thread synchronization, full stop. Trying to do so is not just subtle, it is subtly broken and your code does not actually work, only you may not notice at first because the failure mode is "compiles and appears to work" as usual for a data race.

jwakely commented 6 years ago

P0124R5 explicitly says "please note that these macros are defined to work properly only for properly aligned machine-word-sized variables" and that's how atomicity is guaranteed for those accesses, it has nothing to do with volatile, nothing at all.

According to Section 4.7 of the standard, using volatile non-atomic access, without any other synchronisation is undefined behaviour.

I'm not sure what you're referring to precisely (because that statement isn't true in general, only if there are potentially concurrent conflicting non-atomic accesses) and that's the same for non-volatile non-atomic access, i.e. volatile makes no difference.

I hope that 10.1.7.5 trumps this

10.1.7.5 is [dcl.type.class.deduct] but I don't think that's the subclause you're referring to.

imre-palik commented 6 years ago

When I last worked with mmio registers (way before hsutter's article) it was widely assumed that properly aligned, register-sized volatile mmio reads/writes are atomic. (And as far as I know, nobody died because of that piece of code.) As the compiler has no idea whether a given load/store goes to a real mmio register or to memory, the compiler has to provide the same atomicity guarantees for memory reads/writes as it does for mmio reads/writes. So I'd say that in pre-C11/C++11 code, people have the expectation that volatile loads and stores should map to a single instruction whenever the hardware supports that. And this is the "otherwise all hell would break out in the physical world" kind of expectation. As far as I understand, this expectation maps to memory_order_relaxed.

Because of the "otherwise all hell would break out in the physical world" clause, I think this is a somewhat stronger assumption than "our compiler always played nice, and compiled register-sized accesses to a single instruction". Especially because without the volatile, you cannot always be sure that the load/store actually hits the memory. Or even that it will eventually hit the memory, which [atomics.order] seems to require for atomics.

BTW, when I first read hsutter's article in 2009, I took it at face value without thinking much about it. But in the last five years, I started seeing all sorts of concurrent algorithm implementations that used volatiles. So I started wondering how this can work. Anyway, it seems there is a huge codebase out there that uses volatiles for concurrency. And I am too humble to assume that all of it is crap. That's why I find the reasoning of CP.8 disturbing. It should refer to the existing C++ codebase, instead of other languages.

cubbimew commented 6 years ago

As far as I understand, this expectation maps to memory_order_relaxed.

Are you suggesting that the authors of device drivers that use MMIO should switch from combining volatile with pgprot_noncached/ioremap_nocache/however-their-OS-disables-caching to using atomics and memory_order_relaxed? It might actually work... but it's very much not the topic of CP.8.

In the last five years, I started seeing all sort of concurrent algorithm implementations, that used volatiles

do you have examples? As described, they either don't work (and CP.8 explains why) or are MSVC-only non-portable.

MikeGitb commented 6 years ago

@imre-palik: Just to be clear: volatile does not provide any atomicity guarantees at all, it just prevents the compiler from adding or removing any reads/writes. However, a write to a properly aligned int is atomic on almost all platforms I've worked on (with the exception of some 8-bit microcontrollers) - but mind you that an increment of a volatile variable is not atomic, while it is on std::atomic_int. Anyway, that gives you two of the three properties you usually need in MT. The third - memory ordering - needs to be added manually (e.g. by a memory fence intrinsic).

My gut feeling is also that more people use volatile because they have used/seen it in pre-C++11/C11 code than because they assume it works the same as in e.g. Java.

However: Does it really matter what reason is given here as long as you agree with the conclusion? There are imho far more serious problems in this document.

hsutter commented 6 years ago

To be clear, there is an important problem that has not been mentioned on this comment thread, but is mentioned in my volatile vs. volatile article: Volatile only prevents reordering volatile accesses w.r.t. each other, it does not prevent reordering ordinary memory accesses with a volatile access.

That totally breaks volatile for inter-thread communication in even some of the simplest examples:

Example 1: aligned short ints x and y, initially x==0 && y==0

Thread 1: x = 1; y = 1;
Thread 2: if (y==1) assert(x==1);

Example 2: widget* p, initially p==nullptr

Thread 1: tmp = new widget(); p = tmp;
Thread 2: if (p) use(*p);

Q: Are these examples correctly synchronized, if y and p are: (a) volatile? (b) atomic?

(a) No. Example 1's assertion can fail (note this would be true even if x were also volatile! see below), and Example 2 thread 2 can see a partly-constructed object. [mutter Yes there was a time when MSVC tried to make these work but it only worked for certain examples and was the wrong thing for us to do and we've fixed it do not attempt this mutter]

(b) Yes.

volatile does not work for inter-thread communication.

A final attempt to skate by and "make it work"...

Question: Let's say you try to make volatile "work" for inter-thread synchronization by doing all of the following, very carefully:

- relying on volatile only to stop the compiler from inventing/eliding/reordering the volatile accesses with respect to each other;
- using only naturally aligned, register-sized variables, so that each individual access is atomic on your hardware;
- never touching any other (non-volatile) shared variables as part of the communication.

In that very restricted case, can you then do very simple communication between threads using two volatile variables only, because you know you're getting atomicity via natural alignment and you're getting ordering w.r.t. the volatile accesses (only) and you're being careful not to look at any other variables?

Answer: No, not even then, because the first bullet only applies to compiler reordering. volatile does not cause the compiler to generate any special ordered memory instructions, and so because the instruction stream seen by the hardware contains only ordinary memory operations, volatile in the source code does not prevent the processor or cache effects from effectively reordering volatile accesses at the hardware level later. (Of course, the processor wouldn't do that for special addresses it recognizes otherwise volatile wouldn't work for things like mmio, but for ordinary memory the processor has no idea it was expected to do something special.)

The only case I can see where you might be able to skate by in this way without writing a race condition and undefined behavior is if there is only a single volatile involved in the communication (and nothing else). But it can't work when using even two volatile variables (and nothing else) together across threads.

imre-palik commented 6 years ago

I cannot give any real-life examples other than the Linux kernel, as they tend to be proprietary, closed-source stuff. You can find some textbook examples in my favorite book on parallel programming: git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git

As for the reordering: If you need some safeguards against reordering, you have to fence your accesses. As far as I understand, you have exactly the same issues with memory_order_relaxed loads/stores as with volatiles. So claiming that you don't have this issue with atomics is at most 84% true. But all this is parallelism 101, isn't it? (Why do you assume I don't know the basics?)

BTW, perfbook seems to agree with me on volatiles: https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git/tree/count/count.tex#n541

imre-palik commented 6 years ago

@MikeGitb I think it matters, as it puts the poor soul who has to maintain such a piece of code on a completely wrong track when he tries to understand what is happening. I think advice that amounts to "whoever wrote this code you have to maintain was obviously an idiot" is almost always wrong.

imre-palik commented 6 years ago

Another possibly important point is the difference between atomic access and volatile atomic access. I don't have a good understanding of it, and I could definitely use some good advice around it.

And again, this rule is far from being helpful there. Especially since my current understanding is that most of the time I would like to have volatile atomic access instead of atomic access. And the reasoning of "don't do it because of Java" somehow doesn't cut it...

jwakely commented 6 years ago

Most of the time you just want atomic access. You only need volatile atomic access if you want to use atomics with MMIO. Personally I think the volatile overloads in std::atomic were a mistake, they're 50% of the interface and needed 0.001% of the time.

hsutter commented 6 years ago

@imre-palik I covered that in my article too, in the last sentence:

Finally, to express a variable that both has unusual semantics and has any or all of the atomicity and/or ordering guarantees needed for lock-free coding, only the ISO C++0x draft Standard provides a direct way to spell it: volatile atomic.

This is such a rare case that I've never heard of an example in the wild, and I don't know whether it actually exists. Here's how to think about it: The variable has to be volatile if it's being used to communicate with something that is "not C++" (typically, hardware), and it has to be atomic if it's being used to communicate with another C++ thread in the same program. It is extremely rare that the same memory location would be used to communicate with both of those kinds of entities at the same time, and I've never come across a realistic example where that was wanted and could actually work, where using only volatile by itself or only atomic by itself was not the correct answer.

hsutter commented 6 years ago

PS: One last note regarding the question...

the difference between atomic access and volatile atomic access. I don't have a good understanding of it, and I could definitely use some good advice around it.

... interestingly, in a Google search for either "volatile atomic" or "atomic volatile" (with quotes), the next-to-top hit as of this writing is my article.

MikeGitb commented 6 years ago

@hsutter: I never used it, but didn't a progress variable technically need to be volatile + atomic, because the compiler was in principle allowed to merge consecutive release stores to atomics? So this:

// global
std::atomic_int progress;

for (int i = 0; i < 101; ++i) {
    // some work
    progress.store(i, std::memory_order_release);
}

could be optimized (if "some work" doesn't perform synchronization itself) to this:

for (int i = 0; i < 101; ++i) {
    // some work
}
progress.store(100, std::memory_order_release);

IIRC you discouraged such optimizations in recent standards and I don't know if any compiler ever did such an optimization.

hsutter commented 6 years ago

@MikeGitb That optimization is totally valid. In fact, it's the specific example I give in my "volatile vs volatile" article's summary Table 1 that shows the three major differences between atomic and volatile, the last of which is that the compiler is allowed to optimize the atomic accesses and the example given is "such as combining two adjacent stores to the same location" which is your example.

The reason that it's valid is because we can't construct a program that can tell the difference between your first and second code alternatives. Consider: Even if another thread is polling as madly fast as it can to read progress, it isn't guaranteed to see all (or any) of the interim updates anyway because in any execution the above setting thread could run arbitrarily faster than the polling thread, even if the setting thread actually performed all the updates. In short, there is always a valid execution where the polling thread doesn't see any/all of the updates but the last one, and the principle is that a valid optimization is always allowed to reduce the set of possible executions. So the optimization is simply "as if the calling thread always runs faster than all pollers" which is a valid execution.

If you want a polling thread to see individual updates, you have to do additional synchronization (e.g., to block the setting thread so the setter and the poller take turns, or have the setter push the updates into a message queue and have the poller receive them one by one).

As you say, I don't know that compilers actually optimize atomics like that, but they are allowed to and I don't think I ever discouraged the optimization (I do remember encouraging it).

jwakely commented 6 years ago

If you want a polling thread to see individual updates, you have to do additional synchronization (e.g., to block the setting thread so the setter and the poller take turns, or have the setter push the updates into a message queue and have the poller receive them one by one).

And if you do that, you no longer need a volatile std::atomic, because it doesn't add any value now.

hsutter commented 6 years ago

Again, I think I answered every question raised in this thread in my 2009 "volatile vs. volatile" article. Nevertheless, I totally agree that this stuff is nonobvious and that despite my (and many others') articles about this, there are persistent misunderstandings in this area -- and I think this thread itself demonstrates why CP.8 is so important.

hsutter commented 6 years ago

@jwakely Right. And I should have emphasized explicitly that adding volatile doesn't change the example in any way. In particular, just adding volatile doesn't actually make the individual updates any more visible because the setting thread could just always run arbitrarily fast. And (as you point out) if you do the actual synchronization work you actually need in order to make individual updates visible to a polling thread, you still don't need volatile anyway. volatile is neither necessary nor sufficient, it has nothing to do with inter-thread communication, that's all, which is the point of CP.8.

hsutter commented 6 years ago

Calling @paulmckrcu, author of perfbook (hi Paul!) -- Paul, do you want to update perfbook regarding the following? And feel free to correct if I am misunderstanding/misrepresenting your intent.

The context is that @imre-palik wrote regarding using volatile for inter-thread communication/synchronization:

BTW, perfbook seems to agree with me on volatiles: https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git/tree/count/count.tex#n541

That's not my reading at all, it specifically says the following in that answer:

An upcoming version of the C standard aims to fill this gap, but until then, we depend on the kindness of the \GCC\ developers.

That must have been written before C11, because it is talking about _Atomic, the C11 equivalent of C++11 atomic. So no, perfbook does not agree with using volatile for inter-thread synchronization; it says that "volatile plus explicit barriers" was a blunt instrument back in the pre-atomic era, but the right answer is to use atomics once they are available. Which they now are.

paulmckrcu commented 6 years ago

Hello @hsutter and thank you for inviting me to this discussion!

The answer is "it depends", as is so often the case.

There was an attempt to move the Linux kernel's x86 architecture to C11 atomics in 2016 (https://lwn.net/Articles/691128/). This effort succeeded in that David Howells really did boot a Linux kernel using C11 atomics, but failed for QOI reasons (some tens of gcc bugs were filed, if I remember correctly). So the Linux kernel still uses a combination of volatile, inline assembly, barriers, and coding standards. And, yes, the Linux kernel really does use volatile where C11 would use relaxed, to say nothing of consume. If C++ proposals to emit fences for uses of relaxed come to pass, I expect that the Linux kernel will permanently use volatile instead of relaxed. The Linux kernel also respects things like control dependencies, which seem unlikely to appear soon in C or C++ (but a very scary set of coding standards must be followed to successfully use control dependencies). That said, many projects with a less aggressive approach to performance should have no problem with C11 atomics.

Oh, and some use cases really do require volatile atomics. Without the volatile keyword, a compiler could fuse a pair of relaxed loads to the same location, which might not make real-time programmers at all happy. So I respectfully but adamantly disagree with http://isvolatileusefulwiththreads.com/. (Imre Palik pointed this out in this thread last week.) Keying off the progress-bar example, the real-time programmers would feel completely within their rights to point a video camera at the screen in order to tell the difference between the two programs, and also to add delays to avoid the possibility of running faster than the pollers.

On volatile accesses and tearing, yes, in a 32-bit build of the Linux kernel if you do READ_ONCE() of a 64-bit variable, it will tear. Believe it or not, there is code in the Linux kernel that does this intentionally. (Portability? We have heard of it!) On the other hand, if you really want to annoy the typical Linux-kernel developer, have your compiler invent a lock. Something about deadlock should the variable be used both in process context and by interrupt handlers, no fun at all! So use of C11 atomics in the Linux kernel would need some of the same coding-standard constraints and also have some of the same portability restrictions that volatile has.

Yes, most concurrent algorithms that might use volatile would also need to use other things (for example, the aforementioned inline assembly, barriers, and coding standards). One example of a useful algorithm that could make do with (almost) nothing but volatile is distributed counters, where each thread updates its own variable and the counter is read out by summing all threads' variables (similar to a proposal from Lawrence Crowl). The "(almost)" refers to coding standards needed to ensure that the counters fit into a machine register -- and the limits to portability. That distributed-counter algorithm might not be what you want on an 8-bit microprocessor, but then again, from a purely practical viewpoint, multicore 8-bit microprocessors seem to be a bit on the rare and useless side.

Nevertheless, that quick-quiz answer is in need of an update, and thank you for calling my attention to it. If I don't propose something on this list by the end of the weekend, please nag me.

arvidn commented 6 years ago

@paulmckrcu I'm curious. You suggest the Linux kernel uses volatile to synchronize threads. Is there an assumed target architecture and assumed lowering of volatile accesses by the compiler built into such code? Does the Linux kernel use volatile for all architectures and all compilers? My understanding is that the C++ (and I would assume C) abstract machine would fail to run the Linux kernel if that is the case.

paulmckrcu commented 6 years ago

@arvidn Heh! On the one hand, I did not simply suggest that the Linux kernel uses volatile to synchronize threads, I asserted it. On the other hand, I also asserted that the Linux kernel does not use volatile in isolation, but rather in combination with inline assembly, barriers, and coding standards, as in "So the Linux kernel still uses a combination of volatile, inline assembly, barriers, and coding standards" in my post above. Given that background, let's take your questions one at a time:

Is there an assumed target architecture and assumed lowering of volatile accesses by the compiler built into such code?: For the current mainline version of the Linux kernel, there is not just one assumed target architecture, but rather 24 of them (down from 31 due to pruning of obsolete architectures), namely alpha, arm64, hexagon, m68k, nds32, parisc, s390, um, xtensa, arc, c6x, ia64, microblaze, nios2, powerpc, sh, unicore32, arm, h8300, mips, openrisc, riscv, sparc, and x86. Each such architecture provides the needed inline assembly. Last I checked, inline assembly is outside of both the C and C++ standards.

Does the linux kernel use volatile for all architectures and all compilers?: The Linux kernel uses volatile for all architectures and compilers that it supports. The supported architectures are listed above. The supported compilers are various versions of GCC. With some work, the Linux kernel can be built with LLVM. I occasionally hear rumors of people successfully building the Linux kernel with various proprietary compilers.

My understanding is that the C++ (and I would assume C) abstract machine would fail to run the linux kernel if that is the case.: Not really a question, but why not answer it anyway? The answer depends on the exact definition of "abstract machine". If you mean the abstract machine specified by the C++ (or C) standard, then yes, that abstract machine has never ever been able to run the Linux kernel. Now, the abstract machine did get closer with the advent of C11, but as you can see from https://lwn.net/Articles/691128/, there are (at the very least) some ugly QOI issues remaining. However, if you mean the abstract machine embodied by GCC, then to the contrary, the GCC abstract machine has been able to run the Linux kernel from the get-go.

Does that help, or am I missing the point of your questions?

paulmckrcu commented 6 years ago

And the updated answer to that quick quiz is available at the same link shown above: https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git/tree/count/count.tex#n541

This link will decay over time as other things change, but such is life. :-)