koka-lang / koka

Koka language compiler and interpreter
http://koka-lang.org

Potential optimization in multi-thread flag check in reference counting #276

Closed raviqqe closed 2 years ago

raviqqe commented 2 years ago

There seems to be a potential optimization in the multi-thread flag check: loading the refcount fields of kk_header_t with non-atomic load instructions. With this, I guess single-threaded programs written in Koka could potentially be free of atomic instructions. I'm not really sure if the investigation described below is correct. I would also like to hear the reasoning behind the original decision Koka made.

So I've read through the reference-counting code of both Koka and Lean 4. It seems that Lean 4's reference-counting algorithm assumes it's fine to have concurrent access between a non-atomic load operation and an atomic write operation. Links to the corresponding reference-duplication code are in the section below.

The assumption Lean is relying on doesn't seem to be valid in general under the C++ standard, which says:

Two expression evaluations conflict if one of them modifies a memory location ([intro.memory]) and the other one reads or modifies the same memory location.

and:

The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, except for the special case for signal handlers described below. Any such data race results in undefined behavior.

But it also says:

Transformations that introduce a speculative read of a potentially shared memory location might not preserve the semantics of the C++ program as defined in this document, since they potentially introduce a data race. However, they are typically valid in the context of an optimizing compiler that targets a specific machine with well-defined semantics for data races.

So, if Koka doesn't need to support peculiar architectures or targets, can we potentially replace the atomic loads used only to check the multi-thread bits in reference counts with non-atomic ones? I'm also curious how much performance impact this would have.

Corresponding code locations

Koka

https://github.com/koka-lang/koka/blob/b1670308f88dd1fc6c22cad28385fcb185d5b27d/kklib/include/kklib.h#L609

https://github.com/koka-lang/koka/blob/b1670308f88dd1fc6c22cad28385fcb185d5b27d/kklib/include/kklib.h#L250

Lean 4

https://github.com/leanprover/lean4/blob/484e5102216badf51e3738242c45c982895f8d6c/src/include/lean/lean.h#L398

https://github.com/leanprover/lean4/blob/484e5102216badf51e3738242c45c982895f8d6c/src/include/lean/lean.h#L418

anfelor commented 2 years ago

Hi @raviqqe, great to see you here! Your Pen language looks very interesting!

Regarding the optimization, this Intel blogpost discusses that idea and concludes that it should not be done because the C compiler might introduce stores: since it does not know that the refcount is thread-shared, it could write an old value back! In the comments, Arch D. Robinson points out that the C++11 standard, paragraph 1.10.22, prohibits that behaviour, but to me it still seems quite risky to rely on that.

daanx commented 2 years ago

Thanks for looking into this! Any contributions to Koka are always appreciated and the kklib support library has been written with much care :-) I can add two more observations:

  1. On most architectures, relaxed atomic loads and stores are just like regular loads and stores; when you look at the generated assembly on x64 or arm64, the exact same instructions are generated whether we use a relaxed atomic load or a plain read. A relaxed load/store just guarantees that the whole value is read/written as one atomic unit. (This already holds on most architectures, but you could imagine an embedded CPU that reads/writes 32 bits as two 16-bit chunks, and in that case you would need to handle it specially.)

  2. For Koka/Lean we have a special "monotonic" property for reference counts: if the object is not (thread) shared, the refcount is always >= 0, and it is explicitly made negative once it can be shared. So, we can safely read the refcount and, if it is >= 0, just increment/decrement it without needing (non-relaxed) atomic operations (as it can't become shared in between the read and the inc/dec). This is the trick that makes refcounting very fast in Koka and Lean. When designing a language, make sure that potential thread sharing of objects can be detected (in Koka: global values and the values captured when spawning a task).

Best, Daan

ps. I guess technically, the Lean code should actually also use relaxed atomic loads/stores: if running on a hypothetical CPU architecture that does not load/store a 32-bit value as one whole unit, then, when an object is shared, a concurrent access could potentially mess up the actual bits that are read non-atomically. But then, I have not seen such a system in practice.

raviqqe commented 2 years ago

@anfelor Thanks for your quick reply and the compliment!

I see. So we shouldn't rely on that behavior in general, even if it works with current compilers and infrastructure. And in this particular case, for example, the compiler might write back the entire 64-bit header value when modifying fields other than the refcount, which would cause potential data races, although that's explicitly prohibited by the C++ standard? Or, more generally, compilers could use the memory location to store temporary data.

@daanx Thanks for sharing your knowledge! That's really helpful.

For Koka/Lean we have a special "monotonic" property for reference counts

Correct me if I'm wrong, but as far as I know, Lean's reference count is currently not monotonic in this sense; the counts are just flipped to negative values, I think.

if running on a hypothetical CPU architecture that does not load/store a 32-bit value as one whole value, then when an object is shared, we could imagine that a concurrent access could potentially mess up the actual bits that are read non-atomically.

So, given out-of-order and speculative execution, I think the current implementation of Lean 4 could still behave incorrectly even if the CPU writes count values "atomically." For example, a non-atomic duplication (increment) operation could be reordered before an atomic drop operation (also an increment of the stored value, since shared counts are negative) on references to a thread-shared memory block, which could cause the memory block to be released prematurely.

Anyway, this is more like an issue or potential improvement on Lean's side, so I'm gonna close this issue. Thank you!

daanx commented 2 years ago

Hi @raviqqe, thanks for the reply. I mean "monotonic" not in the sense of increasing integers, but in the sense that an object is "unshared" first and at some point can become "shared". If you read an unshared refcount (= positive), you can handle it non-atomically without worrying about concurrent access. So, just before making an object thread-shared, you can safely make its refcount shared (= negative); only after that are atomic operations required. (So Lean is doing it correctly.)

raviqqe commented 2 years ago

Ah, that was a total misunderstanding on my part then! Also, apparently stores of counts can't happen before their loads, so the case I described above is invalid anyway. Thanks for the detailed description!