Closed raviqqe closed 2 years ago
Hi @raviqqe, great to see you here! Your Pen language looks very interesting!
Regarding the optimization, this Intel blogpost discusses that idea and concludes that it should not be done, because the C compiler might introduce stores: since it does not know that the refcount is thread-shared, it could write an old value back! In the comments, Arch D. Robinson points out that the C++11 standard, paragraph 1.10.22, prohibits that behaviour, but to me it still seems quite risky to rely on that.
Thanks for looking into this! Any contributions to Koka are always appreciated and the kklib support library has been written with much care :-) I can add two more observations:
On most architectures, relaxed atomic loads and stores are just like regular loads and stores; when you look at the generated assembly on x64 or arm64, the exact same instructions are generated whether we use a relaxed atomic load or just read the value normally. A relaxed load/store only guarantees that the whole value is read/written as one atomic unit. (This is already the case on most architectures, but you could imagine an embedded CPU that reads/writes 32 bits as two 16-bit chunks, and in that case you would need to handle it specially.)
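A minimal sketch of such a relaxed read in C11 (with a hypothetical `header_t`; the real kklib/Lean layouts differ). On x64 or arm64 this typically compiles to the same plain `mov`/`ldr` as a non-atomic read:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical object header; real kklib/Lean layouts differ. */
typedef struct { _Atomic int32_t refcount; } header_t;

/* A relaxed atomic load makes no ordering promises; it only
   guarantees that the 32-bit value is read as one unit (no tearing). */
static int32_t read_refcount_relaxed(header_t* h) {
    return atomic_load_explicit(&h->refcount, memory_order_relaxed);
}
```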
For Koka/Lean we have a special "monotonic" property for reference counts: if the object is not (thread) shared, the refcount is always >= 0 and it is explicitly made negative once it can be shared. So, we can safely read the refcount and, if it is >= 0, we can just increment/decrement it without needing (non-relaxed) atomic operations (as it can't become shared in between the read and the inc/dec). This is the trick that makes refcounting very fast in Koka and Lean. When designing a language, make sure that potential thread sharing of objects can be detected (in Koka: global values and the values captured when spawning a task).
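A sketch of that trick in C11, with hypothetical names (`obj_t`, `dup_ref`) and an assumed encoding where shared counts are negative and grow downward; the real kklib and Lean code differ in detail:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical header; assumes shared counts are negative and
   grow downward, so a logical "increment" subtracts. */
typedef struct { _Atomic int32_t rc; } obj_t;

static void dup_ref(obj_t* o) {
    int32_t rc = atomic_load_explicit(&o->rc, memory_order_relaxed);
    if (rc >= 0) {
        /* Unshared: the object cannot become shared between the
           read above and this store, so a plain update is safe. */
        atomic_store_explicit(&o->rc, rc + 1, memory_order_relaxed);
    } else {
        /* Shared: a real atomic read-modify-write is required. */
        atomic_fetch_sub_explicit(&o->rc, 1, memory_order_relaxed);
    }
}
```

The key design point is that the `rc >= 0` branch never races: only the owning thread can see a non-negative count, so no other thread can touch the field before the store lands.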
Best, Daan
ps. I guess that, technically, the Lean code should also use relaxed atomic loads/stores: on a hypothetical CPU architecture that does not load/store a 32-bit value as one whole unit, a concurrent access to a shared object could mess up the actual bits that are read non-atomically. But then, I have not seen such a system in practice.
@anfelor Thanks for your quick reply and the compliment!
I see. So we shouldn't rely on the behavior in general, even if it works with current compilers and infrastructure. And in this particular case, for example, the compiler might write back the entire 64-bit header value on modification of parts other than the refcount field, which could cause data races even though that's explicitly prohibited by the C++ standard? Or, more generally, compilers can use the memory location to store temporary data.
@daanx Thanks for sharing your knowledge! That's really helpful.
For Koka/Lean we have a special "monotonic" property for reference counts
Correct me if I'm wrong, but as far as I know, Lean's reference count is currently not monotonic in this sense; the counts are just flipped to negative values, I think. For example,
if running on a hypothetical CPU architecture that does not load/store a 32-bit value as one whole value, then when an object is shared, we could imagine that a concurrent access could potentially mess up the actual bits that are read non-atomically.
So, given out-of-order and speculative execution, I think the current implementation of Lean 4 could still misbehave even if the CPU writes count values "atomically." For example, a non-atomic duplication (increment) operation could take effect before an atomic drop (also an increment under the flipped encoding) operation on references to a thread-shared memory block, which could cause the memory block to be released prematurely.
Anyway, this is more of an issue or potential improvement on Lean's side, so I'm gonna close this issue. Thank you!
Hi @raviqqe , thanks for the reply. I mean "monotonic" not in the sense of increasing integers, but as in: it is "unshared" first, and at some point can become "shared". If you read an unshared ref count (= positive) then you can handle it non-atomically without worrying about concurrent access. So, just before making an object thread-shared, you can safely make the refcounts shared (= negative); only after that are atomic operations required. (So Lean is doing it correctly.)
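A sketch of that transition point, again with hypothetical names: just before the object can escape to another thread, its count is negated once, and from then on every reader sees `rc < 0` and switches to atomic operations:

```c
#include <stdatomic.h>
#include <stdint.h>

typedef struct { _Atomic int32_t rc; } obj_t;  /* hypothetical layout */

/* Called only while the object is still thread-local, e.g. for a
   value captured by a task that is about to be spawned. */
static void mark_shared(obj_t* o) {
    int32_t rc = atomic_load_explicit(&o->rc, memory_order_relaxed);
    if (rc > 0) {
        /* Flip to the negative ("shared") encoding exactly once. */
        atomic_store_explicit(&o->rc, -rc, memory_order_relaxed);
    }
}
```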
Ah, that was a total misunderstanding on my part, then! Also, apparently stores to counts can't happen before their loads, so the case I described above is invalid anyway. Thanks for the detailed description!
There seems to be a potential optimization in the multi-thread flag check: loading refcount fields from kk_header_t with non-atomic load instructions. With this, I guess single-threaded programs written in Koka could potentially be free of atomic instructions. I'm not really sure the investigation described below is correct, and I would also like to hear the reasoning behind the original decision Koka took.

So I've read through the code of Koka and Lean 4 related to reference counting, and it seems that Lean 4's reference counting algorithm assumes that concurrent access between a non-atomic load operation and an atomic write operation is fine. Links to the corresponding example code for reference duplication are in the section below.
The assumption Lean is relying on doesn't seem to be valid in general in terms of the C++ standard as it says:
and:
But it also says:
So, if Koka doesn't need to support such peculiar architectures or targets, could we replace the atomic load instructions used just to check multi-thread bits in reference counts with non-atomic ones? I'm also curious how much performance impact that would have.
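As a sketch of what the proposed check might look like (hypothetical `hdr_t` and flag encoding; the real `kk_header_t` layout differs): read the count with a plain load and branch on a sign bit used as the thread-shared flag, so the single-threaded fast path never executes an atomic instruction:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical header: a 32-bit refcount whose sign bit doubles
   as the thread-shared flag; not the actual kk_header_t layout. */
typedef struct { int32_t refcount; uint32_t tag; } hdr_t;

/* The proposed optimization: a plain (non-atomic) load to test the
   thread-shared flag; only the shared branch would need atomics. */
static bool is_thread_shared(const hdr_t* h) {
    return h->refcount < 0;
}
```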
Corresponding code locations
Koka
https://github.com/koka-lang/koka/blob/b1670308f88dd1fc6c22cad28385fcb185d5b27d/kklib/include/kklib.h#L609
https://github.com/koka-lang/koka/blob/b1670308f88dd1fc6c22cad28385fcb185d5b27d/kklib/include/kklib.h#L250
Lean 4
https://github.com/leanprover/lean4/blob/484e5102216badf51e3738242c45c982895f8d6c/src/include/lean/lean.h#L398
https://github.com/leanprover/lean4/blob/484e5102216badf51e3738242c45c982895f8d6c/src/include/lean/lean.h#L418