typelevel / cats-effect


We need to be more explicit about the Cats Effect memory model #3899

Open armanbilge opened 7 months ago

armanbilge commented 7 months ago

Currently the Cats Effect memory model is implicitly defined via implementation details and the Java Memory Model (JMM). In some cases this works okay but in other cases this means that our memory model is effectively undefined.

As a concrete example, earlier today on Discord there was a report that Semaphore offers no memory ordering guarantees when acquiring/releasing 0 permits. This boils down to a specific implementation detail that doesn't matter if you are using CE's own concurrency primitives and data structures, but matters a lot if you are reading/writing unsynchronized memory unsafely with delay. We can't say if the current Semaphore implementation is bugged or not without explicitly defining the expected semantics.
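
To make that concrete, here is a minimal sketch (hypothetical code reconstructed from the description, not the actual report) of the pattern in question:

import cats.effect.{IO, IOApp}
import cats.effect.std.Semaphore

object ZeroPermitRace extends IOApp.Simple {
  var shared: Int = 0 // unsynchronized mutable memory, written only inside delay

  def run: IO[Unit] =
    Semaphore[IO](0).flatMap { sem =>
      val writer = IO.delay { shared = 42 } *> sem.releaseN(0)
      val reader = sem.acquireN(0) *> IO.delay(println(shared))
      // acquireN(0) never suspends, so the acquire is never paired with the
      // release; whether the reader must observe 42 (even if it runs after
      // the writer) is exactly the part that is currently undefined.
      writer.both(reader).void
    }
}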

With Scala Native multi-threading on the horizon in https://github.com/typelevel/cats-effect/issues/3662, nailing down the specifics of the CE memory model becomes even more important since we have a lot more flexibility in implementation details. Longer-term, Scala Native could potentially even support weaker memory models than the JMM.[^1]

[^1]: Tangential, but I have a hypothesis that code written using immutable data structures and CE concurrency primitives actually does not need many guarantees of the JMM (e.g. publication of final fields on construction) and that a weaker CE memory model would be sufficient and have better performance esp. on ARM.

durban commented 7 months ago

I think it's hard to do this right, but probably worth it. Even a somewhat informal description is probably better than nothing (which is the current situation).

One of the things which makes this hard is that (as far as I know) the Scala memory model is also "implicitly defined via implementation details and the Java Memory Model".

Another thing we should think about is whether to define these rules for the typeclasses in kernel, or only for IO. Defining them for the typeclasses is more useful (I think), but that would also mean that we create new rules that every implementer (ZIO, Monix(?), ...) must follow.

I know it's just the example which triggered this issue, but I'd be interested in what guarantee an acquire(0) or release(0) is expected to provide. As far as I know, a semaphore (intuitively) provides happens-before for corresponding release/acquire pairs. If we're rel/acq-ing zero permits, these are not paired with each other...

(None of these concerns means we shouldn't do this. I think we should; I've just started to think about how to do it...)

armanbilge commented 7 months ago

I know it's just the example which triggered this issue, but I'd be interested in what guarantee an acquire(0) or release(0) is expected to provide. As far as I know, a semaphore (intuitively) provides happens-before for corresponding release/acquire pairs. If we're rel/acq-ing zero permits, these are not paired with each other...

For comparison, here's what the JDK Semaphore promises:

Memory consistency effects: Actions in a thread prior to calling a "release" method such as release() happen-before actions following a successful "acquire" method such as acquire() in another thread.

The OP mentioned that it works even with 0 permits.

durban commented 7 months ago

Memory consistency effects: Actions in a thread prior to calling a "release" method such as release() happen-before actions following a successful "acquire" method such as acquire() in another thread.

I don't think this part of the JDK apidoc (which is usually pretty good) is worded as carefully as it should be. By that, I mean that it is not true as written. Let's say we have s = new Semaphore(2); and one thread does s.acquire(1); foo(); s.release(1); while another thread does s.acquire(1); bar(); s.release(1). By a literal reading of the sentence above, foo() happens-before bar(), which is not true (in reality there is no specific ordering between them, as they could use "different" permits); also, according to the same sentence, bar() also happens-before foo(), which is a contradiction.
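
Spelling that out (foo and bar are hypothetical side effects):

val s = new java.util.concurrent.Semaphore(2)

// thread 1:
s.acquire(1); foo(); s.release(1)

// thread 2:
s.acquire(1); bar(); s.release(1)

// With 2 permits, both threads can hold a permit at the same time, so neither
// acquire needs to observe the other thread's release: there is no ordering
// between foo() and bar(), despite the literal reading of the apidoc.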

If we look at the case of monitors in JLS 17.4.5, this is what it says:

An unlock on a monitor happens-before every subsequent lock on that monitor.

The word "subsequent" seems important here (and I suspect that's what's missing from the semaphore apidoc). It is defined formally in 17.4.4. As far as I understand, a consequence of that definition is essentially the same as intuitively thinking in rel-acq pairs (or rel-acqs corresponding to each other). So there is happens-before, if the release and the acquire correspond to each other. (At least that's how I understand it.)

And I think this is where there is a problem with release(0) and acquire(0): they never correspond to each other (due to zero permits). ("Correspond" or "paired with" are obviously not precise enough to have in the spec, but I think it helps thinking about the issue.)

armanbilge commented 7 months ago

Another thing we should think about is whether to define these rules for the typeclasses in kernel, or only for IO. Defining them for the typeclasses is more useful (I think), but that would also mean that we create new rules that every implementer (ZIO, Monix(?), ...) must follow.

That's a really great point. Particularly Ref is the important building block: until we specify its memory effects, we are helpless to say anything about the semantics of a Concurrent-based Semaphore.

But that's good though, I think it gives a concrete place to start from.

The other key semantic that jumps to mind is about memory visibility when you have sequential delay(...) *> delay(...) but each of those delay blocks runs on a different thread.
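
For example (a minimal sketch; x is a hypothetical unsynchronized var, and the cede makes a thread switch likely):

import cats.effect.IO

object SequentialVisibility {
  var x: Int = 0 // plain, unsynchronized

  // One fiber, two delay blocks, quite possibly two different carrier
  // threads; the question is whether the write must be visible to the read.
  val prog: IO[Unit] =
    IO.delay { x = 1 } *> IO.cede *> IO.delay(println(x)) // should print 1
}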

I'd also like to figure out where and how we are currently relying on these implicit semantics ourselves.

durban commented 7 months ago

Particularly Ref is the important building block: until we specify its memory effects, we are helpless to say anything about the semantics of a Concurrent-based Semaphore.

Yeah. The rule could be the CE equivalent of this from the JLS: "A write to a volatile field happens-before every subsequent read of that field." (And of course this holds for the default Ref, due to using an AtomicReference.)

The other key semantic that jumps to mind is about memory visibility when you have sequential delay(...) *> delay(...) but each of those delay blocks runs on a different thread.

And this could be the CE equivalent of this. "If x and y are actions of the same thread and x comes before y in program order, then hb(x, y)." Obviously with fiber instead of thread, and "something" instead of program order. (As far as I can tell, this also holds currently due to Executor semantics. Of course we should double check that the WSTP follows correct Executor semantics, but it should.)

Similarly: we also have an equivalent of "A call to start() on a thread happens-before any actions in the started thread", when forking a fiber. And of "All actions in a thread happen-before any other thread successfully returns from a join() on that thread", when joining a fiber.

We should have something for async (or rather cont). Like: calling the callback happens-before the suspended fiber continuing. (I think this is also true currently.)
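
For illustration, a minimal sketch of the fork/join rules (my sketch, not spec language; result is a plain var):

import cats.effect.{IO, IOApp}

object ForkJoinVisibility extends IOApp.Simple {
  var result: Int = 0 // plain, unsynchronized

  def run: IO[Unit] =
    for {
      fib <- IO.delay { result = 1 }.start // forking happens-before the fiber's actions
      _   <- fib.join                      // the fiber's actions happen-before join returning
      _   <- IO.delay(println(result))     // should print 1
    } yield ()
}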

djspiewak commented 7 months ago

Relates to #2928

armanbilge commented 5 months ago

A little while back, Daniel S and I chatted about this on Discord (sorry, public internet) and had a follow-up conversation.

In those discussions, I proposed that most Cats Effect applications and libraries actually do not rely on the JMM's final field semantics.[^1] In idiomatic Cats Effect code, data is explicitly shared via Ref and Deferred (or higher-level concurrent structures like Queue) and thus can be properly synchronized. This is important because publication of final fields is expensive on the JVM and on Scala Native, especially on ARM.

Continuing from there, I also want to nail down exactly what memory is published when synchronizing via Ref and Deferred. For example, they could create a full memory fence, but I argue that this is not necessary, especially for idiomatic Cats Effect code, which never uses vars etc.[^2] Instead, we only need to publish memory that is "structurally reachable" from the pointer we just published via the Ref/Deferred.

It turns out this is an existing concept in C++, known as memory_order_consume.

The consume memory order is a weaker form of acquire. Whereas acquire prevents all future memory accesses from being ordered ahead of the load, the consume order only prevents dependent future memory accesses from being reordered ahead of the load.

https://devblogs.microsoft.com/oldnewthing/20230427-00/?p=108107

The JMM also uses some notion of reachability to define its final field semantics.

... when the object is seen by another thread, that thread will always see the correctly constructed version of that object's final fields. It will also see versions of any object or array referenced by those final fields that are at least as up-to-date as the final fields are.

https://docs.oracle.com/javase/specs/jls/se21/html/jls-17.html#jls-17.5
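
As a sketch of what "structurally reachable" publication would and would not promise (names hypothetical):

import cats.effect.IO

object StructuralReachability {
  final case class User(name: String, tags: List[String]) // immutable

  var unrelated: Int = 0 // memory NOT reachable from the published User

  val prog: IO[Unit] =
    for {
      ref <- IO.ref(Option.empty[User])
      _   <- (IO.delay { unrelated = 1 } *>
               ref.set(Some(User("bob", List("admin"))))).start
      u   <- ref.get
      // If u is Some(user): user.name, user.tags and the list's nodes are all
      // reachable from the published pointer, hence guaranteed visible. The
      // write to unrelated would NOT be guaranteed visible; making it visible
      // is what a full acquire would additionally provide.
      _   <- IO.delay(println((u, unrelated)))
    } yield ()
}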

Note: I'm not claiming that we can implement exactly these semantics. For example, final field semantics are not going away on the JVM, and LLVM doesn't currently expose memory_order_consume. It's just about what we are explicitly promising in our API. Meanwhile, our implementation details will end up offering stronger guarantees in practice than what is actually specified by our memory model. For most developers this won't matter.

Anyway, the reason for bringing this up again is because the possibility of relaxing the memory model on Scala Native has finally cropped up :)

[^1]: The JMM's final field publication semantics are important when you are not using explicit synchronization, leaving memory visibility subject to data race. For example, we use this technique to support stealing of timers and packing a linked list. In these data race scenarios, the final field semantics are useful to guarantee that you see a consistent view of an object.

[^2]: Code that is working with vars (or other mutable memory) must already have a Sync constraint or greater and is arguably "low-level". In which case, that code is fully capable of using AtomicReference (instead of Ref) or manually fencing to achieve the desired memory effects.

durban commented 5 months ago

Just a note, memory_order_consume in C++ seems to be a mess (well... what isn't? ;-).

It is widely accepted that the current definition of memory_order_consume in the standard is not useful. All current compilers essentially map it to memory_order_acquire.

From P0371R1: Temporarily discourage memory_order_consume

That's from 2016, but newer notes are not much more encouraging:

The current definition of memory_order_consume is not sufficient to allow implementations to map a consume load to a "plain" load instruction (i.e. without using fences), despite this being the intention of the original proposal. Instead, memory_order_consume is typically treated as though the program had specified memory_order_acquire...

From P0735R1: Interaction of memory_order_consume with release sequences

Also interesting reading: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0750r1.html


Why would final field semantics matter when we specify and/or implement Ref and Deferred? They seem to be orthogonal to me. (As you say, "final field publication semantics are important when you are not using explicit synchronization", which Ref/Deferred is.)

(On the other hand, final field semantics absolutely matter to scala-native.)

armanbilge commented 5 months ago

Why would final field semantics matter when we specify and/or implement Ref and Deferred? They seem to be orthogonal to me.

Another way we could define the CE memory model is to say that Ref and Deferred publish their own value (a pointer to an object) but have no other memory effect (I believe this is memory_order_relaxed). This model would still be useful so long as users could rely on final field semantics to safely publish pointers to immutable objects via Ref and Deferred.
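
As a sketch of that weaker model (hypothetical semantics, not what Ref promises today):

import cats.effect.IO
import cats.effect.kernel.Ref

object RelaxedRefModel {
  final case class Config(port: Int) // vals compile to final fields

  // Under the relaxed model, set would publish only the pointer itself...
  def publish(ref: Ref[IO, Option[Config]]): IO[Unit] =
    ref.set(Some(Config(8080)))

  // ...and reading the contents would be safe only because Config is
  // immutable: final field semantics, not the Ref, publish the fields.
  def read(ref: Ref[IO, Option[Config]]): IO[Option[Int]] =
    ref.get.map(_.map(_.port))
}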

durban commented 5 months ago

Huh, yeah, that makes sense. Basically, even if we'd make Ref useless, it wouldn't be completely useless :smile:

armanbilge commented 5 months ago

Cross-linking my comment on the Scala Native thread.

durban commented 5 months ago

So I've skimmed the "All Fields are Final" article again, and the cost doesn't seem so dramatic to me. The conclusion seems to be that on x86 it is practically free, and on ARM it is also not too bad. It probably helps that hotspot can somehow avoid the acq fence on reads. (Do we know why scala-native can't do that?)

(Btw, I think what hotspot does with eliding the acquire is essentially what memory_order_consume tries to be.)

But of course, if we can make it faster on ARM, that's always good. (I'm just afraid of immutable objects changing... that's not fun...)

armanbilge commented 5 months ago

It probably helps that hotspot can somehow avoid the acq fence on reads. (Do we know why scala-native can't do that?)

Hmm, what are you referring to exactly? This bit?

C tc = c;
[LoadLoad]
r1 = tc.c;

... Turns out, most hardware also respects the order of so-called 'dependent' reads, and hence does not require emitting the barrier there.

If I understand correctly, I wouldn't describe this as avoiding the acquire fence. Rather, there is semantically an acquire there that just happens to require emitting no instructions on most hardware.

But you did exactly nail the question I've been pondering today, which is whether the implementation of final Field Semantics that I suggested in https://github.com/scala-native/scala-native/issues/3610#issuecomment-1843494686 is actually correct 😅

More broadly, my understanding was that for release to be useful it must always be paired with an acquire. This is straightforward enough when you are loading/storing at the same address. But in this case, we want to pair a release fence (at the end of the constructor) with a load acquire (of the final field). The C++ reference describes a fence-atomic synchronization mechanism corresponding to our situation.

A release fence F in thread A synchronizes-with atomic acquire operation Y in thread B, if

  • there exists an atomic store X (with any memory order),
  • Y reads the value written by X (or the value would be written by release sequence headed by X if X were a release operation),
  • F is sequenced-before X in thread A.

I found a nice illustration of this in these slides.

[Screenshot: slide illustrating a release fence paired with an acquire load (fence-atomic synchronization)]
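
Reconstructing the pattern from the slide as pseudocode (using F, X, Y from the quote above):

// Thread A:
obj.b = 1        // plain store (think: a final field write in the constructor)
releaseFence()   // release fence F
a = obj          // atomic store X with relaxed ordering

// Thread B:
val o = a        // atomic load-acquire Y; if it observes the value stored by X,
val x = o.b      // then F synchronizes-with Y and this must read 1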

In the context of final Field Semantics, I suppose b would be the final field and a would be the reference to the constructed object. Notice how a is stored with relaxed ordering, which is stronger than an ordinary, non-volatile write on the JVM (LLVM calls these monotonic and unordered, respectively). This StackOverflow answer also discusses some of the differences; IIUC relaxed is similar to Java volatile (i.e. cannot be cached) but with no other memory effect.

So this is where I'm getting a bit puzzled. It's quite possible that the JVM, C++, and LLVM are just not corresponding to each other in such a direct or obvious way. It's also possible that on most platforms there is no practical distinction between a relaxed store and an ordinary store.

But the question remains what is the correct way to implement final Field Semantics on Scala Native 🤔


Regarding performance on JVM vs Scala Native ... the "All Fields are Final" article discusses barrier coalescing on HotSpot. It's possible that Native is unable to do this kind of optimization. If it did, it would have to be something in LLVM's optimizer (didn't see any hits in a quick search). So this could explain why the performance impact is much worse on Native.

durban commented 5 months ago

If I understand correctly, I wouldn't describe this as avoiding the acquire fence. Rather, there is semantically an acquire there that just happens to require emitting no instructions on most hardware.

You're right. Or rather, semantically there is a consume there, because I think they can avoid a CPU-fence instruction exactly because it's a consume-like situation (i.e., they don't need a general-purpose acquire).


Regarding how finals are implemented in scala-native: is your concern that (1) the release is not paired correctly with the acquire, because they're for different memory addresses; or that (2) the object reference may be written with a plain store instead of a relaxed one? (Or both?)

(Now we're talking about something which I only have a vague understanding of, so don't take me too seriously.)

About (1): I think this is a valid concern. It seems that the acquire fence will be in the wrong place. If I understand it correctly, this is what happens now:

// constructor:
this.finalField = something
releaseFence()

// using the object:
val obj = ...
val x = obj.finalField
acquireFence()

While the constructor seems correct, at the use site what should happen is more like this:

// using the object:
val obj = ...
acquireFence()
val x = obj.finalField

About (2): I think C++ relaxed approximately corresponds to opaque, which is stronger than plain. But, there is this (here, emphasis mine):

Instead of X.setRelease(this, v), you can use Opaque mode (or Plain mode if x is primitively bitwise atomic), preceded with a releaseFence: VarHandle.releaseFence(); x = v;. And similarly, instead of int v = X.getAcquire(this), you can follow a Plain or Opaque mode access with an acquireFence: int v = x; VarHandle.acquireFence().

Are object references primitively bitwise atomic in scala-native? They are on the JVM. (But when Valhalla value types arrive, those might not be!) If they are also primitively bitwise atomic in scala-native, (2) seems fine to me. (But honestly, I don't really understand the difference between LLVM monotonic and unordered...)


Regarding what's implementable and what is not (when we're implementing, e.g., Ref): I assume on scala-native we could do whatever LLVM allows(?). On the JVM: if we let go of JVM 8, we have a lot of toys to play with. In fact, we can have most of the things C++ has :smile: (with the notable exception of consume).

armanbilge commented 5 months ago

Or rather, semantically there is a consume there, because I think they can avoid a CPU-fence instruction exactly because it's a consume-like situation (i.e., they don't need a general-purpose acquire).

Actually it is even simpler than a consume-like situation because of the finality of the field. The sample_consume_disallowed() example on this blog post illustrates that consume disallows speculation of finalField prior to loading obj, because of the dependency. In our case, speculation is ok: because finalField is final, any values obtained by speculation prior to the load must be consistent with the value after the load.

Attempting to illustrate more concretely, consider:

val obj1 = getObjUnordered()
val obj2 = getObjUnordered()
val x = obj2.finalField

It would be okay for that code to be re-written to:

val obj1 = getObjUnordered()
val speculation = obj1.finalField
val obj2 = getObjUnordered()
val x = if (obj2 eq obj1) speculation else obj2.finalField

Contrast with:

val obj1 = getObjUnordered()
val obj2 = getObjConsume()
val x = obj2.finalField

In this case, that rewrite is not allowed.

It seems like final Field Semantics are working without barriers because "most" hardware works that way. But I'm not sure which hardware 😅

Breaking this order is not directly possible on runtime side, because it will violate the data dependencies. ... Turns out, most hardware also respects the order of so-called 'dependent' reads, and hence does not require emitting the barrier there.


About (1): I think this is a valid concern. It seems that the acquire fence will be in the wrong place.

Ugh, you are right, it is in the wrong place. And actually I had completely missed this.

Are object references primitively bitwise atomic in scala-native?

Not sure. I hope so 😅 object references are modeled as LLVM pointers and I can't find any LLVM-related hits for those keywords. They probably call it something else ...

But honestly, I don't really understand the difference between LLVM monotonic and unordered...

I agree, it seems to be (approximately) equivalent to the JVM's opaque and plain.

In any case, thanks, that section you quoted is helpful. Still don't feel like I fully grok anything 🤔 It definitely does seem like the acquire for loading final fields is in the wrong place and that we probably don't need it in practice. The combination of the data dependency and finality seems sufficient to achieve the semantics we want ... on "most hardware" 😅 I'm going to stare harder at this for a while.


I assume on scala-native we could do whatever LLVM allows(?).

On Scala Native we can do whatever NIR (Scala Native bytecode) allows. In practice, this corresponds closely to LLVM, including for atomics.

durban commented 5 months ago

In our case, speculation is ok: because finalField is final, any values obtained by speculation prior to the load must be consistent with the value after the load.

I don't think so. This whole game with fences is exactly because it is possible to read 2 different values from a final field. Namely, the null before the initialization, and its initialized value.


It seems like final Field Semantics are working without barriers because "most" hardware works that way. But I'm not sure which hardware 😅

Apparently Alpha is not like "most" hardware:

Without rcu_dereference(), DEC Alpha can load a pointer, dereference that pointer, and return data preceding initialization that preceded the store of the pointer.

From https://www.kernel.org/doc/Documentation/RCU/rcu_dereference.txt (which is about Linux RCU, which was apparently the motivation for consume).

But also, the CPU is not the only one who can reorder things: the compiler(s) can do that too.

Without memory order consume, both the compiler and (again, in the case of DEC Alpha) the CPU would be within their rights to carry out aggressive data-speculation optimizations that would permit readers to see pre-initialization values in the newly added data elements.

From https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/p0098r0.pdf.

So I think final fields work without (read) barriers in Hotspot, because (1) there is no Hotspot for Alpha (probably?), and (2) Hotspot knows that neither Hotspot nor the Java compiler makes any optimization which would break them. (I think (2) is equivalent to saying that there is a compiler read barrier for final fields.)

Btw., interesting (although long, and at points very confusing) discussion about consume: https://news.ycombinator.com/item?id=36040457

armanbilge commented 5 months ago

Thanks, I found jcranmer's comments in this thread of the HN post very helpful.

So I think final fields work without (read) barriers in Hotspot, because (1) there is no Hotspot for Alpha (probably?), and (2) Hotspot knows that neither Hotspot nor the Java compiler makes any optimization which would break them. (I think (2) is equivalent to saying that there is a compiler read barrier for final fields.)

Okay, now I see what you are saying. I agree.

For the record, I have a hard time seeing what kind of optimizations a compiler could make at the use-site of a final field that would allow it to see uninitialized values without violating the data dependency. Of course the limits of my imagination don't count for anything in practice 😅

Unfortunately this is bad news for Scala Native where we don't have fine control over the optimizations that LLVM makes. Or rather, emitting these barriers is how we control the optimizations.

I'm worried that to correctly implement final Field Semantics, Scala Native must:

  1. emit a release fence at the end of the constructor of a class with final fields
  2. store instances with relaxed ordering
  3. load instances with acquire ordering

Since so much code is generic, eliding the barriers for stores/loads is difficult unless you know at linking time that no subclasses require final Field Semantics.
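
In pseudocode (in the style of the earlier snippets), that scheme would look like:

// constructor of a class with final fields:
this.finalField = something
releaseFence()            // (1)

// publishing an instance to shared memory:
shared = obj              // (2) relaxed store

// reading on another (potentially racing) thread:
val obj = shared          // (3) acquire load; equivalently, a plain load
acquireFence()            //     followed by an acquire fence *before* the
val x = obj.finalField    //     dereference, i.e. in the corrected position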

durban commented 5 months ago
store instances with relaxed ordering

Doesn't scala-native need this anyway? For every access to shared memory, since data races are undefined behavior (on LLVM).

armanbilge commented 5 months ago

For every access to shared memory, since data races are undefined behavior (on LLVM).

Scala Native uses unordered for this.

It essentially guarantees that races produce somewhat sane results instead of having undefined behavior. ... This is intended to match the Java memory model for shared variables.

durban commented 5 months ago

Ah, okay. And that's not enough for the situation with finals? Does it really need monotonic (which is apparently the same as relaxed)?

armanbilge commented 5 months ago

Does it really need monotonic (which is apparently the same as relaxed)?

Well, maybe not, based on what you shared in https://github.com/typelevel/cats-effect/issues/3899#issuecomment-1898911499.

Instead of X.setRelease(this, v), you can use Opaque mode (or Plain mode if x is primitively bitwise atomic), preceded with a releaseFence: VarHandle.releaseFence(); x = v;

If we can confirm primitive bitwise atomicity of pointers, then I suppose an unordered write would be sufficient.

armanbilge commented 5 months ago

I suppose an unordered write would be sufficient.

What bothers me is that even though the JDK9 Memory Order Modes guide suggests this would work, nothing in the C/C++ world says that. If anything, they seem to say the opposite i.e. that it won't work 😕

It is possible to issue memory barrier without an associated atomic operation ... [but] still requires atomic operations to work as defined

durban commented 5 months ago

We're getting very much offtopic here, but...

If anything, they seem to say the opposite i.e. that it won't work

Yeah. Except that's C++; what "we" (i.e., scala-native, really :-) care about is LLVM. I think unordered is fine for LLVM:

A fence A which has (at least) release ordering semantics synchronizes with a fence B with (at least) acquire ordering semantics if and only if there exist atomic operations X and Y, both operating on some atomic object M, such that A is sequenced before X, X modifies M (either directly or through some side effect of a sequence headed by X), Y is sequenced before B, and Y observes M.

From https://llvm.org/docs/LangRef.html#id219 (emphasis mine).

So, like C++, LLVM requires atomic operations (i.e., an atomic operation with any ordering constraint). Reading https://llvm.org/docs/LangRef.html#atomic-memory-ordering-constraints, it seems that the weakest one is unordered. So that should be fine.

Why the difference? I think it's simply because the weakest possible atomic operation in C++ is a relaxed one (which corresponds to LLVM monotonic). But LLVM introduces a weaker (but still atomic!) operation: unordered.

armanbilge commented 5 months ago

I think unordered is fine for LLVM ... LLVM introduces a weaker (but still atomic!) operation: unordered.

Thanks. Yes, I've been pondering the same. If true (and it seems like it should be), that's very good ... however, I no longer have a good intuition for what atomic actually means 😅 LLVM explains but I haven't really grokked it.

These operations are required to be atomic in the sense that if you use unordered loads and unordered stores, a load cannot see a value which was never stored. ... Some examples of unsafe optimizations are narrowing an assignment into a bitfield, rematerializing a load, and turning loads and stores into a memcpy call.

I just don't have a good enough understanding of how and why compilers apply those sorts of optimizations to unordered loads and stores, and furthermore why that would cause fences to have undefined behavior 🤔


We're getting very much offtopic here, but...

Ha, maybe 😅 I'll invoke All Fields Are Final in an attempt to defend this tangent:

Since memory models are great compromise about what users can expect, and what we are actually able to pull off, any semantics needs a plausible implementation.

i.e. understanding what we can implement and how is a prerequisite for defining the Cats Effect memory model. In particular, if it is impossible to efficiently implement final Field Semantics on Scala Native (because we don't control the entire compilation pipeline the way that the JVM/Hotspot does) then we should be more aggressive about designing and implementing CE in a way that does not rely on those semantics.

In any case, thanks for your patient and thoughtful engagement on this subject, much appreciated :)

durban commented 5 months ago

At this point, I think I'll just say that atomic is whatever the spec says is atomic; and if another part of the spec (e.g., the fence instruction) requires atomic, then anything that's defined to be atomic must be fine. :man_shrugging:

But, speculating a little bit: a compiler might sometimes want to generate a 64-bit write as two 32-bit writes (e.g., on a 32-bit platform, or something). This would not be atomic (the "transforms a store into multiple stores" part), and obviously could cause problems with reading values which were never written. And if I'm dereferencing a pointer which never existed... that's obviously very bad. (And I think it's undefined behavior exactly because they want compilers to be able to freely do these kinds of transformations.)

durban commented 5 months ago

Returning to CE a little bit...

if it is impossible to efficiently implement final Field Semantics on Scala Native (...) then we should be more aggressive about designing and implementing CE in a way that does not rely on those semantics

About designing (i.e., what we spec): it seems to me that the only way we'd really rely on final semantics is the one you mentioned earlier (when I asked), where Ref/Deferred has opaque (i.e., relaxed) semantics. To me that seems way too weak for a general purpose thing like Ref. (I guess it could be useful for some kind of statistical counter, but beyond that...)

The next stronger possible way to spec them seems to be Ref/Deferred having release/consume semantics. If I understand correctly, you say that "idiomatic CE code" should be fine with that. I have some opinions on that, but if we accept that, then said code doesn't rely on final semantics any more. (Since it publishes through Ref/Deferred, it could even have var fields, as long as it only writes them in the constructor. It will be fine due to the release/consume of the Ref/Deferred.)

About implementing: CE is full of intentionally racy internal code where it relies on finals. If it can't do that... it would need a lot of volatiles (or AtomicReferences, etc.). I don't think that would make anything faster... (But maybe you meant implementing Ref/Deferred specifically?)

armanbilge commented 5 months ago

About implementing: CE is full of intentionally racy internal code where it relies on finals. If it can't do that... it would need a lot of volatiles (or AtomicReferences, etc.). I don't think that would make anything faster

It might be helpful to talk about some specific examples. Here's one that I know of:

https://github.com/typelevel/cats-effect/blob/97b0174090d9c3837055695f504cd3d62cba69ff/core/shared/src/main/scala/cats/effect/IOFiber.scala#L164-L165

Currently this works via data race because the IO.Pure(outcome) publishes its final field. If it didn't, you could join a completed fiber and read an uninitialized null instead of the outcome.

If IO.Pure(...) no longer publishes its fields, then we will have to mark @volatile var _join. I don't think this should make anything slower.

Meanwhile, we removed the release/acquire barriers from all other uses of IO.Pure. That should make things faster.
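
To summarize with a simplified, hypothetical model of that race (not the actual IOFiber code):

// Completion publishes the outcome by racing a plain write against joiners.
class FiberModel {
  private[this] var _join: AnyRef = null // today: a plain var

  def complete(outcome: AnyRef): Unit =
    _join = outcome // a joiner may still read null, but if it reads non-null,
                    // IO.Pure's final field guarantees a fully constructed value

  def poll(): Option[AnyRef] = Option(_join)
}

// Without final field publication, _join would have to become @volatile (or
// use an explicit release/acquire pair) to rule out observing a non-null but
// not-yet-initialized outcome.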


The next stronger possible way to spec them seems to be Ref/Deferred having release/consume semantics. If I understand correctly, you say that "idiomatic CE code" should be fine with that. I have some opinions on that

Care to share? :)

durban commented 5 months ago

It might be helpful to talk about some specific examples

Yeah, you're right:

  1. The one you linked (_join): this is a good example; however, volatile is stronger than rel/acq. Also, as we previously discussed, on the JVM the acquire for finals is optimized to a nop :-)

  2. IOFiber#_cancel is basically the same thing.

  3. IOFiber#captureTrace reads the final field tracingEvents. If it reads null, we get no trace in our liveFiberSnapshot; not the end of the world, but also not nice.

Yeah, ok, I admit this is not that much... I remembered (imagined?) more :man_shrugging:

  1. If we count open PRs, there is also #3781. If I remember correctly, it counts on Node#triggerTime being a val and always being correctly visible.

Not intentional races in CE, but possible races in user code (I guess we could say that these are forbidden...):

  1. The WSTP is accessible (seen as an ExecutionContext) by user code which doesn't run on the WSTP. Its execute calls scheduleExternal, which reads val externalQueue, which then reads val queues in ScalQueue. Maybe it's not typical to use an executor obtained through a race, but I believe it is expected to work. (I mean... I'd expect executors to be thread-safe...) And the problem could be recursive: I think it is typical to store an executor in a final field...

  2. A more general version of the previous point: are the thread-safe objects in CE expected to be really thread-safe? E.g., if user code obtains an UnboundedAsyncQueue through a race, and finals "don't work", then its operations could fail with NPEs.


Not final fields, but data race (probably unintentional?):

  1. LocalQueue#getTotalFiberCount and others below it read plain vars which are longs; this is not atomic, and they can read values which were never written. (This is only used by the LocalQueueSamplerMBean, so again, not critical, but still...)

Care to share? :)

I will, just need time to think it through (and write it down) properly...

durban commented 5 months ago

So, let's imagine that we say idiomatic CE code should be fine with using Ref/Deferred and using only "structurally reachable" things. If I understand it correctly, this is approximately saying that Ref#set and Deferred#complete have release semantics, and Ref#get and Deferred#get have consume semantics. (And Ref#modify and friends have consume/release.)

I believe this would make totally reasonable code like this incorrect. In fact, it would make incorrect the whole idiom of using a Deferred[F, Unit] to "wake up" someone who should (re)try doing something with a Ref. The one which "wakes up" the other essentially does this:

state.set(newState); // release
deferred.complete(()); // release

While the other fiber:

deferred.get(); // consume a ()
state.get(); // consume a state; but this line could be reordered before the previous one!
...

So after waking up, it could read an "old" state (i.e., it doesn't necessarily see newState which was written by the first fiber).
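
Spelled out as a sketch (helper names hypothetical; this is fine under release/acquire, but broken under release/consume as described):

import cats.effect.IO
import cats.effect.kernel.{Deferred, Ref}

object WakeUpIdiom {
  // The "waker": update the state, then signal.
  def waker(state: Ref[IO, Int], wakeUp: Deferred[IO, Unit]): IO[Unit] =
    state.set(42) *>           // release
      wakeUp.complete(()).void // release

  // The "waiter": wait for the signal, then re-read the state.
  def waiter(state: Ref[IO, Int], wakeUp: Deferred[IO, Unit]): IO[Int] =
    wakeUp.get *> // consume of ()
      state.get   // consume of the state; nothing makes this read *depend*
                  // on the unit above, so with consume it may see old state
}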

armanbilge commented 5 months ago

however, volatile is stronger than rel/acq. Also, as we previously discussed, on the JVM the acquire for finals is optimized to a nop :-)

Yes. Well, let's be more precise about what exactly we are discussing ...

On the JVM, final Field Semantics aren't going away. We could define classes with vars instead of vals, but this might have other implications, and as you just described @volatile (which uses the seq_cst ordering) or even VarHandle would probably have worse performance.

On Scala Native, final Field Semantics will be configurable. Also, the current implementation of those semantics does rely on an acquire for reading final fields. And lastly, instead of clumsily applying @volatile we can actually load/store with the precise orderings that we want.

Suffice to say, it seems less likely that changing our implementation on the JVM will bring any optimizations. But we have a lot more potential to do so on Scala Native.


IOFiber#captureTrace reads the final field tracingEvents. If it reads null, we get no trace in our liveFiberSnapshot; not the end of the world, but also not nice.

I think this one is different from _join and _cancel. tracingEvents is initialized in the IOFiber constructor and its value does not change. To read its value you need a reference to the IOFiber itself, and in practice there are two ways you will obtain that reference:

  1. (indirectly) via a load with acquire ordering, which should also publish tracingEvents
  2. via data race, in which case 🤷 there were no visibility guarantees anyway

If we count open PRs, there is also https://github.com/typelevel/cats-effect/pull/3781

Yes, definitely count this one :) I admit this one definitely needs some thinking about.

The Node#triggerTime thing is annoying. If we assume that loading the uninitialized value always returns 0, maybe we could treat that as a special sentinel that's never a valid trigger time. But I'm not sure if that's safe to assume ...

And in fact there is another memory publication issue there that I've been worried about. The Node#callback used to resume the IOFiber runloop closes over state. It is obscured by the closure but we actually rely on final Field Semantics to get that state published so that the callback works when invoked by a stealing thread. https://github.com/typelevel/cats-effect/blob/85d6f3c86a345d179589414dfbb5bd33e0432702/core/shared/src/main/scala/cats/effect/IOFiber.scala#L628-L630

Instead of encoding that callback as an anonymous closure, we could implement it as a top-level (static) class that explicitly takes all its state in a constructor. Then in the implementation of apply we would code defensively against uninitialized state.

Putting aside those hacks, I think we could definitely make it work like this:

  1. emit a release fence prior to storing the Node in the heap
  2. if reading fields on the owner thread, use an ordinary load
  3. if reading fields on a stealing thread, emit an acquire fence before loading

Yes, I realize this is just replicating final Field Semantics 😅 and maybe that's the moral of the story here. It's very difficult to escape synchronization.


Not intentional races in CE, but possible races in user code (I guess we could say that these are forbidden...) ... are the thread-safe objects in CE expected to be really thread-safe?

Tough questions. I don't know. I see what you are saying, esp. about the definition of thread-safety. "Thread-safe but only if published with appropriate fencing" doesn't give the usual cozy thread-safe feeling.

In Scala Native, my argument has been that the number of developers designing algorithms based on data races is small and that the onus should fall to them to know what they are doing.

For Cats Effect users to encounter these sorts of things requires creeping out of the "idiomatic" patterns into a wildly unsafe world. May I cop-out and declare these things as implementation-specific? 😅


LocalQueue#getTotalFiberCount and others below it read plain vars which are longs; this is not atomic, and they can read values which were never written. (This is only used by the LocalQueueSamplerMBean, so again, not critical, but still...)

Mmm. Seems like a legit bug, independent of this discussion.


I believe this would make totally reasonable code like this incorrect. In fact, it would make incorrect the whole idiom of using a Deferred[F, Unit] to "wake up" someone who should (re)try doing something with a Ref.

This is a fantastic observation. Thank you so much for thinking about this :)

I am going to think more about it, but even if hypothetically there is a "better" or "more idiomatic" pattern that could be used instead, I absolutely agree with you that it is very very reasonable code. Paired with the fact that we cannot implement consume-like ordering anyway, we should just spec acquire for Ref.

durban commented 5 months ago

instead of clumsily applying @volatile we can actually load/store with the precise orderings that we want.

We can absolutely do this on the JVM too (versions > 8).

Suffice to say, it seems less likely that changing our implementation on the JVM will bring any optimizations. But we have a lot more potential to do so on Scala Native.

:+1:


IOFiber#captureTrace reading tracingEvents: I think we can read an IOFiber through a race. Here we read fibers from WeakBags, and we can see a fiber another thread is writing. Entry in WeakBag stores the fiber in a final field, but if those are not working the way they do on the JVM, we can see a "half-initialized" fiber.

(Besides, saying that there are "no visibility guarantees anyway" when reading through a race is a little strange to me. There are guarantees for final fields... we're just discussing relaxing them :-)


#3781: seeing "at least" the initial 0L is guaranteed (on the JVM definitely). But the problem is that if it's not final (this being a long), we can read "half" of it (I think it's called "tearing"?). (Edit: tearing is something else; this is just the lack of access atomicity.)

The closure is a very good point. (I was thinking about closures, but didn't put together the two things.) I expect this is a thing that will be easy to forget. (Btw., how do you use the new scala-native annotation on the fields of a closure?)

It's very difficult to escape synchronization.

:100:

If we try to avoid synchronization for performance in specific cases, like in #3781, and also avoid synchronization in general for final fields (like in scala-native 0.5)... that has consequences...


About thread-safety of WSTP and others:

In Scala Native, my argument has been that the number of developers designing algorithms based on data races is small and that the onus should fall to them to know what they are doing.

Sure. But maybe we're underestimating the number of unintentional data races? I suspect the reason the JVM has more or less "reasonable" semantics even for data races is exactly because they expected them to be common, and wanted to avoid the situation in C/C++, where it's undefined behavior. Whether they're really that common (or if they're common specifically in Scala) is a very good question...

May I cop-out and declare these things as implementation-specific?

That's basically saying "no" :slightly_smiling_face: (i.e., generic code will need to publish them with synchronization.)


This is a fantastic observation. Thank you so much for thinking about this :)

Sure thing, it's interesting stuff. (Although now I know more about the C++ memory model than I've ever wanted to...)

durban commented 4 months ago

Related issue: #2484.

armanbilge commented 4 months ago

(Sorry for slow cadence, got a bit behind on stuff :)


Besides, saying that there are "no visibility guarantees anyway" when reading through a race

Sorry, let me try to explain again. Your original comment was:

IOFiber#captureTrace reads the final field tracingEvents. If it reads null, we get no trace in our liveFiberSnapshot; not the end of the world, but also not nice.

If we are reading a reference to a fiber via a race, then we might not even see that fiber. As you say, not the end of the world, but also not nice. Suppose we did read a non-null reference to the fiber. Now we would attempt to read its tracingEvents field, and as you point out this could also be null. Still not the end of the world, and also not nice, but is it any more not nice? 🤷 We might not have seen this fiber at all anyway.

I think it would be especially not nice if we were guaranteed to see the reference to the fiber, but not guaranteed to see its contents. But when we are reading references to fibers themselves via a race, this is not the case: we are neither guaranteed to see the reference nor its contents. This is what I meant by "no visibility guarantees anyway".


seeing "at least" the initial 0L is guaranteed (on the JVM definitely)

I suppose so, but this is less clear to me on Scala Native. What confuses me is that the same addresses may be used over and over again for different objects as stuff gets GCed and allocated. So in the case of data race I don't understand what prevents you from reading (stale) values of an old, GCed object and how you can be confident you are seeing at least the "cleared" values i.e. 0s (which is still stale).

But the problem is that if it's not final (this being a long), we can read "half" of it

Mmm, right, you are referring to 17.7. Non-Atomic Treatment of double and long?


The closure is a very good point. (I was thinking about closures, but didn't put together the two things.) I expect this is a thing that will be easy to forget.

Indeed.

(Btw., how do you use the new scala-native annotation on the fields of a closure?)

By writing it out as a top-level class with the appropriate fields that extends FunctionN. Then those fields can be annotated as required.

On Scala JVM and JS, using a closure vs a class generates meaningfully different code (e.g. invokedynamic), but on Scala Native it ends up encoding to the same thing, i.e. the closure is encoded as a class. So writing it out as an explicit class shouldn't add any performance penalty.


But maybe we're underestimating the number of unintentional data races? I suspect the reason the JVM has more or less "reasonable" semantics even for data races is exactly because they expected them to be common

Tough to say. final fields are everywhere in functional Scala, which loves immutability, but that was less the case for Java, particularly at the time the JMM was introduced in Java 1.5 IIRC. So my gut feeling is that final field semantics had much less impact on Java code (and any data races that were common/uncommon) than it did for Scala code.

May I cop-out and declare these things as implementation-specific?

That's basically saying "no" 🙂 (i.e., generic code will need to publish them with synchronization.)

Haha, fair enough. I guess I was casting doubt on whether such code could be generic anyway, i.e. if someone has dropped to a low enough level that they have to worry about the safety of the WSTP when published through a data race, chances are they are writing code that already cannot be generic across Scala Native and Scala JVM and needs to be split into platform-specific implementations.

But as you say, maybe this "low level" is not quite as low and uncommon as I think it is.

durban commented 4 months ago

About "no visibility guarantees anyway": thanks, I understand now. I agree.


So in the case of data race I don't understand what prevents you from reading (stale) values of an old, GCed object

I know (almost) nothing about scala-native, but this indeed seems an important thing to think about; fwiw, I'd consider it a (serious) bug in scala-native if it's possible to see values from a GC'd object.

Mmm, right, you are referring to 17.7. Non-Atomic Treatment of double and long?

:+1:


By writing it out as a top-level class with the appropriate fields ...

I'd say the fact that we'd have to do this shows that this scala-native annotation is not that well thought through. The annotation itself could be made cross-compilation friendly (by making a stub for JVM/JS), but doing this requires split sources. Which are a pain...


I guess I was casting doubt on whether such code could be generic anyway

Ehh... I see your point. But I still see value in having certain "inherently" multi-threading-adjacent objects being "unconditionally" thread-safe. E.g., Thread objects themselves, ExecutionContexts, and so on.

durban commented 4 months ago

Further reading about final fields (and other related things, like NonAtomic vs. Unordered) in kotlin native: https://github.com/JetBrains-Research/KotlinMemoryModelResearch/blob/master/research/final-fields.md (and also other documents in that repo).

Even more reading: https://weakmemory.github.io/project-theses/studying-compilation-schemes-KN.pdf.