As-is, our `ArcStr::clone` and `ArcStr::drop` code is more or less taken verbatim from the implementation of `std::sync::Arc<T>` in the stdlib*. There are a lot of reasons for this, mainly the fact that it was, at one point, proven correct. However, `ArcStr` offers some meaningfully stronger guarantees, which could perhaps allow us to improve our implementation of `Drop`.
Right now `Drop` (for non-literal `ArcStr`s) is conceptually:

```rust
if self.count.fetch_sub(1, Ordering::Release) == 1 {
    // See footnote for a comment about this.
    let _ = self.count.load(Ordering::Acquire);
    // Call a non-inline function to compute the layout and deallocate the memory and such.
    unsafe { self.drop_slow() };
}
```
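For contrast, `clone` only needs a `Relaxed` increment: creating another handle doesn't publish or consume any data, it just needs the bump itself to be atomic (std's `Arc` does the same, plus an overflow guard). To have something self-contained to point at, here's a stripped-down sketch of the whole pattern; the names (`Inner`, `Handle`) are mine, not the crate's actual types:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

struct Inner {
    count: AtomicUsize,
    data: String,
}

struct Handle {
    ptr: *const Inner,
}

// Fine for this sketch: the payload is immutable and the count is atomic.
unsafe impl Send for Handle {}
unsafe impl Sync for Handle {}

impl Handle {
    fn new(data: String) -> Self {
        let inner = Box::new(Inner { count: AtomicUsize::new(1), data });
        Handle { ptr: Box::into_raw(inner) }
    }

    fn inner(&self) -> &Inner {
        unsafe { &*self.ptr }
    }
}

impl Clone for Handle {
    fn clone(&self) -> Self {
        // Creating a new handle publishes nothing to other threads, so the
        // increment only needs to be atomic, not ordered. (A real
        // implementation also guards against overflow; omitted here.)
        self.inner().count.fetch_add(1, Ordering::Relaxed);
        Handle { ptr: self.ptr }
    }
}

impl Drop for Handle {
    fn drop(&mut self) {
        // Same shape as the conceptual snippet above.
        if self.inner().count.fetch_sub(1, Ordering::Release) == 1 {
            let _ = self.inner().count.load(Ordering::Acquire);
            unsafe { drop(Box::from_raw(self.ptr as *mut Inner)) };
        }
    }
}
```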
However, given that we're immutable, it might give you pause that we need that `Ordering::Acquire`, which only ever executes on this thread (so its purpose can't be to ensure that other threads see the change we just made with the `fetch_sub`). This tracks with discussion in the bug report linked in the footnote, which describes it as essentially ensuring that the actual destructor sees the final view of the type, i.e. ensuring that our thread sees changes written by other threads. Well, we don't actually need those, since `ArcStr` is fully immutable.
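For a concrete picture of what that `Acquire` is for in std's `Arc` (and of the case that can't arise for `ArcStr`), consider a mutable payload that gets written just before a handle is dropped. Whichever thread's drop ends up running the destructor has to observe that write before the buffer is freed, and the Release decrement paired with the Acquire is what guarantees it. A small illustrative sketch (it just shows the shape of the scenario; nothing here can actually fail):

```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let shared = Arc::new(Mutex::new(Vec::new()));
    let remote = Arc::clone(&shared);

    let t = thread::spawn(move || {
        // Mutate the payload, then drop `remote` at the end of this closure:
        // that drop is the Release decrement.
        remote.lock().unwrap().push(42u64);
    });

    // Drop our handle without waiting. If the spawned thread's drop happened
    // first, ours is the one that runs Vec's destructor, and the Acquire in
    // drop is what makes the push visible to it before the buffer is freed.
    drop(shared);
    t.join().unwrap();
}
```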
In practice, it may also serve as an optimization barrier, ensuring that writes from `drop_slow` (e.g. writes from inside the memory allocator's dealloc) aren't moved before `self.count.fetch_sub`. This would be pretty wild, given that there's a conditional in between, but I guess in theory nothing strictly prevents it...
Further, https://en.cppreference.com/w/cpp/atomic/memory_order says that `memory_order_release` provides the following guarantee:

> A store operation with this memory order performs the release operation: no reads or writes in the current thread can be reordered after this store.
It's not 100% clear to me which part of the C or C++ standard this is using to back that up. In practice it does seem like a given, but most of `Release`'s semantics seem to be defined around the interaction with `Acquire`. I can almost see it given the definition of release sequences (in conjunction with happens-before, dependency-ordered-before, etc.), but it's hard for me to say anything that concrete. Anyway, this matters because without such a guarantee (and without an `Acquire`), there's nothing preventing another thread from doing its decref and then continuing to read from the `ArcStr`, despite a remote thread now being allowed to deallocate the str's memory.
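Concretely, the change being mulled over here is just deleting that one `load` from the snippet above, so the whole drop path becomes (hypothetical, not what the crate currently does):

```rust
if self.count.fetch_sub(1, Ordering::Release) == 1 {
    // No Acquire load or fence here: this leans entirely on the claim that
    // nothing on this thread (including the deallocation in drop_slow) can be
    // reordered before the Release RMW, plus the fact that only one thread
    // can ever take this branch.
    unsafe { self.drop_slow() };
}
```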
Do I think we might be able to remove the `Acquire` fence? Well, kinda:
It feels pretty likely that cppreference's interpretation (that reads on a thread cannot be moved after a `fetch_sub(1, Release)` on the same thread) is correct. In practice it would be impossible to implement `Mutex<T>` without a guarantee like this. (In order to change the code I would need more concrete evidence, and perhaps this is ultimately going to be the downfall, but... yeah.)
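As a gut check on that claim, here's a minimal spin lock sketch (nothing from this crate, just the textbook pattern): `unlock` is nothing but a Release store, and the entire point of it is that reads and writes performed while holding the lock must not be reordered after that store. It's a plain store rather than an RMW like `fetch_sub`, but the reordering question is the same.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

pub struct SpinLock {
    locked: AtomicBool,
}

impl SpinLock {
    pub const fn new() -> Self {
        SpinLock { locked: AtomicBool::new(false) }
    }

    pub fn lock(&self) {
        // Acquire: accesses in the critical section can't float above this.
        while self
            .locked
            .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            std::hint::spin_loop();
        }
    }

    pub fn unlock(&self) {
        // Release: accesses in the critical section can't sink below this.
        // If they could, "unlocking" would expose them to whoever locks next,
        // which is exactly the reordering the text above is worried about for
        // the Release fetch_sub.
        self.locked.store(false, Ordering::Release);
    }
}
```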
Additionally, while it's true that without the `Acquire` there's no ordering between the individual threads, it is still guaranteed that the `fetch_sub`s are atomic, and that at most one thread will ever see the count reach 0, be allowed entry into the conditional block, and be allowed to free the memory.
That said, neither of these is a really strong guarantee under the memory model, so I'd need something stronger to convince me. Hell, even with obvious mistakes in the ordering, AFAICT `loom` happily accepts my code... So maybe I need to install Coq or whatever theorem prover has an easily usable specification of the C memory model's atomics... Either way, it seems likely to be a while before I can do something like that.
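For what it's worth, the kind of `loom` model I mean looks roughly like this (a hypothetical test, not something from the actual test suite). It does pin down the "at most one thread takes the count to zero" property from the previous paragraph, but as noted, a model like this seems happy to accept orderings that are too weak:

```rust
use loom::sync::atomic::{AtomicUsize, Ordering};
use loom::sync::Arc;
use loom::thread;

#[test]
fn at_most_one_decrement_wins() {
    loom::model(|| {
        // Two outstanding handles, two concurrent Release decrements.
        let count = Arc::new(AtomicUsize::new(2));
        let remote = {
            let count = Arc::clone(&count);
            thread::spawn(move || count.fetch_sub(1, Ordering::Release) == 1)
        };
        let local_won = count.fetch_sub(1, Ordering::Release) == 1;
        let remote_won = remote.join().unwrap();
        // Exactly one of the two observes the old value 1 (i.e. takes the
        // count to zero), in every interleaving loom explores.
        assert!(local_won ^ remote_won);
    });
}
```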
Anyway, this was a bit rambling, and I probably repeated myself, but I wanted to get it down somewhere so that I can later take a look and say "ah no, I already thought about that" and such.
* Okay, not quite verbatim for `ArcStr::drop`. We make a small modification, replacing an explicit `atomic::fence(Acquire)` with a `.load(Acquire)` of the refcount. The stdlib actually does this too, but only when running under ThreadSanitizer. At one point there was a PR for this change (https://github.com/rust-lang/rust/pull/41714), which AFAICT mainly didn't merge because the policy at the time was to only accept PRs like this with benchmarks showing improvements, which is hard given that the distinction here matters mainly under high contention and is specific to how the cache is being used.
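For concreteness, the two drop paths the footnote is contrasting look roughly like this (same conceptual style as the snippet at the top; paraphrased, not copied from either codebase):

```rust
// What std::sync::Arc does by default:
if self.count.fetch_sub(1, Ordering::Release) == 1 {
    std::sync::atomic::fence(Ordering::Acquire);
    unsafe { self.drop_slow() };
}

// What we do (and what std does when built for ThreadSanitizer):
if self.count.fetch_sub(1, Ordering::Release) == 1 {
    let _ = self.count.load(Ordering::Acquire);
    unsafe { self.drop_slow() };
}
```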
However, it's possibly worth noting that this means the variant we use, while (IMO) "obviously" equivalent to using a fence, is not actually identical to what was proven correct in RustBelt.