Closed jonatanklosko closed 1 year ago
❤️❤️❤️🐥🔥
Released in v2.1.2. Thank you so much!
Beautiful, thanks! <3
With this change I am seeing issues where the presence ETS values tables on different nodes are drifting out of sync. It seems that presence entries from processeses on other nodes are not always cleaned up. This forced me to roll back to 2.1.1 for the time being. I will see if I can come up with a reproducible case.
@arjan are you comparing the ETS rows, or the result of Presence.list
? Also, do they differ just in excess processes or the metadata itself?
I went through the changes a couple times and so far haven't pinpointed anything that would introduce any meaningful difference.
I noticed it because our stats use the output of Presence.list
and the counts were way off yesterday morning. Hopefully I can reproduce it today.
Tracker update is handled as leave + join and ETS updates are applied in this notion, that is, a delete followed by an insert. Since tracker lookup reads ETS on the caller side, it can happen concurrently to tracker update, so there is a race condition when the ETS read happens between said delete and insert.
The "values" ETS table (which keeps the metadata) is an :ordered_set, which means that we can do an atomic update just by inserting the new value, without prior delete. So I made two changes:
State.leave_join
, which works just likeState.leave
+State.join
, but without the unnecessary deleteState.merge
to gather desired inserts/deletes to ETS, do a simple diff and only then apply the changes (which makes an entry update atomic)The changes only affect applying changes to the internal ETS storage, the update is reported as leave + join in exactly the same way.
Context: in Livebook we use tracker for registering notebook session processes and we also store basic session information necessary for listing, which gets updated. We have a lot of tests that create and lookup sessions and we started to observe more and more intermittent test failures that basically go: create session -> lookup session -> session not found. Running tests in a loop I could reproduce this behaviour and tracked it down to this race condition, with this PR it no longer happens.