phoenixframework / phoenix_pubsub

Distributed PubSub and Presence platform for the Phoenix Framework
http://www.phoenixframework.org/
MIT License
653 stars 124 forks source link

Update ETS atomically on tracker update #172

Closed jonatanklosko closed 1 year ago

jonatanklosko commented 1 year ago

Tracker update is handled as leave + join and ETS updates are applied in this notion, that is, a delete followed by an insert. Since tracker lookup reads ETS on the caller side, it can happen concurrently to tracker update, so there is a race condition when the ETS read happens between said delete and insert.

The "values" ETS table (which keeps the metadata) is an :ordered_set, which means that we can do an atomic update just by inserting the new value, without prior delete. So I made two changes:

The changes only affect applying changes to the internal ETS storage, the update is reported as leave + join in exactly the same way.

Context: in Livebook we use tracker for registering notebook session processes and we also store basic session information necessary for listing, which gets updated. We have a lot of tests that create and lookup sessions and we started to observe more and more intermittent test failures that basically go: create session -> lookup session -> session not found. Running tests in a loop I could reproduce this behaviour and tracked it down to this race condition, with this PR it no longer happens.

chrismccord commented 1 year ago

❤️❤️❤️🐥🔥

chrismccord commented 1 year ago

Released in v2.1.2. Thank you so much!

jonatanklosko commented 1 year ago

Beautiful, thanks! <3

arjan commented 1 year ago

With this change I am seeing issues where the presence ETS values tables on different nodes are drifting out of sync. It seems that presence entries from processeses on other nodes are not always cleaned up. This forced me to roll back to 2.1.1 for the time being. I will see if I can come up with a reproducible case.

jonatanklosko commented 1 year ago

@arjan are you comparing the ETS rows, or the result of Presence.list? Also, do they differ just in excess processes or the metadata itself?

I went through the changes a couple times and so far haven't pinpointed anything that would introduce any meaningful difference.

arjan commented 1 year ago

I noticed it because our stats use the output of Presence.list and the counts were way off yesterday morning. Hopefully I can reproduce it today.