thewca / worldcubeassociation.org

All of the code that runs on worldcubeassociation.org
https://www.worldcubeassociation.org/
GNU General Public License v3.0
328 stars 176 forks source link

Teams are cached for too long #4535

Closed lgarron closed 2 years ago

lgarron commented 5 years ago

While debugging tests for #4534 we found that Team.wrc was cached across tests. In theory, this could have repercussions in production (although the impact is limited in practice, because team members don't change often).

Jambrose777 commented 5 years ago

I believe this is the relevant code:

https://github.com/thewca/worldcubeassociation.org/blob/eaac15363afd997b80a21e62cb24a9ee1f8736bd/WcaOnRails/app/models/team.rb#L70-L73

viroulep commented 5 years ago

I would assume that clearing the class variable whenever a team member change is made should be sufficient; however we have two worker running in separate processes in production so we would only have one update. Another approach would be to take into account some updated_at field for the team when building the @@teams_by_friendly_id variable, and update that field when a team member is added/removed/changed.

viroulep commented 4 years ago

Completing this issue: this actually raises issues when syncing mailing lists, and may not change the members appropriately until the cache is cleared (ie: the unicorn restart we do when we deploy).

Checking the updated_at field kind of ruins the point of caching as we'd have to emit a select. I don't have a great suggestion for this; another alternative would be to send a restart signal to unicorn before going into the daily jobs.

viroulep commented 4 years ago

Turns out, the process running the Delayed::Job is different from the unicorn one, so we'd probably need to do both restart_app and restart_dj.

viroulep commented 4 years ago

Another alternative: pass some force_reload optional argument when fetching a team, that we can pass when calling Team.board in the rake task.

gregorbg commented 2 years ago

This error occured again recently and is critical to some operational WCA procedures (Delegates falsely detected as being on probation, confidential mails still being forwarded to people who have been removed from the team). Unfortunately, I don't have sufficient insight into our Rails caching to be able to figure this out.

Summoning our resident sage @jonatanklosko for your Rails and infrastructure expertise 🙏

As the teams page on the WCA page is public, I think it's fine to share the following bit of diagnosis: 2015MATT05 is no longer a WQAC member. SSHing into the production machine and querying WQAC team members through the bin/rails c console prompt indeed shows that Enzo is not listed as member. But the background jobs keep adding him to the Google Workspace mailing list, even when we manually remove him through the web admin interface. So clearly, he (i.e. the Team.wqac team with all its members) is cached somewhere. But I don't know how to proceed from here 😕

jonatanklosko commented 2 years ago

To summarize the problem, we query teams once and store them in the @@teams_by_friendly_id class variable (that's the cache in question). The variable is never invalidated, until we restart the given app process.

Because there are multiple separate processes I don't think there's a straightforward way of invalidating the cache in all of them on demand (when a member is added/removed).

One option I see is invalidating this cache periodically, so in

https://github.com/thewca/worldcubeassociation.org/blob/eaac15363afd997b80a21e62cb24a9ee1f8736bd/WcaOnRails/app/models/team.rb#L70-L73

we could do something like this

@@teams_by_friendly_id ||= nil

if @@teams_by_friendly_id.nil? || @@teams_by_friendly_id_timestamp < 15.minutes.ago
  @@teams_by_friendly_id = all.with_hidden.index_by(&:friendly_id)
  @@teams_by_friendly_id_timestamp = DateTime.now
end

@@teams_by_friendly_id

This way we never use cached teams older than 15 minutes. Given we sync mailing list every hour, with this approach the worst-case propagation time would be 1 hour and 15 minutes.

@viroulep let me know if you have any thoughts on this :)

viroulep commented 2 years ago

Just for the record: I don't have more to add to Jonatan's analysis; I did remember this issue happened once in the past, but I guess we let it go, maybe because at the time we had frequent enough deploys (or less team changes) so that issue wasn't very relevant.

I definitely support Jonatan's workaround! Unfortunately there is no way to properly invalidate the cache on team change because we have multiple instances running in different processus :s