hackclub / sinerider

💖 A game about love and graphing, built by teenagers.
https://sinerider.com
GNU Affero General Public License v3.0

Plan for scaling the scoring service #293

Closed grymmy closed 1 year ago

grymmy commented 1 year ago

Description

If we intend to scale the scoring service up to multiple instances to increase throughput, a plan should be made for how that would be done. Things we might want to ask ourselves:

How many instances of the scoring service will we provision for launch day?

How is the work queue for each individual instance handled? (a global queue backed by a database, or each instance maintaining its own work queue?)

What is the consequence of an individual instance dying? (will scoring requests be lost? will they be retried?)

Screenshots

No response

Additional information

No response

alhardwarehyde commented 1 year ago

@polytroper thoughts on this?

polytroper commented 1 year ago

Oof. So many things, most of which are out of my wheelhouse. Here is my best guess.

The Twitter thing is a bit of an odd case because it is going to be backed up, in a way, by Zapier, which should retain the ability to retry requests until they succeed. However, if for any reason the zap does not capture these requests, they will be lost. Ultimately we want a system that can search for tweets that tag @SineRiderGame in chronological order and reach back into the past to the last point at which it can verify all prior solutions have been scored. For the Reddit bot we do it via the Reddit API. For the Twitter bot maybe we do it through TweetDeck and a Selenium automation, for now anyway.
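To make that reach-back idea concrete, here is a minimal sketch of the catch-up loop, independent of how mentions are actually fetched; the `fetchMentionsPage`, `isAlreadyScored`, and `enqueueForScoring` callbacks are hypothetical stand-ins for the Twitter plumbing and the Airtable lookup, not anything that exists in the repo today.

```typescript
// Sketch of a catch-up pass over Twitter mentions of @SineRiderGame.
// Walks backwards in time until it finds a tweet that is already scored,
// queueing anything the zap may have missed along the way.
// The callbacks below are hypothetical stand-ins, not real repo code.

interface Mention {
  tweetId: string
  text: string
}

interface CatchUpDeps {
  // One page of mentions, newest first, strictly older than `beforeId`.
  fetchMentionsPage: (beforeId?: string) => Promise<Mention[]>
  // True if Airtable already has a scored entry for this tweet.
  isAlreadyScored: (tweetId: string) => Promise<boolean>
  // Hand a missed mention to the scoring pipeline.
  enqueueForScoring: (mention: Mention) => Promise<void>
}

export async function catchUpOnMentions(deps: CatchUpDeps): Promise<void> {
  let cursor: string | undefined

  while (true) {
    const page = await deps.fetchMentionsPage(cursor)
    if (page.length === 0) return // no more history to inspect

    for (const mention of page) {
      if (await deps.isAlreadyScored(mention.tweetId)) {
        // Everything older than this point has already been handled.
        return
      }
      await deps.enqueueForScoring(mention)
    }

    cursor = page[page.length - 1].tweetId
  }
}
```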

When the bots find a solution to score, they immediately put it in Airtable and ping one of n scoring server instances to notify it of new solutions. When a scoring server boots up or is pinged, it begins scoring any unscored solutions. When scoring is complete, it marks the solution as scored in Airtable and hits an endpoint on the appropriate chatbot server to post the reply. Upon successful confirmation, the chatbot marks the solution as replied in Airtable.
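A rough sketch of the "score anything unscored" pass that each scoring instance could run on boot or on ping, using the Airtable JS client; the table name, field names, `renderAndScore` helper, and reply endpoint are illustrative assumptions, not the repo's actual schema:

```typescript
// Minimal sketch of the pass a scoring instance runs on boot or when pinged.
// Table name, field names, the renderAndScore helper, and the reply endpoint
// are illustrative assumptions.
import Airtable from 'airtable'

const base = new Airtable({ apiKey: process.env.AIRTABLE_API_KEY! }).base(
  process.env.AIRTABLE_BASE_ID!
)

// Hypothetical stand-in for the expensive video render + scoring step.
async function renderAndScore(
  expression: string
): Promise<{ score: number; videoURL: string }> {
  throw new Error('not implemented in this sketch')
}

export async function scoreUnscoredSolutions(): Promise<void> {
  // Grab every row the bots have inserted that no instance has scored yet.
  const unscored = await base('Solutions')
    .select({ filterByFormula: 'NOT({scored})' })
    .all()

  for (const record of unscored) {
    const { score, videoURL } = await renderAndScore(
      record.get('expression') as string
    )

    // Mark the solution as scored so other instances skip it.
    await base('Solutions').update(record.id, { scored: true, score, videoURL })

    // Ask the originating chatbot to post the reply; on success the chatbot
    // marks the solution as replied in Airtable.
    await fetch(record.get('replyEndpoint') as string, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ recordId: record.id, score, videoURL }),
    })
  }
}
```

With several instances pulling from the same table, something like a claimed-by field (or an atomic check before updating) would also be needed so two instances don't score the same solution; that detail is omitted above.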

I think this architecture should be fault-tolerant for all servers, and easily scalable for the compute-heavy scoring server.

grymmy commented 1 year ago

The plan is as follows:

1) We have levers we can pull in the scoring service to trade off quality vs. speed of the videos we create.
2) We have implemented a simple concurrency-limiting feature in the scoring service to rate-limit how many videos can be generated at any given time on a single instance. It is currently constrained to 1, but we can experiment with raising it and see if that causes problems (see the sketch below).
3) If needed, we can add a round-robin load balancer in front of sinerider-scoring.hackclub.com and add new scoring instances to the cluster.
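As a rough illustration of item 2, a limiter along these lines caps how many renders run at once on a single instance; this is a minimal sketch under the current limit of 1, not the service's actual implementation:

```typescript
// Minimal sketch of a per-instance concurrency limiter for video generation.
// MAX_CONCURRENT_RENDERS mirrors the "currently constrained to 1" setting;
// the actual service may implement this differently.
const MAX_CONCURRENT_RENDERS = 1

let active = 0
const waiting: Array<() => void> = []

async function acquire(): Promise<void> {
  if (active < MAX_CONCURRENT_RENDERS) {
    active++
    return
  }
  // Queue up until a running render releases its slot.
  await new Promise<void>((resolve) => waiting.push(resolve))
  active++
}

function release(): void {
  active--
  const next = waiting.shift()
  if (next) next()
}

// Wrap the expensive render step so at most MAX_CONCURRENT_RENDERS run at once.
export async function withRenderSlot<T>(render: () => Promise<T>): Promise<T> {
  await acquire()
  try {
    return await render()
  } finally {
    release()
  }
}
```

Raising the limit would then be a matter of changing the single constant (or an environment variable behind it) and watching memory/CPU on the instance.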