Open utterances-bot opened 2 years ago
Awesome write up! There's a similar article from Dropbox for their generic task scheduler system.
Thanks -- I've updated the article with a "References" section, which contains a link to the Dropbox article and also a YouTube mock interview that tries to solve the same problem.
A few questions / comments to start a discussion
1. "Visibility timeout" is a cool concept! Thanks for highlighting it and I'll read more about it in the Amazon SQS doc!
Ya, whenever we have a setup where there's a queue and a pool of workers, the concept of leasing work-items is worth discussing. Otherwise, multiple workers might end up processing the same work-item at the same time.
Apart from leasing, it'd also be good to mention whether the work itself is idempotent, since a work-item can still be processed more than once if a lease expires before the worker finishes. (If the work is not idempotent, it might be worth diving further into the topic during the interview.)
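To make the leasing idea concrete, here is a minimal in-memory sketch of a visibility timeout. It is not how SQS is implemented; `LeaseQueue`, `record_result`, and the 30-second timeout are hypothetical names and values chosen only for illustration.

```python
import time
from collections import deque


class LeaseQueue:
    """In-memory queue where a dequeued item is hidden, not removed, until acked."""

    def __init__(self, visibility_timeout, clock=time.monotonic):
        self.visibility_timeout = visibility_timeout  # seconds (assumed unit)
        self.clock = clock
        self.ready = deque()  # item ids currently visible to workers
        self.leased = {}      # item id -> time at which its lease expires

    def enqueue(self, item_id):
        self.ready.append(item_id)

    def dequeue(self):
        """Lease the next visible item, or return None if there is none."""
        self._release_expired()
        if not self.ready:
            return None
        item_id = self.ready.popleft()
        self.leased[item_id] = self.clock() + self.visibility_timeout
        return item_id

    def ack(self, item_id):
        """Delete the item for good; returns False if it is not currently leased."""
        return self.leased.pop(item_id, None) is not None

    def _release_expired(self):
        # Items whose worker never acked in time become visible again.
        now = self.clock()
        for item_id, expires_at in list(self.leased.items()):
            if expires_at <= now:
                del self.leased[item_id]
                self.ready.append(item_id)


results = {}


def record_result(submission_id, verdict):
    """Idempotent write: the first verdict wins, repeated deliveries are no-ops."""
    return results.setdefault(submission_id, verdict)
```

A worker would call `dequeue()`, run the submission, call `record_result(...)`, and then `ack(...)`. If the worker dies before the ack, the item reappears after the timeout and another worker picks it up, which is why the result write is kept idempotent.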
2. If the submitted code has a bug that keeps crashing the workers, what's the strategy to get out of the "infinite loop" (i.e. workers keep trying the code and keep crashing)?
That's a great point! I didn't go into the details of this case in this article. This is another topic that can be discussed when we have a system built around queues (this will be a good candidate for another "Talking point" section in this article. I will try to add it when I get time).
One way to deal with this is via "dead letter queues" (looks like SQS supports this). SQS can keep track of the number of times a message has been "delivered" to a consumer (in our case, this happens whenever a worker dequeues a message). If the number of deliveries exceeds a certain threshold, it means that we have not been able to process the item after several retries. We can then configure our queue to automatically move this message to another queue (the dead-letter queue). We should set up monitoring on the dead-letter queue and manually check the messages that land there to see what went wrong.
Apart from isolating the message in another queue, we should also update the status column of the UserSubmissions table for this item (otherwise, it will stay in "ENQUEUED" status forever). For this, we should have a separate loop (maybe in the Controller itself, or in a new "Retry" component) which looks for work-items that have not been updated for more than, say, 5 minutes. The Retry process can change the status of any such item to "FAILED". That way, the user's submission status will also be (eventually) updated (to maybe something like "Internal Error").
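The two pieces described above (a delivery counter that diverts poison messages, and a loop that fails stale submissions) can be sketched roughly as follows. This is an in-memory illustration under assumptions, not the SQS API: `QueueWithDLQ`, `nack`, `mark_stale_as_failed`, the threshold of 3, and the row layout with `status` and `updated_at` fields are all hypothetical.

```python
from collections import deque

MAX_RECEIVES = 3           # assumed retry budget before dead-lettering
STALE_AFTER_SECONDS = 300  # the "5 minutes" from the discussion


class QueueWithDLQ:
    def __init__(self, max_receives=MAX_RECEIVES):
        self.max_receives = max_receives
        self.main = deque()
        self.dead_letter = []    # isolated items, to be monitored and inspected
        self.receive_count = {}  # item id -> number of deliveries so far

    def enqueue(self, item_id):
        self.main.append(item_id)

    def receive(self):
        """Return the next item, diverting items that exhausted their retry budget."""
        while self.main:
            item_id = self.main.popleft()
            count = self.receive_count.get(item_id, 0) + 1
            self.receive_count[item_id] = count
            if count > self.max_receives:
                self.dead_letter.append(item_id)
                continue
            return item_id
        return None

    def nack(self, item_id):
        """The worker crashed or timed out, so make the item visible again."""
        self.main.append(item_id)


def mark_stale_as_failed(submissions, now, stale_after=STALE_AFTER_SECONDS):
    """Body of the retry loop: flip long-untouched ENQUEUED rows to FAILED."""
    failed = []
    for submission_id, row in submissions.items():
        if row["status"] == "ENQUEUED" and now - row["updated_at"] > stale_after:
            row["status"] = "FAILED"
            row["updated_at"] = now
            failed.append(submission_id)
    return failed
```

With a real queue service the counting and the move to the dead-letter queue would be configured on the queue rather than written by hand; only the sweep over the UserSubmissions table would live in our own code.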
3. Can you please go into a little more detail on what happens when a queue crashes and how it recovers from a crash?
I don't know the exact details, but I'm guessing the items in the queue are persisted to disk whenever we enqueue/dequeue items. That would let SQS (or any other queue) rebuild its state from disk after a crash. To understand this properly, we would need to know the internals of SQS or a similar system -- which I am not very familiar with. (I'll try to read up on that if I get time.)
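As a rough illustration of that guess (and not a description of how SQS actually works), a queue can append every operation to a log file before applying it, then replay the log on startup. `DurableQueue` and the JSON-lines log format are hypothetical, and the sketch ignores torn writes, log compaction, and replication.

```python
import json
import os
from collections import deque


class DurableQueue:
    """Queue that appends every operation to a log file before applying it."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.items = deque()
        self._replay()
        self.log = open(log_path, "a", encoding="utf-8")

    def _replay(self):
        """Rebuild the in-memory state by re-applying the logged operations."""
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                if record["op"] == "enqueue":
                    self.items.append(record["item"])
                elif record["op"] == "dequeue":
                    self.items.popleft()

    def _append(self, record):
        self.log.write(json.dumps(record) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())  # force the record to disk before continuing

    def enqueue(self, item):
        self._append({"op": "enqueue", "item": item})
        self.items.append(item)

    def dequeue(self):
        if not self.items:
            return None
        self._append({"op": "dequeue"})
        return self.items.popleft()

    def close(self):
        self.log.close()
```

After a restart, constructing `DurableQueue` with the same path replays the log, so items that were enqueued but not yet dequeued are still there.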
Thank you for writing this article!
Thanks for reading the article and adding valuable comments!
Online judge (like leetcode, hackerrank, codechef) | sys-design-interview.github.io
https://sys-design-interview.github.io/online-judge.html