illinois-cs241 / broadway

A distributed systems framework used for running distributable workloads.

Test Reconnecting Nodes #19

Open ayushr2 opened 4 years ago

ayushr2 commented 4 years ago

We should test that when a node reconnects, its info is preserved and it can continue from where it left off.

zhengyao-lin commented 4 years ago

There is one scenario in the websocket version of the protocol that's currently problematic:

When the API restarts, graders are interrupted with a disconnection exception. If a grader has a job in progress during the restart, the API will assume the grader is still working on the original job (since under the HTTP protocol the grader would continue the job and submit the result in this case). The "running" flag of such a grader is not cleared in the API once the restart is done.

I think we should probably:

  1. add a reconnection mechanism to the websocket version of the grader instead of relying entirely on Docker (and try to preserve the status of an ongoing job for as long as possible)
  2. add some kind of draining mechanism to the API that temporarily blocks incoming grading requests when we expect a restart (e.g. when deploying a new version)
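
A rough sketch of what point 1 could look like on the grader side, assuming the grader keeps its current job ID in memory and reports it when it re-registers so the API can reconcile its "running" flag. The names `GraderState`, `connect_with_retry`, and the `register` message shape are hypothetical and not taken from the existing protocol; `connect` stands in for whatever opens the websocket.

```python
import itertools
from dataclasses import dataclass
from typing import Iterator, Optional


def backoff_delays(base: float = 1.0, cap: float = 30.0) -> Iterator[float]:
    """Yield exponentially growing delays in seconds, capped at `cap`."""
    for attempt in itertools.count():
        # Clamp the exponent so the float multiplication can never overflow.
        yield min(cap, base * 2 ** min(attempt, 32))


@dataclass
class GraderState:
    worker_id: str
    current_job_id: Optional[str] = None  # kept across reconnects

    def handshake(self) -> dict:
        # Hypothetical message: tells the API which job (if any) is still running.
        return {
            "type": "register",
            "worker_id": self.worker_id,
            "running_job": self.current_job_id,
        }


def connect_with_retry(connect, state: GraderState, sleep, max_attempts: int = 5):
    """Call `connect(handshake)` until it succeeds or attempts run out.

    `connect` and `sleep` are injected so the loop can be tested without a
    real websocket or real waiting.
    """
    for _, delay in zip(range(max_attempts), backoff_delays()):
        try:
            return connect(state.handshake())
        except ConnectionError:
            sleep(delay)
    raise ConnectionError("gave up reconnecting to the API")
```

With something like this, the API could clear the "running" flag whenever a grader re-registers with `running_job` set to `None`, and keep it otherwise.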

ayushr2 commented 4 years ago

Nice idea. We would need to drain the queue before we can shut down the API.
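
Something along these lines might work for the draining part. This is only a sketch under assumptions: `JobScheduler`, `DrainingError`, and the method names are made up for illustration and do not correspond to the current API code. The idea is that once draining starts, new submissions are rejected while queued and running jobs are allowed to finish.

```python
from collections import deque


class DrainingError(Exception):
    """Raised when a job is submitted while the API is draining."""


class JobScheduler:
    def __init__(self):
        self.queue = deque()   # jobs waiting for a grader
        self.running = set()   # jobs currently held by graders
        self.draining = False

    def submit(self, job_id):
        # Block new grading requests once a restart is expected.
        if self.draining:
            raise DrainingError("API is draining; retry after the restart")
        self.queue.append(job_id)

    def poll(self):
        """Hand the next queued job to a grader, or None if the queue is empty."""
        if not self.queue:
            return None
        job_id = self.queue.popleft()
        self.running.add(job_id)
        return job_id

    def finish(self, job_id):
        self.running.discard(job_id)

    def start_drain(self):
        self.draining = True

    def safe_to_shutdown(self):
        # Only safe once draining was requested and no work is left anywhere.
        return self.draining and not self.queue and not self.running
```

A deploy script could then call `start_drain()`, wait until `safe_to_shutdown()` returns `True`, and only then restart the API.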