kevinkreiser / prime_server

non-blocking (web)server API for distributed computing and SOA based on zeromq

Job Cancellation #45

Closed kevinkreiser closed 7 years ago

kevinkreiser commented 8 years ago

When workers are performing relatively long tasks and the client no longer needs the result of such a task (maybe signaled via client disconnect?), we should be able to tell a given worker to abort its work. There are a few questions here though.

  1. Who controls whether or not the work is aborted? In the most rigorous sense you'd like to be sure the work was ended as soon as possible. The problem is that the thread or process that knows about the cancellation isn't the same one doing the work, so forcibly ending the work from outside is out. That leaves the worker to decide when it's done. The issue here is that the worker doesn't even control its work; that functionality is injected as the worker's work function. So the trick will be to find a way to give the work function access to the information that the worker is getting.
  2. How do we notify the worker's work function about aborting? Let's assume that the worker itself knows it should abort (we'll talk about how in another bullet point). How does the worker communicate this information to the work function? The obvious way would be to have the work function poll this state directly. To do this we'll need to expand the prototype of the work function to include a polling function that lets it hear about an abort of its job. We can have the polling function throw a custom exception when an abort matches the current job id. This can be caught outside of the work function and logged, and the worker can then tell the proxy it's ready for more work again. Note that this requires the implementer of the work function to actually call the polling function periodically. They have the option not to.
  3. How do we tell a given worker that its current job is to be cancelled? The issue here is one of synchronization. Sure, the server at the top knows that the job is outstanding, but it knows neither which stage of the pipeline it's on nor which worker has it. Indeed, a job could be returned before the abort makes it to the right place. So what this boils down to is that we have two options. Either we do a bunch of bookkeeping (we might need at least some of this for timeouts etc. anyway) or we broadcast the abort message and assume it happened. I personally like the latter. Even if it's UDP-style broadcasting where it's not guaranteed to get there, it should be good enough 99% of the time. The upside of this approach is that we wouldn't have to shuffle messages through the pipeline (essentially flood filling the tree) to be sure we found the right worker. Even with perfect bookkeeping you couldn't know exactly where to send the abort unless every proxy remembered every request it forwarded for a decently long amount of time, which sounds kludgy. So yeah, either flood fill or broadcast. The broadcasting might also be easier once we merge the outstanding zbeacon PR.
  4. What triggers an abort? The first option is that the job took longer than the configured time limit, in which case the server replies to the client that it timed out and tells the worker to abort. The second option is that the client makes a request but then disconnects, in which case no response is needed but we would like to not waste time completing a job we aren't sending anywhere. You could also allow for a specific request to abort another request. Since the server knows who the client is, it could abort any requests that the client has outstanding. There are some questions about having this behind other load balancers and whether or not keepalive might make it impossible for the client to correctly talk to the server servicing the request.
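
A rough sketch of what point 2 could look like, assuming a simplified string-in/string-out work function. Every name here (`job_aborted`, `interrupt_function_t`, `work_function_t`, `run_job`, and the single `cancelled_id` value with 0 meaning "nothing cancelled") is made up for illustration and is not the existing prime_server prototype.

```cpp
#include <atomic>
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical custom exception thrown by the polling function on an abort.
struct job_aborted : public std::runtime_error {
  explicit job_aborted(uint64_t id)
      : std::runtime_error("job " + std::to_string(id) + " aborted"), job_id(id) {}
  uint64_t job_id;
};

// Hypothetical polling function handed to the work function.
using interrupt_function_t = std::function<void()>;
// Hypothetical expanded work function prototype: request in, response out,
// plus the polling function the implementer may (or may not) call.
using work_function_t =
    std::function<std::string(const std::string&, const interrupt_function_t&)>;

enum class job_outcome { completed, aborted };

// What a worker might do around one job: build the polling function for the
// current job id, run the injected work function, and catch the abort outside
// of it. cancelled_id stands in for however the worker hears about aborts.
job_outcome run_job(uint64_t job_id, const std::string& request,
                    const work_function_t& work,
                    const std::atomic<uint64_t>& cancelled_id,
                    std::string& response) {
  interrupt_function_t interrupt = [&cancelled_id, job_id]() {
    // Throw only when the abort matches the job this worker is running.
    if (cancelled_id.load() == job_id) throw job_aborted(job_id);
  };
  try {
    response = work(request, interrupt);
    return job_outcome::completed;
  } catch (const job_aborted&) {
    // A real worker would log here and tell the proxy it is ready for more work.
    return job_outcome::aborted;
  }
}
```

A work function that never calls `interrupt()` simply runs to completion, which matches the "they have the option not to" caveat above.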
kevinkreiser commented 8 years ago

@noblige the above is a sketch to allow for cancelling requests/jobs. let me know what you think

noblige commented 8 years ago

@kevinkreiser I think at a minimum client disconnect should trigger an abort - this will save resources and also provides the client with a straightforward way to cancel synchronous operations. As for implementation - broadcasting abort requests (with some sort of job identifier) and having workers poll sounds like a reasonable approach.
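
For illustration only, the broadcast-plus-poll idea could be modeled in-process like this. The transport is left out entirely (no zeromq or zbeacon calls), and `abort_inbox` and `broadcast_abort` are hypothetical names, not anything in prime_server.

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_set>
#include <vector>

// Hypothetical per-worker record of the abort notices it has heard.
class abort_inbox {
public:
  // Called when an abort broadcast carrying this job id arrives.
  void notify(uint64_t job_id) {
    std::lock_guard<std::mutex> lock(mutex_);
    aborted_.insert(job_id);
  }
  // Called from the worker's polling function for its current job.
  bool is_aborted(uint64_t job_id) const {
    std::lock_guard<std::mutex> lock(mutex_);
    return aborted_.count(job_id) > 0;
  }
  // Drop a job id once it can no longer matter, so the set does not grow forever.
  void forget(uint64_t job_id) {
    std::lock_guard<std::mutex> lock(mutex_);
    aborted_.erase(job_id);
  }

private:
  mutable std::mutex mutex_;
  std::unordered_set<uint64_t> aborted_;
};

// Stand-in for the broadcast: every worker hears every abort, so the server
// needs no bookkeeping about which stage or worker currently holds the job.
void broadcast_abort(const std::vector<abort_inbox*>& workers, uint64_t job_id) {
  for (abort_inbox* worker : workers) worker->notify(job_id);
}
```

Workers that are not running the aborted job just ignore the notice, and a notice that never arrives only means the job runs to completion as it does today.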