Closed kenkku closed 10 years ago
I'd say no, because we can expect runner to retry the task later.
Actually we have an unwritten policy that fatal errors should be logged as errors, things that talk to third parties (database and gearman messages) should be notices (so that users see them by default), important events as infos and dev messages as debugs.
The retry controller ejects a task after giving up, which is, as far as I know, the intended result.
The runner is responsible for retrying task execution if controller for some reason fails to execute the task.
However it is not a part of "normal" operation that the controller fails to complete the work through a valid end user worker. If this happens, it should be logged as an error. If controller is configured to retry the task it should do so, otherwise it should just leave the task to the database for the runner to retry later.
Controller should eject the task from the database if and only if the end user worker for the submitted func_name has completed successfully. Ejecting after giving up completely breaks the runner level retries so it should be only custom extra controllers that have such logic in them. The default retry controller should not do such a thing.
What if the end user worker explicitly fails the task with WORK_FAIL? Currently the task is ejected also in this case.
How do you distinguish explicit fail notice from worker timeout?
Having a default retry controller that by default ejects a task when the task timeouts does not really sound what I would expect it to do.
In my view that basic principle for default behaviour should be:
If job configuration does not explicitly state that gearsloth should stop trying to run it after X tries, gearsloth should try to run the task indefinitely until it is completed succesfully.
The component responsible for keeping this retrying happening should be the runner.
If the job configuration does not explicitly state that the default retry controller should eject the task in certain situations, the controller should eject the task only when the task is completed succesfully.
The purpose of retrycontroller retries is to allow near immediate retrying in case only some workers behave badly and some still can complete the task as expected.
This is not about the retry controller ejecting a task when it timeouts by default, there is a configurable retrycount, after which the controller gives up. The default is not giving up. I think it makes no sense to not eject a task after the configured retry count (if it's not set to infinite), that would result in the controller trying to run the task more times than the user configured (if the runner re-sends the task since it's not ejected).
If there is a controller which decides to eject the task even though it is not run properly, I believe the task should be disabled instead of cleared from the database.
The runner already has a logic like this present. If we aim to give the controller the power to "prematurely stop retrying" and thus circumvent the whole runner retry logic, I think the controller should have the same options as the runner has: Either to clear the task from the DB or to mark the task as disabled in the db.
The current thinking has been that the controller does not have this kind of a power. It's task is only to run the task, never to disable a task without running it completely.
I will iterate the thinking behind these two types of retry points:
The runner retry is meant to guard against larger system failures where all of the workers are down and it might take a while for them to come up again.
The controller retry is meant to guard against individual malfunctioning workers where submitting the job immediately after a failure might result in a run that completes successfully.
The new controller logic that was introduced a week ago fixes this issue as it logs an error each time a worker task fails when executed from the controller. I will close this and create an another task to allow the controller the ability to disable tasks instead of just ejecting them as "done".
Should the logging level for this be bumped up? Giving up on a task is not normal operation - maybe it should be a notice or even a warning.