So I'm not sure I'd call this a points of pain overview so much as good docs ;P.
We've discussed a bunch offline; two thoughts that are really only relevant here: your scheduler failure modes are simple bugs, and I think they should be fixed in situ, because that can be done quickly. To wit: don't de-and-requeue things when there is no work slot available (that's not the task failing); use system metrics to inform work slot availability (e.g. if there is IO overload, don't schedule more work); place work immediately when slots are freed up (e.g. schedule work at the end of your cleanup of a work slot); cap exponential backoff (e.g. at 5 minutes); discard work after (say) 10 attempts; and finally, implement a quick-reset mechanism to zero the queue and allow an immediate restoration of service without mucking around.
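To make the list above concrete, here is a rough sketch of how those fixes might hang together. It is not your scheduler: `Task`, `Scheduler`, `place`, `finish`, `reset`, the `overloaded` callback and the 1-second backoff base are all names and values I made up for illustration. Only the 5-minute cap and the 10-attempt limit come from the suggestions above.

```python
import collections
from dataclasses import dataclass

BACKOFF_BASE_S = 1.0   # assumed base delay; pick whatever suits the workload
BACKOFF_CAP_S = 300.0  # cap exponential backoff at 5 minutes
MAX_ATTEMPTS = 10      # discard work after this many real failures


def backoff_delay(attempts: int) -> float:
    """Delay after the given number of failures (1-based), capped."""
    return min(BACKOFF_CAP_S, BACKOFF_BASE_S * 2 ** (attempts - 1))


@dataclass(eq=False)  # compare tasks by identity, not by field values
class Task:
    name: str
    attempts: int = 0        # counts task failures only, never "no slot free"
    not_before: float = 0.0  # earliest time the task may be retried


class Scheduler:
    def __init__(self, slots, overloaded=lambda: False):
        self.free_slots = slots
        self.overloaded = overloaded  # hypothetical hook onto system metrics
        self.queue = collections.deque()
        self.running = []
        self.discarded = []

    def submit(self, task, now=0.0):
        self.queue.append(task)
        self.place(now)

    def place(self, now):
        """Start eligible tasks; tasks that don't fit stay queued untouched."""
        if self.overloaded():
            return  # e.g. IO overload: don't schedule more work
        for task in list(self.queue):
            if self.free_slots == 0:
                break  # no slot is not a failure: no requeue, no backoff
            if task.not_before <= now:
                self.queue.remove(task)
                self.free_slots -= 1
                self.running.append(task)

    def finish(self, task, ok, now):
        """Call at the end of slot cleanup for a task that has stopped."""
        self.running.remove(task)
        self.free_slots += 1
        if not ok:
            task.attempts += 1
            if task.attempts >= MAX_ATTEMPTS:
                self.discarded.append(task)
            else:
                task.not_before = now + backoff_delay(task.attempts)
                self.queue.append(task)
        self.place(now)  # fill the freed slot immediately

    def reset(self):
        """Quick reset: zero the queue so service can resume at once."""
        self.queue.clear()
```

The point of the shape is that `attempts` and the backoff timer only move in `finish`, so a busy system never penalises queued work, and that `finish` ends by calling `place`, so a freed slot never sits idle waiting for the next poll.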