Closed woodsaj closed 8 years ago
Comment by woodsaj Thursday Aug 06, 2015 at 05:25 GMT
https://github.com/raintank/grafana/commit/ab1d2ddf48565e8e6d0c08fae15a92b97cece315 doesn't work as I think you expect it to. If a job is NACKed and sent back to rabbitmq, it will get re-processed within a short time frame. However because the job is already in the cache it will get marked as already done and removed from the queue.
I am also not a big fan of re-queuing all failed messages in general. If there is a fault and all jobs are being re-queued, the rate of messages being sent to rabbit will keep growing. I.e., if we are pushing 30 jobs/second and there is a fault, the rate will grow by 30 jobs/second every second, so after 5 minutes (300 seconds) we are pushing over 9000 jobs/second compared to the 30 jobs/second we should be seeing.
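To make the arithmetic above concrete, here is a minimal sketch of that worst case. It assumes nothing is ever acked during the fault and that every NACKed message comes back within the same second; `publishRate` is a hypothetical helper for illustration, not code from the PR.

```go
package main

import "fmt"

// publishRate returns the number of messages per second hitting the broker
// after faultSeconds of a total outage. Assumption (hypothetical model):
// newRate fresh jobs arrive every second, nothing is acked, and every
// NACKed message is re-delivered within the same second.
func publishRate(newRate, faultSeconds int) int {
	rate := newRate // healthy baseline: only fresh jobs
	for s := 0; s < faultSeconds; s++ {
		rate += newRate // the backlog of re-queued jobs grows by newRate each second
	}
	return rate
}

func main() {
	for _, s := range []int{0, 60, 300} {
		fmt.Printf("after %3d s of fault: %d msgs/s\n", s, publishRate(30, s))
	}
}
```

Under these assumptions the growth is linear in the length of the fault, which is where the "over 9000 after 5 minutes" figure comes from.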
Comment by Dieterbe Thursday Aug 06, 2015 at 19:40 GMT
> I am also not a big fan of re-queuing all failed messages in general,
Yup, this PR is a bit out of date, as my point of view has changed on this matter too (see #367). I'm now also more in favor of doing less rescheduling.
> If a job is NACKed and sent back to rabbitmq, it will get re-processed within a short time frame. However because the job is already in the cache it will get marked as already done and removed from the queue.
Nice catch! Perhaps if we decide to execute (after doing the atomic ContainsOrAdd) and hit an error, we should remove the job from the cache in case we decide to reschedule it.
Both items IMHO reinforce that we should be thoughtful about which kinds of errors result in "complete job with status=unknown, no error, i.e. send ack" and which would trigger an error/NACK, but let's continue that discussion in #367.
Issue by Dieterbe Friday Jul 24, 2015 at 22:24 GMT Originally opened as https://github.com/raintank/grafana/pull/371
putting my preferred strategy from #367 in place
Dieterbe included the following code: https://github.com/raintank/grafana/pull/371/commits