raintank / worldping-api

Worldping Backend Service

Errors #16

Closed by woodsaj 8 years ago

woodsaj commented 8 years ago

Issue by Dieterbe, Friday Jul 24, 2015 at 22:24 GMT. Originally opened as https://github.com/raintank/grafana/pull/371


putting my preferred strategy from #367 in place


Dieterbe included the following code: https://github.com/raintank/grafana/pull/371/commits

woodsaj commented 8 years ago

Comment by woodsaj Thursday Aug 06, 2015 at 05:25 GMT


https://github.com/raintank/grafana/commit/ab1d2ddf48565e8e6d0c08fae15a92b97cece315 doesn't work the way I think you expect it to. If a job is NACKed and sent back to RabbitMQ, it will be redelivered and re-processed within a short time. However, because the job is already in the cache, it will be treated as already done and removed from the queue without being executed.

I am also not a big fan of re-queuing all failed messages in general. If there is a fault and all jobs are being re-queued, the rate of messages being sent to RabbitMQ will keep accelerating. I.e., if we are pushing 30 jobs/second and there is a fault, the rate will grow by 30 jobs/second every second, so after 5 minutes we are pushing over 9000 jobs/second compared to the 30 jobs/second we should be seeing.
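The arithmetic can be checked with a small simulation. This is a sketch under a simplifying assumption of ours: every failed job is NACKed, redelivered and fails again within one second, so each failed job is re-sent once per second indefinitely. The `dispatchRate` function is a hypothetical helper, not part of worldping. Under that assumption the rate after `t` seconds is `30 * (t + 1)`, i.e. 9030 jobs/second after 300 seconds.

```go
package main

import "fmt"

// dispatchRate returns the jobs/second sent to RabbitMQ after `seconds`
// seconds of a total fault, assuming every failed job is requeued and
// redelivered (and fails again) once per second.
func dispatchRate(newPerSecond, seconds int) int {
	requeued := 0 // failed jobs being redelivered each second
	for s := 0; s < seconds; s++ {
		// Everything sent during second s (new + requeued) fails and is
		// requeued, so the requeued pool grows by this second's new jobs.
		requeued += newPerSecond
	}
	return newPerSecond + requeued
}

func main() {
	fmt.Println(dispatchRate(30, 300)) // 9030 jobs/second after 5 minutes
}
```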

woodsaj commented 8 years ago

Comment by Dieterbe Thursday Aug 06, 2015 at 19:40 GMT


> I am also not a big fan of re-queuing all failed messages in general,

Yup, this PR is a bit out of date, as my point of view on this matter has changed too (see #367). I'm now also more in favor of doing less rescheduling.

> If a job is NACKed and sent back to rabbitmq, it will get re-processed within a short time frame. However because the job is already in the cache it will get marked as already done and removed from the queue.

Nice catch! Perhaps if we decide to execute (after doing the atomic ContainsOrAdd) and hit an error, we should remove the job from the cache whenever we decide to reschedule it.

Both items IMHO reinforce that we should be thoughtful about which kinds of errors result in "complete job with status=unknown, no error, i.e. send ack" and which should trigger an error/NACK, but let's continue that discussion in #367.