iron-io / issues

For Iron.io services issue tracking. Public facing issue tracking for behind the scenes issues.

Managing IronWorker errors under high-volume seems incredibly difficult #61

Open calebjclark opened 11 years ago

calebjclark commented 11 years ago

We're running into a feature gap with IronWorker that is causing us to rethink whether we should use it in production. We love so much about IronWorker that we're hoping someone can help us with a workaround. The problem is in the features IronWorker provides for managing errors.

When there's an error in IronWorker, you can see the stats and details in HUD as well as from the API, which is great. When errors first start happening, all the stats are correct -- if IronWorker says there are 5 errors, then there are 5 errors.

However, once you rerun a task, the error stats become almost unusable. If the rerun task is successful, IronWorker still shows 5 errors. There's no way to remove the original task that has now been fixed and run successfully (not from HUD and not from the API). If the rerun task generates an error, IronWorker shows 6 errors, again with no way to remove the initial error.

The more active a code package, the faster everything spirals out of control, especially when it's important to fix and finish processing all errors. If I have 12 errors, are those new errors coming in (which therefore need to be fixed and rerun), or are they from past tasks that have already been rerun successfully? The task ID of a rerun task is different from the original task ID, so it's impossible to even use the API to distinguish errors needing attention from errors already resolved.

While we're in development mode, things work OK because we just manually remember which is which, but we can't figure out a system that will work in production.

How has this issue been handled by others?

Would it be possible to add a feature to delete the original task after re-running it? I believe that would completely solve our problem.
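
Until something like that exists, one possible client-side workaround is to keep our own ledger that links each rerun to the task it replaces. The sketch below is only an illustration under assumptions: `ErrorLedger`, its methods, and the `"error"` / `"complete"` status labels are hypothetical names, not part of the IronWorker API, and the code assumes the caller records the new task ID at the moment it triggers a rerun.

```python
from dataclasses import dataclass, field


@dataclass
class ErrorLedger:
    """Hypothetical client-side record of tasks and the reruns that replace them."""

    # task_id -> status string ("error" / "complete" are assumed labels)
    status: dict = field(default_factory=dict)
    # rerun task_id -> task_id of the task it replaced
    replaces: dict = field(default_factory=dict)

    def record(self, task_id, status):
        """Store the latest known status of a task."""
        self.status[task_id] = status

    def record_rerun(self, original_id, rerun_id):
        """Remember that rerun_id supersedes original_id."""
        self.replaces[rerun_id] = original_id

    def unresolved(self):
        """Return errored tasks that have not been rerun yet."""
        superseded = set(self.replaces.values())
        return sorted(
            task_id
            for task_id, state in self.status.items()
            if state == "error" and task_id not in superseded
        )


# Five original errors, as in the scenario described above.
ledger = ErrorLedger()
for task_id in ["a", "b", "c", "d", "e"]:
    ledger.record(task_id, "error")

# Rerun "a" successfully and rerun "b" with a new failure.
ledger.record_rerun("a", "a2")
ledger.record("a2", "complete")
ledger.record_rerun("b", "b2")
ledger.record("b2", "error")

raw_error_count = sum(1 for s in ledger.status.values() if s == "error")
needs_attention = ledger.unresolved()
```

In this example the raw error count is 6, while only `b2`, `c`, `d`, and `e` still need attention. It only works if every rerun goes through code that records the link, which is why a delete (or a rerun link) on the service side would be much simpler for us.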

Thanks!

carimura commented 11 years ago

Caleb, sorry for the slow response -- we've actually been circulating this internally, and it's inspired a set of conversations about how to improve this area. We'll reach out very soon for your thoughts.

Additionally -- you can feel confident that we'll have an answer that should satisfy your needs.

cc @treeder