Closed: WadeWaldron closed this issue 8 years ago
Cross-posting an exchange from Slack:
shergill [10:47 AM]
i have a few thoughts. but my first thought is that it tried to resume from the failed state, which tells me we didn't do appropriate cleanup after the double load failure in the second run yesterday.
[10:47]
that would be the first step, imo
wade.waldron [10:48 AM]
@shergill: The Double Load scenario occurred because we were testing the deploy of EigenFlow. I deliberately did the double load expecting that it would encounter this scenario.
shergill [10:49 AM]
wade.waldron: agreed. but that was yesterday. after DL was encountered, the state should have been restored to where it was before DL was encountered
[10:49]
so that if you tried to run a third time yesterday you'd run into DL again, but if you tried to run the job today, it would go ahead
[10:50]
now we could do it by "resetting" state, or by tweaking the interpreter which interprets the state information regarding job runs etc
wade.waldron [10:52 AM]
@shergill: I think I follow. You are saying that after a Double Load is encountered, something (person or machine) needs to basically reset the state to say that the double load run should be disregarded. Or something to that effect.
[10:53]
Right now I am working on clearing that state manually. However that's not a good long term solution (hence the discussion).
shergill [10:54 AM]
wade.waldron: yes. one way to do so would be to update the log/state journal we have with a “disregard uptil” message. and then the interpreter which parses and interprets this would do the right thing (obey the directive)
Example:

```scala
val download = Downloading {
  ...
} onFailure {
  case NetworkIssue => Retry(3.seconds, 10)
  case DoubleLoad   => SkipRun
}
```
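A minimal sketch of how the "disregard uptil" journal directive described above could be interpreted when deciding where to resume. All entry and function names here are hypothetical, not EigenFlow's actual API:

```scala
// Hypothetical journal entries; names are illustrative only.
sealed trait JournalEntry
case class RunRecorded(runId: Int, failed: Boolean) extends JournalEntry
case class DisregardUntil(runId: Int) extends JournalEntry

// Interpret the journal: any run at or before the latest
// DisregardUntil marker is ignored when computing resume state,
// so a disregarded failed run no longer blocks the next run.
def effectiveRuns(journal: List[JournalEntry]): List[RunRecorded] = {
  val cutoff = journal
    .collect { case DisregardUntil(id) => id }
    .sorted.lastOption.getOrElse(-1)
  journal.collect { case r: RunRecorded if r.runId > cutoff => r }
}

val journal = List(
  RunRecorded(1, failed = false),
  RunRecorded(2, failed = true),  // the double-load failure
  DisregardUntil(2),              // the "disregard uptil" directive
  RunRecorded(3, failed = false)
)

// Only run 3 survives; the failed run 2 no longer forces a resume-from-failed.
assert(effectiveRuns(journal) == List(RunRecorded(3, failed = false)))
```

The appeal of this approach is that the journal stays append-only: nothing is deleted, the interpreter just obeys the directive.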
How do we define a double load? Is it when we see that a job is running today and we know that the next processing date is tomorrow? In that case, shouldn't we automatically exit because we know that we're running before the next processing date?
@yawaramin that is correct, but the system may be forced to re-run starting from a specific day/time. In that case it will ignore the fact that it already ran that period. When a process has double load protection, the process may fail, and the problem is that it will be stuck there until it is manually forced to process another day.
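The definition of a double load discussed above could be sketched as a simple date comparison: a run whose date falls before the next scheduled processing date has already been loaded. This is an assumption about how detection might work, not EigenFlow's actual logic:

```scala
import java.time.LocalDate

// Hypothetical double-load check: the run starts before the next
// scheduled processing date, so that period was already processed.
def isDoubleLoad(runDate: LocalDate, nextProcessingDate: LocalDate): Boolean =
  runDate.isBefore(nextProcessingDate)

// A job ran on the 1st, so the next processing date is the 2nd:
// re-running on the 1st is a double load; running on the 2nd is not.
assert(isDoubleLoad(LocalDate.of(2016, 5, 1), LocalDate.of(2016, 5, 2)))
assert(!isDoubleLoad(LocalDate.of(2016, 5, 2), LocalDate.of(2016, 5, 2)))
```

A forced re-run would simply bypass this check, which is why the failure-and-stuck behavior described above needs a separate escape hatch.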
EigenFlow should have a feature that allows a job to terminate early. This would let a job determine that it has no need to continue through the later phases (for example, in a double load scenario). In that case the job should be able to specify that, rather than failing, we just want to terminate the run early, and then the next run can continue as normal.
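The early-termination feature could be modeled as a third phase outcome alongside success and failure. The following is a sketch under assumed names (`PhaseResult`, `runPipeline`, etc. are not EigenFlow's API), showing how `TerminateEarly` skips the remaining phases without marking the run failed:

```scala
// Hypothetical phase outcomes; names are illustrative only.
sealed trait PhaseResult
case object Continue extends PhaseResult
case object TerminateEarly extends PhaseResult      // e.g. double load detected
case class Failed(reason: String) extends PhaseResult

// Run phases in order; once a phase returns anything other than
// Continue, the remaining phases are skipped.
def runPipeline(phases: List[() => PhaseResult]): PhaseResult =
  phases.foldLeft[PhaseResult](Continue) {
    case (Continue, phase) => phase()
    case (stopped, _)      => stopped  // skip later phases
  }

// The second phase detects a double load and terminates early; the
// third phase never executes, and the run is not treated as failed.
val result = runPipeline(List(
  () => Continue,
  () => TerminateEarly,
  () => sys.error("should not run")
))
assert(result == TerminateEarly)
```

Because `TerminateEarly` is distinct from `Failed`, the state journal would record a completed (if short) run, and the next scheduled run can proceed normally instead of resuming from a failed state.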