hysds / hysds-framework

HySDS framework releases
Apache License 2.0

Document pge dev best practice for hard/soft time limits #6

Open ldang opened 6 years ago

ldang commented 6 years ago

Update the job spec (hysds-io) page on the HySDS GitHub wiki to highlight a gotcha with hard/soft time limits.

PGEs can specify a soft and a hard time limit, which cause the worker to time out a job that runs too long.
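As background for the discussion below, here is a minimal sketch of how the two limits typically map onto Celery task options (the task name, broker URL, and limit values are illustrative, not the actual HySDS/verdi configuration):

```python
import subprocess

from celery import Celery
from celery.exceptions import SoftTimeLimitExceeded

app = Celery("verdi", broker="amqp://localhost")  # illustrative broker URL

# soft_time_limit: Celery raises SoftTimeLimitExceeded inside the task so it
# can clean up gracefully.  time_limit: Celery force-kills the worker child
# process once this many seconds have elapsed.  Keeping a gap between the two
# gives the cleanup path a chance to finish before the kill arrives.
@app.task(soft_time_limit=3600, time_limit=3900)
def run_pge(command):
    try:
        # Placeholder for the real PGE execution.
        return subprocess.check_call(command, shell=True)
    except SoftTimeLimitExceeded:
        # Graceful-shutdown path: clean up partial outputs here, then
        # re-raise so the job is marked as timed out.
        raise
```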

We had been seeing an inconsistency where Mozart showed a job in the started state even though it was no longer in the system. This appears to happen when the soft and hard time limits are the same.

Our design was:

- Soft time limit should send SIGTERM
- Hard time limit should send SIGKILL (kill -9)

We think there may be a race condition where Celery attempts to kill the task on both the soft and hard timeouts: it sent SIGKILL (kill -9) to verdi too quickly after the SIGTERM.

We could recommend that PGE developers never set the soft and hard time limits to the same value. We could further add a check in the container builder that fails the build if the soft and hard time limits are the same.
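A hypothetical build-time validation along those lines might look like the following (the job-spec field names `soft_time_limit` and `time_limit` are assumed here):

```python
import json
import sys


def validate_time_limits(job_spec_path):
    """Fail the build if the soft limit is not strictly below the hard limit."""
    with open(job_spec_path) as f:
        spec = json.load(f)
    soft = spec.get("soft_time_limit")
    hard = spec.get("time_limit")
    if soft is not None and hard is not None and soft >= hard:
        sys.exit(
            "build failed: soft_time_limit (%s) must be strictly less than "
            "time_limit (%s) in %s" % (soft, hard, job_spec_path)
        )
```

The container builder would call this on each job spec before building the image.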

ldang commented 6 years ago

A new thought is to require the PGE developer to set only the soft limit, and have the system calculate a default hard limit as the soft limit plus a fudge factor of, say, 5 minutes.
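A sketch of that default (the 5-minute fudge factor is the illustrative value from above):

```python
FUDGE_FACTOR_SECS = 5 * 60  # proposed default gap between soft and hard limits


def derive_hard_limit(soft_time_limit, fudge=FUDGE_FACTOR_SECS):
    """Return a hard time limit strictly greater than the soft limit."""
    return soft_time_limit + fudge


derive_hard_limit(3600)  # -> 3900
```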

@pymonger raised a concern that we need to make a judgement call about the soft time limit for sciflo jobs that block on spawned jobs. We can set a very high soft time limit to avoid prematurely terminating the entire workflow, but we then risk hanging jobs preventing more sciflo jobs from running. Alternatively, we can increase the number of workers on the queue to support more concurrent sciflo jobs.

ldang commented 6 years ago

Another concern when estimating the soft time limit is to make sure the developer accounts for the maximum runtime rather than the average; otherwise, the job may never be able to complete.

Examples might be crawler jobs that pull data from a data provider. Although these jobs might not take long to run on a regular basis, occasional downtimes can accumulate a backlog, after which the crawler takes much longer to run than normal.