Open kmavrommatis opened 4 years ago
If tasks are getting stuck in the INITIALIZING state, it probably indicates that the scheduler (e.g. AWS Batch) is killing the jobs for some reason or another and the state update from the worker isn't making it to the database. You can turn on state reconciliation for your backend which may help:
Relevant config section: https://github.com/ohsu-comp-bio/funnel/blob/master/config/default-config.yaml#L276-L282
Code doc: https://github.com/ohsu-comp-bio/funnel/blob/master/compute/batch/backend.go#L140-L155
It probably wouldn't be all that hard to implement a routine that periodically scans QUEUED/INITIALIZING/RUNNING tasks and cancels them if they hit some sort of wall time specified in the config. However, it seems to me that this would just be masking an underlying issue.
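To make the idea concrete, here is a minimal sketch of the selection step such a routine would need. It is not Funnel code: `STATE_LIMITS`, `parse_rfc3339`, and `overdue_tasks` are hypothetical names, the limits are made-up values rather than real config options, and it assumes each task record carries `id`, `state`, and an RFC 3339 `creation_time`. Age is measured from creation, which only approximates time spent in the current state.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical per-state wall-time limits; these are NOT Funnel config options.
STATE_LIMITS = {
    "QUEUED": timedelta(hours=12),
    "INITIALIZING": timedelta(hours=6),
    "RUNNING": timedelta(hours=48),
}


def parse_rfc3339(stamp):
    """Parse an RFC 3339 timestamp such as '2020-01-01T00:00:00Z'."""
    # Drop fractional seconds; second resolution is enough for hour-scale limits.
    trimmed = re.sub(r"\.\d+", "", stamp).replace("Z", "+00:00")
    return datetime.fromisoformat(trimmed)


def overdue_tasks(tasks, now=None, limits=STATE_LIMITS):
    """Return the IDs of tasks older than the limit for their current state.

    Assumes each task is a dict with 'id', 'state' and 'creation_time' keys.
    """
    now = now or datetime.now(timezone.utc)
    overdue = []
    for task in tasks:
        limit = limits.get(task["state"])
        if limit is None:
            continue  # state has no limit (e.g. already finished): skip it
        if now - parse_rfc3339(task["creation_time"]) > limit:
            overdue.append(task["id"])
    return overdue
```

A periodic job could pass the returned IDs to whatever cancel call the deployment exposes. As noted above, this only cleans up after the fact and does not explain why the tasks got stuck.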
Thanks for the pointers.
I enabled the reconciliation but I could not see any improvement (set it to check every 30m).
I occasionally have jobs that are stuck in either the INITIALIZING or RUNNING state for days (until I kill them). The only common thread I have found is that they are stuck at stages that require transferring many files (e.g. >40 files), each several GB in size. Unfortunately, this is not reproducible, i.e. if I start the same job again it will probably go through. I was wondering if this is really a network problem. Check for example the following plot.
It comes from a job that has finished running and has been stuck transferring files to S3 for hours. It starts transferring at high speed and then drops to a constant, very slow speed. I have seen similar plots for all the other stuck jobs I checked. I wonder if this is a result of attempting many parallel transfers, or if there is an I/O block somewhere. In all these cases the funnel task is in uninterruptible sleep, caused (presumably) by I/O.
I've added an option to the worker config to limit the number of concurrent uploads/downloads. The default value is 10.
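For readers unfamiliar with the technique, the following is a rough illustration of capping concurrent transfers with a bounded worker pool. It is not the Funnel implementation: `transfer_all` and the `transfer` callable are hypothetical, and only the default of 10 comes from the comment above.

```python
from concurrent.futures import ThreadPoolExecutor


def transfer_all(items, transfer, max_concurrent=10):
    """Run transfer(item) for every item, at most max_concurrent at a time.

    `transfer` is a hypothetical callable that performs one upload or download.
    Results are returned in the same order as `items`.
    """
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    # The pool size is the concurrency cap: extra items wait for a free worker.
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(transfer, items))
```

Lowering the cap trades peak throughput for fewer simultaneous connections and less I/O contention, which may help if the stalls come from too many parallel transfers.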
Hi, is there a way to set default time limits for each state of a job? E.g. if a job stays in the INITIALIZING state for over 6h, consider it failed, cancel it, and transition it to a CANCELED or ERROR state. Thanks in advance for your help.