liyawang closed 5 years ago
The deadlock still happens, and it is related to the 'sleep' process that is blocking the database. Could the 'sleep' process come from the operations we implemented to retry more than once when talking to Agave?
We only use 'SLEEP' when submitting or retrieving a job fails. We try only three times (in both the UI and the API), sleeping for 1 second each time.
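For reference, the retry behaviour described above can be sketched roughly as follows. This is an illustration only, not the actual SciApps code: `call_with_retry` and its parameters are hypothetical names, and it assumes the 1-second sleep happens between failed attempts.

```python
import time


def call_with_retry(request, attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Call request() up to `attempts` times, sleeping after each failed try.

    `request` is any zero-argument callable (hypothetical here, e.g. a
    function that submits a job to Agave or retrieves its status).
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return request()
        except Exception as error:  # real code would catch a narrower error type
            last_error = error
            if attempt < attempts:
                sleep(delay_seconds)  # no sleep after the final attempt
    raise last_error
```

Note that a sleep like this runs in the client process. If a database connection or transaction is held open while it sleeps, that could plausibly show up as a long-lived idle connection on the database side, though the thread does not confirm that this is the cause.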
When a workflow is submitted, Agave returns a job ID, which we use to update the workflow.
Fixed by killing the long-running 'SLEEP' process.
A 'sleep 2m' was added at line 147 of this script (https://github.com/warelab/misc/blob/master/Liya/runWorkflow.sh) to avoid a deadlock on the SciApps database. The deadlock occurs when we use the API to update the metadata of a workflow just as the workflow record is being created in the SciApps database. Waiting a while (e.g. 2 minutes) before updating the metadata resolves the issue. Alternatively, we could add deadlock prevention to the API code: http://wiki.apidesign.org/wiki/Deadlock
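One possible shape for a safeguard in the API code, instead of a fixed 2-minute wait, is to retry the metadata update when the database reports a deadlock. The sketch below is only an assumption about how that might look: `DeadlockError` and `update_with_deadlock_retry` are hypothetical names, not part of the SciApps API, and it assumes the database aborts the losing transaction with an error rather than hanging.

```python
import time


class DeadlockError(Exception):
    """Hypothetical stand-in for the database driver's deadlock error."""


def update_with_deadlock_retry(update, max_attempts=5, base_delay=0.5,
                               sleep=time.sleep):
    """Run update(); on a deadlock, wait and retry with a growing delay.

    `update` is a zero-argument callable (hypothetical) that performs the
    workflow metadata update in its own transaction.
    """
    for attempt in range(max_attempts):
        try:
            return update()
        except DeadlockError:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            # Back off: 0.5 s, 1 s, 2 s, ... with the default base_delay
            sleep(base_delay * (2 ** attempt))
```

Compared with the fixed sleep in runWorkflow.sh, this only waits when a deadlock is actually reported, but it requires a change to the API code rather than to the calling script.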
The SciApps log file has also been updated to capture the time information.