reanahub / reana-workflow-engine-snakemake

workflow engine stuck waiting for already-finished jobs #59

Closed: mdonadoni closed this issue 1 year ago

mdonadoni commented 1 year ago

This happened when running https://github.com/reanahub/reana-workflow-engine-snakemake/pull/42#discussion_r859837148

Only the run-batch-... pod is still running; all the run-job-... pods have finished:

$ kubectl get pods | grep run-
reana-run-batch-7c0ebe80-6cf3-44df-9899-105ffa5ab062-f2j8k   3/3     Running   0          57m

job-controller has cleaned up all the jobs (175):

$ kubectl logs reana-run-batch-7c0ebe80-6cf3-44df-9899-105ffa5ab062-f2j8k -c job-controller | grep 'Cleaning Kubernetes job' | wc -l
175

According to job-controller, all the jobs have finished:

$ kubectl exec reana-run-batch-7c0ebe80-6cf3-44df-9899-105ffa5ab062-f2j8k -c job-controller -- curl localhost:5000/jobs > jobs.json
$ cat jobs.json | jq '.jobs[] | values[].status' | wc -l
175
$ cat jobs.json | jq '.jobs[] | values[].status' | grep finished | wc -l
175
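The same check can be done in Python when a per-status breakdown is more useful than a single count. This is a minimal sketch, not part of REANA: `tally_statuses` is a hypothetical helper, and the payload shape (a `jobs` collection whose entries map a key to an object with a `status` field) is only inferred from the jq filter above.

```python
from collections import Counter


def tally_statuses(payload: dict) -> Counter:
    """Count job statuses in a saved job-controller /jobs response.

    Assumption: "jobs" is a list (or mapping) of objects whose values
    each carry a "status" field, as implied by the jq filter
    `.jobs[] | values[].status`.
    """
    jobs = payload.get("jobs", [])
    # Accept both a list of entries and a mapping of entries.
    entries = jobs.values() if isinstance(jobs, dict) else jobs
    counts = Counter()
    for entry in entries:
        for job in entry.values():
            counts[job.get("status")] += 1
    return counts
```

To use it on the file saved above, load it with `json.load(open("jobs.json"))` and pass the result to `tally_statuses`; any status other than `finished` would show up as its own key.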

r-w-e-snakemake confirms that 175 jobs were submitted, but it has logged only 171 of them as finished:

$ kubectl logs reana-run-batch-7c0ebe80-6cf3-44df-9899-105ffa5ab062-f2j8k -c workflow-engine | grep 'submitted job:' | wc -l
175
$ kubectl logs reana-run-batch-7c0ebe80-6cf3-44df-9899-105ffa5ab062-f2j8k -c workflow-engine | grep 'job is finished.' | wc -l
171
$ kubectl logs reana-run-batch-7c0ebe80-6cf3-44df-9899-105ffa5ab062-f2j8k -c workflow-engine | tail
2023-07-17 13:29:58,697 | snakemake.logging | Thread-1 | INFO | Finished job 124.
2023-07-17 13:29:58,697 | snakemake.logging | Thread-1 | INFO | 169 of 176 steps (96%) done
2023-07-17 13:29:58,701 | reana-workflow-engine-snakemake | Thread-1 | INFO | make_data job is finished. job_id: 76807136-32ca-4349-ab4d-1b32d7df8bb8
2023-07-17 13:29:58,702 | snakemake.logging | Thread-1 | INFO | [Mon Jul 17 13:29:58 2023]
2023-07-17 13:29:58,702 | snakemake.logging | Thread-1 | INFO | Finished job 139.
2023-07-17 13:29:58,702 | snakemake.logging | Thread-1 | INFO | 170 of 176 steps (97%) done
2023-07-17 13:30:08,720 | reana-workflow-engine-snakemake | Thread-1 | INFO | make_data job is finished. job_id: 9d02883d-c7da-4f67-909a-72fd53db0dbe
2023-07-17 13:30:08,721 | snakemake.logging | Thread-1 | INFO | [Mon Jul 17 13:30:08 2023]
2023-07-17 13:30:08,721 | snakemake.logging | Thread-1 | INFO | Finished job 154.
2023-07-17 13:30:08,721 | snakemake.logging | Thread-1 | INFO | 171 of 176 steps (97%) done
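To find out which jobs the workflow engine never acknowledged, the two sources can be compared by job ID. The sketch below is an illustration under assumptions, not REANA code: the helper names are hypothetical, the log pattern is taken from the `job is finished. job_id: ...` lines above, and it assumes that the keys of each entry in the job-controller response are the same job IDs that the engine logs.

```python
import re

# Matches engine log lines such as:
#   "... | make_data job is finished. job_id: 76807136-32ca-4349-ab4d-1b32d7df8bb8"
FINISHED_RE = re.compile(r"job is finished\. job_id: ([0-9a-f-]{36})")


def finished_in_engine_log(log_text: str) -> set:
    """Job IDs that the workflow engine has logged as finished."""
    return set(FINISHED_RE.findall(log_text))


def finished_in_job_controller(payload: dict) -> set:
    """Job IDs that the job-controller reports with status 'finished'.

    Assumption: each entry of "jobs" maps a job ID to an object that
    has a "status" field.
    """
    jobs = payload.get("jobs", [])
    entries = jobs.values() if isinstance(jobs, dict) else jobs
    return {
        job_id
        for entry in entries
        for job_id, job in entry.items()
        if job.get("status") == "finished"
    }


def unacknowledged_jobs(log_text: str, payload: dict) -> set:
    """Jobs finished per job-controller but never logged as finished by the engine."""
    return finished_in_job_controller(payload) - finished_in_engine_log(log_text)
```

Feeding it the full workflow-engine log (saved to a file with `kubectl logs ... -c workflow-engine`) and the parsed `jobs.json` should return the IDs of the jobs the engine is still waiting for, provided the assumptions above hold.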

In the database, four jobs are still reported as running, which matches the gap between the 175 submitted and the 171 finished jobs:

reana=# select id_, backend_job_id, status from __reana.job where status != 'finished';
                 id_                  |                   backend_job_id                   | status  
--------------------------------------+----------------------------------------------------+---------
 607da740-1199-4cba-9ab7-2cb9ac5772a8 | reana-run-job-160bec2f-86bf-47bc-a208-f0635c3632e4 | running
 b5ffb667-6910-4e3f-be6e-21c13ca49161 | reana-run-job-f410a704-794b-4cac-8ad3-687644107acb | running
 985e6a86-3bc2-4fee-86b0-9e513e80b3f6 | reana-run-job-16380469-6e75-4f9f-98b0-cfd4f4f03d5e | running
 49f6e328-00ca-4ca9-a6ed-dc7602e6b9fd | reana-run-job-291c18c3-ccff-4b16-a300-0cdb37671f1c | running
(4 rows)

mdonadoni commented 1 year ago

Additional note: this is the reana.yaml used for the workflow:

inputs:
  files:
    - Snakefile
workflow:
  type: snakemake
  file: Snakefile
  resources:
    kerberos: true
outputs:
  files:
    - myoutput.png