reanahub / reana-workflow-controller

REANA Workflow Controller
http://reana-workflow-controller.readthedocs.io/

job pods keep running even after workflow failure #546

Closed: mdonadoni closed this issue 9 months ago

mdonadoni commented 11 months ago

The pod reana-run-batch-... is terminated as soon as one job of the workflow fails, even if other jobs are still running: https://github.com/reanahub/reana-workflow-controller/blob/3004b14a7d60eb39dfcbdc51e15242b27edd70c3/reana_workflow_controller/consumer.py#L163-L170

This means that those jobs outlive reana-run-batch-...: their k8s pods are not cleaned up and the database is not updated.
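For illustration only, here is a minimal sketch of the kind of cleanup that would avoid the orphaned pods, written with the official kubernetes Python client. The label selector, namespace, and helper name are assumptions made for this example, not REANA's actual implementation.

```python
# Sketch only: delete the remaining job pods of a workflow before tearing
# down the batch pod. The namespace and the "reana-run-job-workflow-uuid"
# label are hypothetical, chosen just for this example.
from kubernetes import client, config


def cleanup_remaining_job_pods(workflow_id: str, namespace: str = "default") -> None:
    """Delete job pods still running for the given workflow (illustrative)."""
    config.load_incluster_config()  # or config.load_kube_config() outside the cluster
    core_v1 = client.CoreV1Api()
    pods = core_v1.list_namespaced_pod(
        namespace,
        label_selector=f"reana-run-job-workflow-uuid={workflow_id}",  # hypothetical label
    )
    for pod in pods.items:
        core_v1.delete_namespaced_pod(pod.metadata.name, namespace)
```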

How to reproduce:

  1. Prepare workflow files (see below)
  2. Run the workflow

One job will fail; the other one will keep running even after reana-run-batch-... is terminated, and its job pod will not be cleaned up either (see the verification sketch after the workflow files).

values-dev.yaml

```diff
     REANA_RATELIMIT_SLOW: "5 per second"
   reana_workflow_controller:
     image: docker.io/reanahub/reana-workflow-controller
-    environment:
-      REANA_RUNTIME_KUBERNETES_KEEP_ALIVE_JOBS_WITH_STATUSES: failed
+    # environment:
+    #   REANA_RUNTIME_KUBERNETES_KEEP_ALIVE_JOBS_WITH_STATUSES: failed
   reana_workflow_engine_cwl:
     image: docker.io/reanahub/reana-workflow-engine-cwl
   reana_workflow_engine_yadage:
```
reana.yaml

```yaml
version: 0.9.0
inputs:
  files:
    - Snakefile
workflow:
  type: snakemake
  file: Snakefile
```
Snakefile

```snakefile
rule all:
    input:
        "r1.txt",
        "r2.txt",

rule r1:
    output: "r1.txt"
    container: "docker://docker.io/library/python:3.8-slim"
    shell: "sleep 120; echo done > r1.txt"

rule r2:
    output: "r2.txt"
    container: "docker://docker.io/library/python:3.8-slim"
    shell: "exit 1"
```
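To confirm the orphaned pod after reproducing the issue, one can list the run pods once reana-run-batch-... has disappeared. The sketch below uses the kubernetes Python client; the namespace and the reana-run-job- name prefix are assumptions made for this example.

```python
# Sketch only: list leftover run pods to confirm the orphaned job pod.
# The "default" namespace and the "reana-run-job-" name prefix are
# assumptions made for this example.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running inside the cluster
core_v1 = client.CoreV1Api()

for pod in core_v1.list_namespaced_pod("default").items:
    name = pod.metadata.name
    if name.startswith(("reana-run-batch-", "reana-run-job-")):
        print(name, pod.status.phase)

# Expected after reproducing the issue: no reana-run-batch-... pod is left,
# but a reana-run-job-... pod is still in the Running phase.
```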