reanahub / reana-workflow-engine-yadage

REANA Workflow Engine Yadage
http://reana-workflow-engine-yadage.readthedocs.io/
MIT License

workflow fails but is reported "pending" when initdata parameter fails #216

Closed tiborsimko closed 2 years ago

tiborsimko commented 2 years ago

Current behaviour

When a workflow contains an error, such as the following change in the roofit example:

diff --git a/reana-yadage.yaml b/reana-yadage.yaml
index bdf91d7..0d8e31a 100644
--- a/reana-yadage.yaml
+++ b/reana-yadage.yaml
@@ -6,7 +6,7 @@ inputs:
   directories:
     - workflow/yadage
   parameters:
-    events: 20000
+    nevents: 20000
     gendata: code/gendata.C
     fitdata: code/fitdata.C
 workflow:

The workflow fails:

$ kubectl logs reana-run-batch-e5f0bcb0-ea37-4238-8dd3-8400117d5fb1-jhvcs  workflow-engine 
2021-12-01 11:13:35,329 | yadage.creators | MainThread | INFO | initializing workflow with initdata: {'fitdata': 'code/fitdata.C', 'gendata': 'code/gendata.C', 'nevents': 20000} discover: True relative: True
2021-12-01 11:13:35,329 | adage.pollingexec | MainThread | INFO | preparing adage coroutine.
2021-12-01 11:13:35,329 | adage | MainThread | INFO | starting state loop.
2021-12-01 11:13:35,369 | yadage.wflowview | MainThread | INFO | added </init:0|defined|unknown>
2021-12-01 11:13:35,464 | yadage.handlers.expression_handlers | MainThread | INFO | matches
2021-12-01 11:13:35,464 | yadage.handlers.expression_handlers | MainThread | ERROR | no matches found for selection events in result <TypedLeafs: {'fitdata': '/var/reana/users/00000000-0000-0000-0000-000000000000/workflows/e5f0bcb0-ea37-4238-8dd3-8400117d5fb1/code/fitdata.C', 'gendata': '/var/reana/users/00000000-0000-0000-0000-000000000000/workflows/e5f0bcb0-ea37-4238-8dd3-8400117d5fb1/code/gendata.C', 'nevents': 20000}>
2021-12-01 11:13:35,464 | adage | MainThread | ERROR | some weird exception caught in adage process loop
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/adage/__init__.py", line 51, in run_polling_workflow
    for stepnum, controller in enumerate(coroutine):
  File "/usr/local/lib/python3.8/site-packages/adage/pollingexec.py", line 92, in adage_coroutine
    update_dag(controller, extend_decider,recursive_updates)
  File "/usr/local/lib/python3.8/site-packages/adage/pollingexec.py", line 48, in update_dag
    update_dag(controller, decider, recurse)
  File "/usr/local/lib/python3.8/site-packages/adage/pollingexec.py", line 43, in update_dag
    update_loop.send(command)
  File "/usr/local/lib/python3.8/site-packages/adage/pollingexec.py", line 21, in update_coroutine
    controller.apply_rules([rule])
  File "/usr/local/lib/python3.8/site-packages/adage/wflowcontroller.py", line 53, in apply_rules
    ctrlutils.apply_rules(self.adageobj, rules)
  File "/usr/local/lib/python3.8/site-packages/adage/controllerutils.py", line 127, in apply_rules
    rule.apply(adageobj)
  File "/usr/local/lib/python3.8/site-packages/yadage/stages.py", line 52, in apply
    self.rule.apply(WorkflowView(adageobj, self.offset))
  File "/usr/local/lib/python3.8/site-packages/yadage/stages.py", line 101, in apply
    self.schedule()
  File "/usr/local/lib/python3.8/site-packages/yadage/stages.py", line 144, in schedule
    scheduler(self, self.stagespec)
  File "/usr/local/lib/python3.8/site-packages/yadage/handlers/scheduler_handlers.py", line 198, in singlestep_stage
    parameters = {
  File "/usr/local/lib/python3.8/site-packages/yadage/handlers/scheduler_handlers.py", line 199, in <dictcomp>
    k: select_parameter(stage.view, v)
  File "/usr/local/lib/python3.8/site-packages/yadage/handlers/scheduler_handlers.py", line 49, in select_parameter
    value = handler(wflowview, parameter)
  File "/usr/local/lib/python3.8/site-packages/yadage/handlers/expression_handlers.py", line 160, in stage_output_selector
    return select_reference(
  File "/usr/local/lib/python3.8/site-packages/yadage/handlers/expression_handlers.py", line 53, in select_reference
    raise RuntimeError(
RuntimeError: no matches found in reference selection. selection events | result <TypedLeafs: {'fitdata': '/var/reana/users/00000000-0000-0000-0000-000000000000/workflows/e5f0bcb0-ea37-4238-8dd3-8400117d5fb1/code/fitdata.C', 'gendata': '/var/reana/users/00000000-0000-0000-0000-000000000000/workflows/e5f0bcb0-ea37-4238-8dd3-8400117d5fb1/code/gendata.C', 'nevents': 20000}>
2021-12-01 11:13:35,473 | adage | MainThread | INFO | unsubmittable: 0 | submitted: 0 | successful: 0 | failed: 0 | total: 1 | open rules: 2 | applied rules: 1
2021-12-01 11:13:35,473 | yadage.steering_api | MainThread | INFO | done. dumping workflow to disk.

but it is not reported as failed in the client:

$ reana-client status -w test                                                             
NAME   RUN_NUMBER   CREATED               STATUS 
test   1            2021-12-01T11:13:23   pending

This is because:

$ kubectl logs deployment/reana-workflow-controller job-status-consumer | tail -2 
2021-12-01 11:13:35,498 | root | MainThread | INFO |  [x] Received workflow_uuid: e5f0bcb0-ea37-4238-8dd3-8400117d5fb1 status: RunStatus.failed
2021-12-01 11:13:35,498 | root | MainThread | ERROR | Cannot transition workflow e5f0bcb0-ea37-4238-8dd3-8400117d5fb1 from status RunStatus.pending to RunStatus.failed.

Expected behaviour

Users should see this workflow as "failed".

Notes

It would be good for the workflow to report that it is running as soon as possible. In other words, the above workflow should not still be in the "pending" state when it fails; it should already be in the "running" state. That way we can keep the status transition rules unchanged: "pending -> failed" stays invalid, whilst "pending -> running -> failed" is valid.
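
To illustrate the transition rules described above, here is a minimal sketch. It is not the actual reana-workflow-controller code: the RunStatus members, the ALLOWED_TRANSITIONS table and both helper functions are assumptions made for illustration, and only the edges discussed in this issue are modelled.

```python
from enum import Enum


class RunStatus(Enum):
    """Hypothetical subset of run statuses, for illustration only."""

    pending = "pending"
    running = "running"
    finished = "finished"
    failed = "failed"


# Assumed transition table: the real rules live in REANA and may differ.
ALLOWED_TRANSITIONS = {
    (RunStatus.pending, RunStatus.running),
    (RunStatus.running, RunStatus.finished),
    (RunStatus.running, RunStatus.failed),
}


def can_transition(current, new):
    """Return True if moving from `current` to `new` is allowed."""
    return (current, new) in ALLOWED_TRANSITIONS


def apply_statuses(start, reported):
    """Apply reported statuses in order, ignoring invalid transitions."""
    status = start
    for new in reported:
        if can_transition(status, new):
            status = new
    return status


# The engine reports only "failed": the update is rejected, as in the log above.
stuck = apply_statuses(RunStatus.pending, [RunStatus.failed])

# The engine reports "running" first: the failure is then accepted.
fixed = apply_statuses(RunStatus.pending, [RunStatus.running, RunStatus.failed])
```

With this table, a workflow that fails before reporting "running" stays "pending", which matches the behaviour seen in the client.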

tiborsimko commented 2 years ago

Two things while working on this issue:

VMois commented 2 years ago

The CWL and Snakemake engines report the running status quite early, before validation or any heavy operations, so the issue should not affect them. Serial and Yadage are affected when workflow parameters are incorrect (and maybe in other scenarios too).

VMois commented 2 years ago

Another approach is to handle this across all workflow engines at once: instead of fixing each engine case by case, we can publish the running status in reana-commons/workflow_engine.py (in the run_workflow_engine_run_command function). I believe reana-commons is used in all engines. We already have similar logic for failed workflows in reana-commons/workflow_engine.py.
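
A rough sketch of that idea, under stated assumptions: RecordingPublisher, run_engine and bad_workflow are hypothetical names invented for this example and do not reflect the real reana-commons API. The only point is the ordering, with "running" published before any validation or heavy work.

```python
class RecordingPublisher:
    """Hypothetical stand-in for a status publisher; it only records calls."""

    def __init__(self):
        self.statuses = []

    def publish(self, workflow_uuid, status):
        self.statuses.append((workflow_uuid, status))


def run_engine(publisher, workflow_uuid, load_and_run):
    """Run `load_and_run`, publishing the status around it.

    Hypothetical wrapper: "running" is published first, so a later
    failure is a "running -> failed" transition.
    """
    publisher.publish(workflow_uuid, "running")
    try:
        load_and_run()
    except Exception:
        publisher.publish(workflow_uuid, "failed")
        return False
    publisher.publish(workflow_uuid, "finished")
    return True


def bad_workflow():
    """Simulate the parameter error from the log above."""
    raise RuntimeError("no matches found in reference selection")


publisher = RecordingPublisher()
ok = run_engine(publisher, "wf-1", bad_workflow)
```

After this call, publisher.statuses holds "running" followed by "failed", so the failure would no longer arrive while the workflow is still "pending".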

Possible consequences:

The possible consequences are not that bad. WDYT? Should we deal with the running status case by case (a) or in reana-commons (b)?

cc @mvidalgarcia @audrium

VMois commented 2 years ago

As decided, we will modify the engines case by case, because for now that is easier than releasing a new reana-commons version.