reanahub / reana-demo-helloworld

REANA example - "hello world"

yadage: Slurm compute backend example #48

Closed by tiborsimko 3 years ago

tiborsimko commented 3 years ago

FYI, the example does not work:

$ reana-client logs -w hello-yad-hpc
...

2021-02-18 14:21:17,881 | reana-workflow-engine-yadage | MainThread | INFO | Finalizing the progress tracking for: <yadage.wflow.YadageWorkflow object at 0x7f88e5bce130>
2021-02-18 14:21:17,886 | yadage.steering_api | MainThread | INFO | done. dumping workflow to disk.
2021-02-18 14:21:17,889 | reana-workflow-engine-yadage | MainThread | ERROR | Workflow failed: workflow finished but failed
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/reana_workflow_engine_yadage/cli.py", line 156, in run_yadage_workflow
    ys.adage_argument(
  File "/usr/local/lib/python3.8/contextlib.py", line 120, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.8/site-packages/yadage/steering_api.py", line 110, in steering_ctx
    execute_steering(
  File "/usr/local/lib/python3.8/site-packages/yadage/steering_api.py", line 60, in execute_steering
    ys.run_adage(backend)
  File "/usr/local/lib/python3.8/site-packages/yadage/steering_object.py", line 100, in run_adage
    adage.rundag(controller=self.controller, **self.adage_kwargs)
  File "/usr/local/lib/python3.8/site-packages/adage/__init__.py", line 137, in rundag
    run_polling_workflow(controller, coroutine, update_interval, trackerlist, maxsteps)
  File "/usr/local/lib/python3.8/site-packages/adage/__init__.py", line 51, in run_polling_workflow
    for stepnum, controller in enumerate(coroutine):
  File "/usr/local/lib/python3.8/site-packages/adage/pollingexec.py", line 89, in adage_coroutine
    raise RuntimeError('workflow finished but failed')
RuntimeError: workflow finished but failed
2021-02-18 14:21:17,890 | root | MainThread | ERROR | Error while publishing channel disconnected

....

==> Job logs
==> Step: helloworld
...
==> Status: failed
==> Logs:
Auks API request failed : krb5 cred : unable to read credential cache
INFO:    Converting OCI blobs to SIF format
srun: error: hpc009: task 0: Exited with exit code 255
srun: Terminating job step 951650.0
FATAL:   Unable to handle docker://python:2.7-slim uri: while building SIF from layers: unable to create new build: while searching for mksquashfs: exec: "mksquashfs": executable file not found in $PATH

This is similar to, but distinct from, the trouble with the RooFit example (see https://github.com/reanahub/reana-demo-root6-roofit/pull/44), which suggests that reana-workflow-engine-yadage has issues with the Slurm integration.
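
The job log above contains two separate errors: the Auks/Kerberos credential cache failure and the missing `mksquashfs` executable. A minimal triage sketch for telling them apart in longer logs might look like the following. The `SIGNATURES` table and the `classify_job_log` helper are hypothetical names for illustration, not part of REANA; only the matched message fragments are taken from the log excerpt.

```python
import re

# Hypothetical signature table; the message fragments are copied from the
# job log excerpt in this issue, the key names are made up for illustration.
SIGNATURES = {
    "kerberos_credential_cache": re.compile(
        r"krb5 cred : unable to read credential cache"
    ),
    "mksquashfs_missing": re.compile(
        r'exec: "mksquashfs": executable file not found in \$PATH'
    ),
}


def classify_job_log(log_text):
    """Return the sorted names of all known signatures found in log_text."""
    return sorted(
        name for name, pattern in SIGNATURES.items() if pattern.search(log_text)
    )
```

Running `classify_job_log` on the "Logs" section of the failed `helloworld` step would report both signatures, which makes it easier to see that the SIF build failure is not necessarily caused by the Kerberos warning.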

tiborsimko commented 3 years ago

Let's check our Singularity wrapper in the reana-job-controller component. The current version on the gate is 3.7.1-1.el7.
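
As a first check, the FATAL line in the job log says that `mksquashfs` could not be found in `$PATH`. A small sketch like the one below could be run in the same environment as the job (for example via `srun` on a compute node, which is an assumption about the setup) to see which tools fail to resolve. The `find_missing_tools` helper is hypothetical and not part of reana-job-controller.

```python
import shutil


def find_missing_tools(tools=("singularity", "mksquashfs"), path=None):
    """Return the tools from `tools` that cannot be resolved on `path`.

    `path` is a PATH-style string; None means the current process's PATH.
    """
    return [tool for tool in tools if shutil.which(tool, path=path) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required tools found on PATH.")
```

If `mksquashfs` is reported as missing only inside the job environment, the wrapper's `PATH` would be the more likely culprit than the Singularity installation itself.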