reanahub / reana-workflow-engine-snakemake

REANA Workflow Engine Snakemake
MIT License
0 stars 22 forks source link

executor, add job status polling #34

Closed VMois closed 2 years ago

VMois commented 2 years ago

closes #33

Initial implementation of this PR relied on both file checking and polling in two separate threads. I don't think it was a good idea, so the final version completely relies on the polling job controller.

How to test:

  1. Check out this branch and ensure that snakemake-engine Docker image is updated.

  2. There are 3 main scenarios:

You can easily test the first two scenarios. The third one requires more setup, check the review notes below for more details. The most important thing is to check that no bugs were introduced.

  1. Check if the progress state is displayed correctly in reana-client list --include-progress

  2. Check if engine and jobs logs are displayed

Review notes:

codecov-commenter commented 2 years ago

Codecov Report

Merging #34 (ae76a58) into master (7269f5a) will decrease coverage by 0.66%. The diff coverage is 0.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master     #34      +/-   ##
=========================================
- Coverage    4.92%   4.26%   -0.67%     
=========================================
  Files           6       6              
  Lines         142     164      +22     
=========================================
  Hits            7       7              
- Misses        135     157      +22     
Impacted Files Coverage Δ
reana_workflow_engine_snakemake/executor.py 0.00% <0.00%> (ø)
VMois commented 2 years ago

The logs from failed jobs are still gone, I assume is part of this issue.

Not sure about that, might be more related to this issue. Could you, please, share what did you do exactly to get this error?

mvidalgarcia commented 2 years ago

The logs from failed jobs are still gone, I assume is part of this issue.

Not sure about that, might be more related to this issue. Could you, please, share what did you do exactly to get this error?

Yeah, might be. This problem wasn't there before because we performed many tests involving wrong memory values and it was reported back to the user. Here is the validation logic. Could it be related to the new changes?

In addition, checking the workflow-engine logs, I see that when the job is reported as failed, we don't capture its ID properly.

$ kubectl logs reana-run-batch-foo workflow-engine
...
2022-01-10 11:17:44,222 | reana-workflow-engine-snakemake | MainThread | INFO | Job 'helloworld' received, command: python code/helloworld.py --inputfile data/names.txt --outputfile results/greetings.txt --sleeptime 0
2022-01-10 11:17:44,222 | reana-workflow-engine-snakemake | MainThread | INFO | Environment: python:2.7-slim
2022-01-10 11:17:44,504 | reana-workflow-engine-snakemake | MainThread | ERROR | Error submitting job helloworld: Job submission error: The "kubernetes_memory_limit" provided 128MB has wrong format.
2022-01-10 11:17:44,507 | reana-workflow-engine-snakemake | MainThread | INFO | failed job: None
...
VMois commented 2 years ago

This problem wasn't there before because we performed many tests involving wrong memory values and it was reported back to the user.

It is reported in Yadage and Serial but not in Snakemake (and, I think, also CWL). You can test a wrong value with Snakemake master branch. There will be no logs with reana-client logs.

In addition, checking the workflow-engine logs, I see that when the job is reported as failed, we don't capture its ID properly.

Same, it happens with the master branch too. This is due to a reason that submission fails so the job doesn't have an ID.