gwforg / gwf

A flexible, pragmatic workflow tool.
https://gwf.app/
GNU General Public License v3.0

Target that should not run yet fails #405

Closed: LudvigOlsen closed this issue 1 year ago

LudvigOlsen commented 1 year ago

I have a workflow with two targets (say A and B), where the input to B is created by A. I submit both jobs at the same time, and shortly after, A is running and B has failed.

Here, I first print the path to the input file and whether it exists, then the status (gwfss is just an alias for summary). A is submitted and B has failed. In this case, I actually only asked to submit B, but it also submitted A, so it knows that B depends on A.

[Screenshot, 2023-05-17 at 15:13:05]

Now, I usually suspect this type of error to be my own, but in this case the workflow is so simple that it seems to be a bug. I have had some instances previously where I suspected a bug, but the workflow was far too complex to be certain.

Here is the code for submitting B. Note that A is supposed to make sample_dir / "dataset" / "feature_dataset.npy" so it doesn't currently exist. to_strings is just a list comprehension converting Paths to strings.
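The screenshot of the code is not reproduced here. As a rough sketch only, based on the description above, the helper and the input path for B might look something like this. The body of to_strings is a guess from "a list comprehension converting Paths to strings", and the value of sample_dir is a placeholder, not the real path from the workflow.

```python
from pathlib import Path


def to_strings(paths):
    """Convert an iterable of Path objects to plain strings (assumed implementation)."""
    return [str(p) for p in paths]


# Placeholder directory; the real sample_dir is not shown in the issue.
sample_dir = Path("samples") / "sample_1"

# The file that target A is expected to create and that target B lists as an input.
feature_dataset = sample_dir / "dataset" / "feature_dataset.npy"

# Input list for B, as strings.
b_inputs = to_strings([feature_dataset])
print(b_inputs, feature_dataset.exists())
```

Since A has not finished yet, the existence check is expected to be false at submission time, which is why B should wait for A.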

[Image]

The job fails when it cannot find feature_dataset.npy.

[Screenshot, 2023-05-17 at 15:06:58]

When looking in gwf info for B, it correctly has the ...feature_dataset.npy path in inputs, and so it shouldn't run as that file does not exist.

Let me know if you need other information.

dansondergaard commented 1 year ago

What does jobinfo say about target A? Could it be that A didn't produce the output files it should, but still completed with a zero return code? Then B would start running, but fail since an input file is missing.
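One way to test that hypothesis by hand is to check, once A has finished, which of its declared outputs actually exist on disk. The following is only a sketch: missing_outputs is a hypothetical helper, not part of gwf, and the path in the example is a placeholder.

```python
from pathlib import Path


def missing_outputs(expected_outputs):
    """Return the expected output paths (as strings) that do not exist on disk.

    Hypothetical helper for manual debugging; not a gwf function.
    """
    return [str(p) for p in map(Path, expected_outputs) if not p.exists()]


# Placeholder path standing in for the outputs declared by target A.
expected = [Path("samples") / "sample_1" / "dataset" / "feature_dataset.npy"]
print(missing_outputs(expected))
```

If the list is non-empty after A reports a zero return code, A finished without producing everything it declared.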

Does it still say "submitted" after some time? Slurm is a bit unreliable when it comes to fetching the status, so it can take 10-30 seconds before it reports the correct status for a job/target.

If you can produce a minimal example that reproduces the error, that would be great!

LudvigOlsen commented 1 year ago

It starts running after a short while. A is quite a long job (currently fails after 20+ minutes due to some bugs I'm working through).

Jobinfo for A:

[Screenshot, 2023-05-17 at 20:00:22]

Start times are identical in jobinfo: B: 2023-05-17T13:58:42 A: 2023-05-17T13:58:42

End time for B is 2023-05-17T13:58:52, A is still running.
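To make the timing explicit, the jobinfo timestamps quoted above can be compared directly. This is just arithmetic on the reported values; the variable names are made up for illustration.

```python
from datetime import datetime

# Timestamps as reported by jobinfo in this comment.
a_start = datetime.fromisoformat("2023-05-17T13:58:42")
b_start = datetime.fromisoformat("2023-05-17T13:58:42")
b_end = datetime.fromisoformat("2023-05-17T13:58:52")

# B started at the same moment as A instead of waiting for A to finish.
started_together = a_start == b_start

# How long B ran before it failed.
b_runtime_seconds = (b_end - b_start).total_seconds()

print(started_together, b_runtime_seconds)  # prints: True 10.0
```

So B ran for 10 seconds and ended while A was still running, which suggests B was not held back until A completed.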

Will see if I can make a reproducible example in the coming days :-) (It's 8PM here in Singapore)

dansondergaard commented 1 year ago

Can you provide the complete output of gwf info for target B? Also, can you provide the path (over e-mail) to the workflow file? Thanks :-)

LudvigOlsen commented 1 year ago

I've sent it all in a mail :-)