materialsproject / fireworks

The Fireworks Workflow Management Repo.
https://materialsproject.github.io/fireworks
Other
361 stars 185 forks source link

fixed Workflow.state #305

Closed richardjgowers closed 6 years ago

richardjgowers commented 6 years ago

Workflow.state isn't entirely reliable, for example in the Workflow below, this is reported as FIZZLED (in this case because a leaf node is WAITING).

screen shot 2018-08-07 at 11 52 43 am

From what I can tell, the only way to make the FIZZLED workflow state work is to check the spec of children (essentially FIZZLED == COMPLETED if all children are cool with it), but this will be slower because of the lazy attribute lookups.

I think this fixes it, unless I've missed some other corner cases

coveralls commented 6 years ago

Coverage Status

Coverage increased (+0.03%) to 60.491% when pulling f93f47d1156283cf6c87186c91808f8297c65e80 on richardjgowers:fizzled_wf_fix into 619ce1e1cc1f1377ce08ba3d68802269aff99448 on materialsproject:master.

computron commented 6 years ago

Hi Richard,

I am not sure I understand the problem; I haven't reviewed the code yet, but from the illustration I would say the workflow should indeed be "FIZZLED". See the documentation here: https://materialsproject.github.io/fireworks/reference.html#interpretation-of-state-of-fws-and-wfs

Basically, if any job in the workflow is FIZZLED, the workflow itself is considered FIZZLED - in your diagram, the workflow has no hope of completing (unless you rerun the job with red background).

I had trouble understanding the following text:

From what I can tell, the only way to make the FIZZLED workflow state work is to check the spec of children (essentially FIZZLED == COMPLETED if all children are cool with it), but this will be slower because of the lazy attribute lookups.

If you can give a little more context it would probably help me understand.

richardjgowers commented 6 years ago

@computron So to work around walltime limits on clusters, I fizzle lost fireworks, then the Firework after them (which allows fizzled parents) checks their exit status, and issues a detour with the remaining simulation time to run. Eg I want to run 100 iterations of some simulation, it gets killed after 75 by the cluster, so the next firework creates a new Firework with the last 25 iterations.

In this case the Workflow is perfectly fine with some FIZZLED FWs as long as it recovered afterwards. So the way to interpret this is to allow FIZZLED if all children allow fizzled parents.

computron commented 6 years ago

Thanks! This explains it much better. Actually I wasn't familiar with the original logic of that piece of code, this looks much improved.