Allow engines to gracefully handle evaluate_process failures

greschd commented 4 years ago

The optimization engine should be able to decide if a failed evaluate process means failure of the entire workchain, or if it can continue in some way.

This can probably be done by extending the interface of the engine's update method, but the details need fleshing out.

greschd commented 4 years ago

@zooks97 I'm considering some options as to what to pass to the engine here:

just a list of processes (by index) that failed
the exit codes of the processes
the entire failed process

If the engine is supposed to be implemented in a generic way (i.e., independent of which evaluate process is run), passing the list of failed processes should be enough. But on the other hand, it's conceivable that we might want to create an engine tailored to a specific use case.

What's your opinion here? Do you have a better idea?

zooks97 commented 4 years ago

I think that the engine should be able to access the failed process, mostly so that it can check the process' exit code and also any exit codes down the line from that process (e.g., for quantum espresso, if the base work chain fails, you may want to go check the exit codes of its child process(es), the calculations).

I'm not sure if this is correct, but I think passing just the indices is good enough for this purpose and allows the engine to choose whether it wants to go through and load the processes and check them out or not without us having to pass the entire failed processes around.

Passing the index also makes it much easier for the engine to ignore the appropriate indices in its inputs and outputs when checking if optimization is complete or when deciding what to do next.

zooks97 commented 3 years ago

This issue has come up again recently in discussions about the aiida-sssp-workflows package as it is moving towards production.

I still think that passing failed processes by index is the most lightweight but still flexible solution. However, I'm wondering what to pass as an output in the case of failure.

@greschd do you have any new thoughts on how you'd like to see this implemented?

greschd commented 3 years ago

In the aiida-sssp-workflows case, do you need access to the specific details of the failed process, or just know the fact that it failed?

In general, I agree that passing the index is the simplest solution. One thing to consider is that the "index" as used by aiida-optimize is not the same as the process PK; so maybe we would need to pass both.

One worry with putting too much logic into the engine is that we would (in general) like the engines to be usable independently of the details of the "evaluation" process. I think for a certain class of failures (where they still produce the expected output in some form), the "correct" way of handling it would be putting that logic into the evaluation process itself.

By the way, sort of off-topic: I've added a new "concatenate" wrapper workchain that can run an arbitrary number of connected processes (extending the concept of the CreateEvaluateWorkChain). That may help in adding a lightweight "error handler" at the end.

zooks97 commented 3 years ago

In the case of aiida-sssp-workflows, I think we only really need to know the fact that the process failed.

Although the aiida-optimize index is not the same as the process PK, the engine has access to enough information that it could look up the process PK/UUID/node and get it on its own, right?

I agree that the evaluation process should be in charge of communicating when its failure is critical or not. Maybe we can supply a wrapper workchain where you can define the appropriate logic if it is not already baked into the underlying evaluation process.

greschd commented 3 years ago

Although the aiida-optimize index is not the same as the process PK, the engine has access to enough information that it could look up the process PK/UUID/node and get it on its own, right?

I think it's possible in principle, but would be a bit of a hack. In general, the workchain takes care of the mapping between AiiDA PK and "our" indices.

I agree that the evaluation process should be in charge of communicating when its failure is critical or not. Maybe we can supply a wrapper workchain where you can define the appropriate logic

Yeah, that makes sense.

In the case of aiida-sssp-workflows, I think we only really need to know the fact that the process failed.

Ok, I think it makes sense to cover this case first. Here's a very rough sketch of things to do:

add a failed flag to the Result class
change (and maybe rename) ResultMapping.add_outputs and OptimizationEngineImpl.update to carry that through.
for the _update method maybe it would be good to add an alternative _update_with_errors, such that engines that can't deal with errors don't need to add special code for that
add a method for engines to signal their failure up to the OptimizationWorkChain
back in OptimizationEngineImpl.update, implement the switch between engines that have _update_with_errors and those with _update. For those with a plain _update, signal failure if there is a failed process.

This is just a first idea of how it could be done, so any suggestions for improving upon that are very much welcome.

greschd / aiida-optimize

Allow engines to gracefully handle evaluate_process failures #22