siesta-project / aiida_siesta_plugin

Source code for the AiiDA-Siesta package (plugin and workflows). See wiki

Taxonomy of failure modes #24

Open albgar opened 5 years ago

albgar commented 5 years ago

It would be good to have a list of possible "failure modes" in a Siesta calculation, classified according to their severity and potential for recovery. This information can then be encoded in the plugin/parser and used by the workflows.

We have already identified a broad-brush classification scheme: those errors which occur before a CML file is produced should result in an "Excepted" state. Others might be given a "Failed" state with an appropriate exit code. (Note: we should catch the former before attempting to parse the CML file, and provide a proper error message.)
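
The broad-brush scheme above could be sketched roughly as follows. This is only an illustration in plain Python, not the plugin's actual code: the `Outcome` enum, the `classify` function, and the default file name `aiida.xml` are hypothetical names chosen for the example.

```python
from enum import Enum
from pathlib import Path


class Outcome(Enum):
    """Hypothetical labels mirroring the states discussed in this issue."""
    EXCEPTED = "excepted"        # no CML file at all: nothing usable to parse
    FAILED = "failed"            # CML file exists, but a non-zero exit code was set
    FINISHED_OK = "finished_ok"  # CML file exists and exit code is zero


def classify(run_dir, cml_name="aiida.xml", exit_code=0):
    """Classify a run directory before any attempt to parse the CML file.

    `cml_name` is an assumed file name; adapt it to the real output prefix.
    """
    cml_path = Path(run_dir) / cml_name
    if not cml_path.is_file():
        # Caught before parsing, so a proper error message can be reported.
        return Outcome.EXCEPTED
    return Outcome.FINISHED_OK if exit_code == 0 else Outcome.FAILED
```

The point of the sketch is only the ordering: the existence check comes first, so a missing CML file never reaches the parsing stage.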

vdikan commented 5 years ago

I think it's already partly there: if no CML is produced, the calculation ends up in the Excepted state, and verdi process logshow shows the contents of Siesta's MESSAGES file. What could be improved:

vdikan commented 5 years ago

As a bookmark, my previous email where I could sketch the taxonomy [edited]:

Remember that WorkChains are now also processes that can finish with an exit code. Given that:

  • We might want to mark as Excepted the Calculations that failed badly, with no reasonable information produced [I mean the CML file first of all].
  • ~We might want to mark as Excepted the WorkChains that were not scripted properly with python~ ~(analog of badly assembled Siesta executable).~ Also, if a WorkChain contains an Excepted calculation, it should also become Excepted at once. ~Perhaps it works already this way, but I'm not sure.~ On the contrary, I'm sure that sometimes it doesn't: e.g. right now, the bug in checkpoints leads to serialization exceptions being shouted in the shell, but it is non-lethal to the WorkChain. I think it's because AiiDA's job submissions involve concurrency mechanisms that are harder to control than serial code. [Important: Does a drop of poison infect the tun of wine? It depends on the size. Small "atomic" workflows that fail should not corrupt a huge many-days-to-compute chain. At the same time, they themselves should be designated as Excepted. In other words, there needs to be a mechanism to Except a workchain from within, which we may use selectively.]
  • We might want to mark as Finished with non-zero the Calculations that failed in a controlled way and can be partly parsed/restarted, relying on the information produced. Actually, we do it now.
  • We might want to mark as Finished with non-zero the WorkChains that, during execution, contain the Finished_with_non_zero Calculations that they cannot handle further. I also show how. [In the pre-workshop example of Siesta-restart workchain]

bosonie commented 5 years ago

I'd say we all agree on marking as "Excepted" the calculations where not even the CML file is created. I think that if a calculation is excepted inside a workchain, the workchain should be excepted as well. It should be good practice for any user to write workchains with checkpoints from which one can restart. An excepted workchain should mark a situation where human intervention is needed before a restart.

The non-zero exit codes I would implement (as they come to mind now) are:

  • SCF not converged
  • geometry not converged
  • problem in the basis set specifications (too small split norm, for example)
  • parsing failure of the info in the .xml file (maybe two different codes, one for parameters and one for forces/stress)
  • parsing failure of the .bands file (in case bandkpoints is set)
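
As a rough sketch of how the list above might be written down, the proposed codes could be collected in one enumeration. Everything here is an assumption made for illustration: the class name `SiestaExit`, the member names, the numeric values, and the priority order in `choose_exit_code` are all hypothetical and are not taken from the plugin.

```python
from enum import IntEnum


class SiestaExit(IntEnum):
    """Hypothetical exit codes; the numeric values are placeholders."""
    OK = 0
    SCF_NOT_CONVERGED = 1
    GEOMETRY_NOT_CONVERGED = 2
    BASIS_SPECIFICATION_PROBLEM = 3
    XML_PARAMETERS_PARSE_FAILED = 4
    XML_FORCES_STRESS_PARSE_FAILED = 5
    BANDS_PARSE_FAILED = 6


def choose_exit_code(scf_converged, geometry_converged,
                     bands_requested=False, bands_parsed=True):
    """Return the first applicable code (the ordering is an assumption)."""
    if not scf_converged:
        return SiestaExit.SCF_NOT_CONVERGED
    if not geometry_converged:
        return SiestaExit.GEOMETRY_NOT_CONVERGED
    # A missing .bands result only matters when bands were requested.
    if bands_requested and not bands_parsed:
        return SiestaExit.BANDS_PARSE_FAILED
    return SiestaExit.OK
```

Keeping the codes in a single place like this would let a workchain compare against named values instead of bare integers.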

Anytime I face a new problem I'll post it here.

bosonie commented 4 years ago

A few more situations I have encountered:

  1. The Siesta calculation crashes, leaving the .xml file incomplete (it doesn't end with the closing "cml" tag). In this case "minidom" from "xml.dom" raises an error. To avoid crashing the Siesta parser, we now use an OutputParsingError. However, something clearer could be implemented.
  2. The files to retrieve are all produced, but the Siesta calculation returns an error and the MESSAGES file reports "FATAL:". This happens, for instance, when there are problems with the basis or the pseudos. At the moment, the parser doesn't raise any error in this case. In fact, it gathers from the .xml file the little information available about the Siesta version and then exits with code 0. So far, we don't have any minimum requirement on the info to be retrieved. The "FATAL:" info in MESSAGES is parsed, but no action is implemented for it, so the calculation exits with code 0. This needs to be changed, in my opinion.
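
The two situations above could be checked along these lines. This is a minimal standard-library sketch, not the parser's current code: the function names and the two exit-code constants are hypothetical, and only `xml.dom.minidom.parseString` and `xml.parsers.expat.ExpatError` are real library names.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical exit codes, chosen only for this example.
EXIT_XML_INCOMPLETE = 10
EXIT_FATAL_IN_MESSAGES = 11


def cml_is_complete(xml_text):
    """Return False if the XML is truncated or otherwise not well formed."""
    try:
        minidom.parseString(xml_text)
    except ExpatError:
        return False
    return True


def fatal_lines(messages_text):
    """Collect the lines of the MESSAGES file that report "FATAL:"."""
    return [line.strip() for line in messages_text.splitlines()
            if "FATAL:" in line]


def check_outputs(xml_text, messages_text):
    """Return (exit_code, reason); (0, "") means neither problem was found."""
    if not cml_is_complete(xml_text):
        return EXIT_XML_INCOMPLETE, "XML output is incomplete"
    fatal = fatal_lines(messages_text)
    if fatal:
        # Report the first fatal message instead of silently exiting with 0.
        return EXIT_FATAL_IN_MESSAGES, fatal[0]
    return 0, ""
```

With a check of this kind, a run that produces all files but reports "FATAL:" would no longer end with code 0.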