lsmo-epfl / aiida-lsmo

AiiDA workflows for the LSMO laboratory at EPFL
Other
9 stars 13 forks source link

flaky tests #102

Closed ltalirz closed 1 year ago

ltalirz commented 1 year ago

Looking at the recent CI builds, it appears that at least one of the tests can fail at random (out of the last 4 builds, run_binding_energy_co2_mof74 failed twice and succeeded twice without any changes that should be relevant to it).

A quick look at the test does not tell me what exactly the problem is; this would require local debugging

Error message on CI:

_________________________ run_binding_energy_co2_mof74 _________________________

cp2k_code = <Code: Remote code 'mock-cp2k-7.1-184360d5-02ea-4707-8a59-a14e70d70bee' on localhost-test, pk: [32](https://github.com/lsmo-epfl/aiida-lsmo/actions/runs/3454892024/jobs/5766526762#step:6:33), uuid: 7ad89f67-ed84-4d04-9adb-d76d17fb4a62>
zn_mof74 = <StructureData: uuid: 80e4c61b-2627-4f1b-8b18-b31ff5c43e84 (pk: 39)>
co2_in_mof74 = <StructureData: uuid: 912e7d2c-e308-4b81-b677-0a9fdc60f07f (pk: 40)>

    def run_binding_energy_co2_mof74(cp2k_code, zn_mof74, co2_in_mof74):  # pylint: disable=redefined-outer-name
        """Compute binding energy of CO2 in MOF 74"""

        print('Testing CP2K BindingEnergy work chain for CO2 in Zn-MOF-74 ...')

        # Construct process builder
        builder = BindingEnergyWorkChain.get_builder()
        builder.structure = zn_mof74
        builder.molecule = co2_in_mof74
        builder.protocol_tag = Str('test')
        builder.cp2k_base.cp2k.parameters = Dict(dict={ # Lowering CP2K default setting for a faster test calculation
            'FORCE_EVAL': {
                'DFT': {
                    'SCF': {
                        'EPS_SCF': 1.0E-4,
                        'OUTER_SCF': {
                            'EPS_SCF': 1.0E-4,
                        },
                    },
                },
            },
            'MOTION': {
                'GEO_OPT': {
                    'MAX_ITER': 5
                }
            },
        })
        builder.cp2k_base.cp2k.code = cp2k_code
        builder.cp2k_base.cp2k.metadata.options.resources = {
            'num_machines': 1,
            'num_mpiprocs_per_machine': 1,  # increase this to 4 in order to speed up the calculation
        }
        builder.cp2k_base.cp2k.metadata.options.withmpi = False  # comment this for parallel cp2k executable
        builder.cp2k_base.cp2k.metadata.options.max_wallclock_seconds = 1 * 60 * 60

        # The following is not needed, if the files are available in the data directory of your CP2K executable
        cp2k_dir = DATA_DIR / 'cp2k'
        builder.cp2k_base.cp2k.file = {
            'basis': SinglefileData(file=str(cp2k_dir / 'BASIS_MOLOPT')),
            'pseudo': SinglefileData(file=str(cp2k_dir / 'GTH_POTENTIALS')),
            'dftd3': SinglefileData(file=str(cp2k_dir / 'dftd3.dat')),
        }

        results, node = engine.run_get_node(builder)

>       assert node.is_finished_ok, results
E       AssertionError: {}
E       assert False
E        +  where False = <WorkChainNode: uuid: 314914fd-c5f2-4b72-b09a-b7e0c707d[42](https://github.com/lsmo-epfl/aiida-lsmo/actions/runs/3454892024/jobs/5766526762#step:6:43)c (pk: [44](https://github.com/lsmo-epfl/aiida-lsmo/actions/runs/3454892024/jobs/5766526762#step:6:45)) (aiida.workflows:lsmo.cp2k_binding_energy)>.is_finished_ok

examples/test_Cp2kBindingEnergy_CO2_MOF[74](https://github.com/lsmo-epfl/aiida-lsmo/actions/runs/3454892024/jobs/5766526762#step:6:75).py:[82](https://github.com/lsmo-epfl/aiida-lsmo/actions/runs/3454892024/jobs/5766526762#step:6:83): AssertionError
mpougin commented 1 year ago

I assume the problem are the asserted results, the tolerances are very low especially for the reduced accuracy. Strangely it always succeeds locally for me. But I will make a new PR and test with less strict tolerances

mpougin commented 1 year ago

as spotted by @chrisjsewell: the test fails due to rounding errors in the xyz file that gets created https://github.com/aiidateam/aiida-cp2k/blob/7cf6d7645cce64af113225434a89eb681b27bc58/aiida_cp2k/calculations/__init__.py#L386

ltalirz commented 1 year ago

Oh, I see.

The issue is that the machine epsilon of 64bit floating point numbers is e=2.2e-16, i.e. printing floating point numbers to 16 digits (as this function does) enters into territory outside the precision to which the numbers are actually represented in the computer.

You basically have two choices here of how to address the problem: A) eliminate the randomness B) reduce the precision to which the atomic positions are printed

B) is the easier route - printing the numbers only to, say, 10 digits would probably solve the issue immediately.

It probably has negligible effect on calculations in practice, but it would mean that there will be a slight difference in the atomic positions between a structure read from cp2k output, and a new input structure written by AiiDA.

As for the randomness, there are in principle two possible sources:

  1. atoms.get_positions()
  2. python string interpolation

My guess would be it is the first. I'm not sure it will be straightforward to fix, as it might depend on details like whether numpy is using the MKL or another BLAS library for certain transformations of the cell, etc.

mpougin commented 1 year ago

actually the differences are really big (see attached) can this be caused just by the rounding? And if yes, reducing the precision to 10 digits would not solve it right? image

kjappelbaum commented 1 year ago

actually the differences are really big (see attached) can this be caused just by the rounding? And if yes, reducing the precision to 10 digits would not solve it right?

what are the two screenshots you are showing? How did you create them?

mpougin commented 1 year ago

actually the differences are really big (see attached) can this be caused just by the rounding? And if yes, reducing the precision to 10 digits would not solve it right?

what are the two screenshots you are showing? How did you create them?

@chrisjsewell ?

ltalirz commented 1 year ago

actually the differences are really big (see attached) can this be caused just by the rounding? And if yes, reducing the precision to 10 digits would not solve it right? image

If these two XYZ files were produced by calling get_positions on the same StructureData, then there is indeed a bug, not a rounding error.

chrisjsewell commented 1 year ago

If these two XYZ files were produced by calling get_positions on the same StructureData, then there is indeed a bug, not a rounding error.

Note there are two different things possibly at play here:

  1. the top (Zn) diff could indeed just be a rounding error difference when creating the string {x:25.16f}

  2. the bottom (COO) diff is to do with how the CO2 molecule is "attached" to the ZnMOF74 within the workchain; so thats more broadly related to the workchain and its python code e.g. the aiida_structure_merge calcfunction, version of ASE etc, i.e. potentially "compounded" rounding errors

chrisjsewell commented 1 year ago

what are the two screenshots you are showing? How did you create them?

From #103; run locally and looking at CI output

mpougin commented 1 year ago

I think we can close this now @ltalirz ?

ltalirz commented 1 year ago

Sure!

ebp-bot commented 1 year ago

The only thing I'd note, is that #108 is a very "dumb" fix, specific to the one failing test (just mapping its input hash to another one). If you add other tests, that include this "merging" of a molecule onto a MOF, that will probably fail as well. So just bear that in mind.

As noted above, the differences in the atomic position of the molecule, for CI vs local, are relatively big, so if you do have at some point, it might be good to look into why that is

chrisjsewell commented 1 year ago

oops thats me, using the wrong account lol

mpougin commented 1 year ago

great, thanks for the heads-up @chrisjsewell