`atomistics.workflows.langevin` shows 100% coverage on coveralls, but the tests are fundamentally flawed.
## Background
When we test MD output we face the fact that it is fundamentally stochastic, so there is always a chance that tests fail by bad luck. This happened in #145, where tests unrelated to the change failed and the PR was merged with failing unit tests (👎).
There we discussed some process (like raising an issue for known bugs), some intent ("Still it makes sense to test the rough order of magnitude of the energy to prevent unit conversion issues and so on." -- @jan-janssen, and I agree), and a patch was proposed in #146. To my chagrin, the patch just bumps two magic numbers to make the failure less likely.
Looking at the test for `LangevinWorkflow` (which weirdly is `TestLangevin.test_langevin` in the `test_hessian_lammps.py` file, so we should probably rename the file), I do not believe it actually tests whether the Langevin workflow is working as intended.
## Principle
A functioning Langevin thermostat should produce not just the expected mean temperature, but also the canonically expected variance in the temperature (something simpler velocity-rescaling thermostats fail at).
Ideally, we would test these properties, and as in particle physics we could control for the stochastic nature by requiring that sampled expectations fall within some multiple of the expected standard deviation -- tests might still fail occasionally, but we can directly control how many sigmas out a run has to be for that to happen.
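To make this concrete, here is a minimal sketch of such a sigma-based check (the function and variable names here are assumptions, not part of the existing test suite): for 3N translational degrees of freedom the canonical ensemble fixes both the mean kinetic energy, (3/2) N kB T, and its variance, (3/2) N (kB T)^2, so both can be held to an n-sigma window.

```python
import numpy as np

KB = 8.617333262e-05  # Boltzmann constant in eV/K

def assert_canonical_kinetics(eng_kin_samples, n_atoms, temperature, n_sigma=5.0):
    """Check post-equilibration kinetic energies against canonical values.

    Assumes the samples are total kinetic energies in eV, collected after
    equilibration and strided far enough apart to be roughly independent.
    """
    samples = np.asarray(eng_kin_samples)
    n = len(samples)
    # Equipartition over 3N translational degrees of freedom
    mean_expected = 1.5 * n_atoms * KB * temperature
    # Canonical fluctuations of the total kinetic energy
    var_expected = 1.5 * n_atoms * (KB * temperature) ** 2
    # n-sigma window on the sample mean (standard error of the mean)
    assert abs(samples.mean() - mean_expected) <= n_sigma * np.sqrt(var_expected / n)
    # n-sigma window on the sample variance, treating E_kin as roughly
    # Gaussian (good for large N): Var(s^2) ~= 2 var^2 / (n - 1)
    assert abs(samples.var(ddof=1) - var_expected) <= (
        n_sigma * var_expected * np.sqrt(2.0 / (n - 1))
    )
```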
For practical purposes we might not want to run an MD long enough in the tests to gather both of those statistics, although it might be feasible simply by using a very strong thermostat.
At a minimum, we can test the qualitative nature of the thermostat by checking that quantities move in the correct relative direction when increasing thermostat strength.
## Practice
The relevant portion of the tests reads like this:

```python
self.assertTrue(eng_pot_lst[-1] < 0.001)
self.assertTrue(eng_pot_lst[-1] > 0.0)
self.assertTrue(eng_kin_lst[0] < 32)
self.assertTrue(eng_kin_lst[0] > 20)
self.assertTrue(eng_kin_lst[-1] < 32)
self.assertTrue(eng_kin_lst[-1] > 20)
```

Going through it line by line:

- `self.assertTrue(eng_pot_lst[-1] < 0.001)`: For the Hessian this is better written as `self.assertAlmostEqual(0, ...)`.
- `self.assertTrue(eng_pot_lst[-1] > 0.0)`: This is fine, but with a strong thermostat we can rely on equipartition of energy to get a real target for this.
- `self.assertTrue(eng_kin_lst[0] < 32); self.assertTrue(eng_kin_lst[0] > 20)`: This is actually only testing the `get_initial_velocities` function, and indeed the Langevin tests could be better decomposed into unit tests here. There is also no need for a range: this is the zeroth step, and the result should be float-close to double the target temperature (due to the overheating scalar, which is indeed good to use here given the ideal starting positions and energy equipartition). Finally, there should be no magic numbers here. This is N kB (2T) ± some variance, and that should be made clear by defining variables (see the sketch after this list).
- `self.assertTrue(eng_kin_lst[-1] < 32); self.assertTrue(eng_kin_lst[-1] > 20)`: Testing against just the final value makes it very hard to form a reasonable expectation of the variance we should allow, as the thermostat only makes promises about expectation values. Moreover, we started at double the target temperature and are only running ten steps -- this is basically all just Langevin drag; it would at least be more meaningful to test `self.assertTrue(eng_kin_lst[-1] < eng_kin_lst[0])`.
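A minimal sketch of the magic-number-free version of the zeroth-step check could look like the following; `KB`, `n_atoms`, and `temperature` are assumed names rather than the test file's actual variables, and the 3/2 prefactor assumes 3N translational degrees of freedom (adjust to whatever convention the workflow uses):

```python
overheat = 2.0  # the overheating scalar discussed above
# Equipartition target for the initial kinetic energy at 2x temperature
eng_kin_initial = 1.5 * n_atoms * KB * overheat * temperature
# Float-close if get_initial_velocities() rescales exactly to the
# overheated target; otherwise widen this to a few standard deviations.
self.assertAlmostEqual(eng_kin_initial, eng_kin_lst[0])
```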
## TODO

- How much does it actually cost to run equilibration and sampling with an insanely strong thermostat?
- If this is not too expensive, we should check that the thermostat returns the correct \<T>, \<T^2>, and variance \<T^2> - \<T>^2, all after equilibration (along the lines of the sigma-based sketch in the Principle section).
- If this is too expensive and we can really only afford to spot-check a final value, the test should be run twice, with a stronger and a weaker thermostat -- both weak compared to the 10 steps we run for, so that the initial portion is dominated by drag and energy dissipation -- and then we should test:
```python
self.assertTrue(eng_kin_lst_weak[-1] < eng_kin_lst_weak[0])
self.assertTrue(eng_kin_lst_strong[-1] < eng_kin_lst_weak[-1])
```
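To illustrate that this paired comparison is robust, here is a self-contained toy (not the real `LangevinWorkflow`: free unit-mass particles with an exact Ornstein-Uhlenbeck velocity update standing in for the thermostat, with all parameter values invented for the demonstration):

```python
import numpy as np

def toy_langevin_kinetic_energies(friction, n_steps=10, n_dof=3000,
                                  dt=1.0, kT=1.0, overheat=2.0, seed=0):
    """Toy stand-in for the workflow: velocities follow an exact
    Ornstein-Uhlenbeck (Langevin) update from an overheated
    Maxwell-Boltzmann draw. Returns the kinetic energy at each step."""
    rng = np.random.default_rng(seed)
    v = rng.normal(scale=np.sqrt(overheat * kT), size=n_dof)  # unit masses
    c = np.exp(-friction * dt)  # velocity autocorrelation per step
    eng_kin = []
    for _ in range(n_steps):
        v = c * v + np.sqrt((1.0 - c**2) * kT) * rng.normal(size=n_dof)
        eng_kin.append(0.5 * np.sum(v**2))
    return eng_kin

eng_kin_lst_weak = toy_langevin_kinetic_energies(friction=0.05)
eng_kin_lst_strong = toy_langevin_kinetic_energies(friction=1.0)

# Both runs start overheated, so both lose kinetic energy over ten steps,
# and the stronger thermostat loses it faster:
assert eng_kin_lst_weak[-1] < eng_kin_lst_weak[0]
assert eng_kin_lst_strong[-1] < eng_kin_lst_weak[-1]
```

With this much separation between the two friction values the ordering holds by many standard deviations, which is exactly the property the paired test relies on.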