libAtoms / workflow

python workflow toolkit
GNU General Public License v2.0

unexpectedly changed rng state confuses detection of identical jobs #284

Closed bernstei closed 5 months ago

bernstei commented 6 months ago

Problems (probably like #283) can be caused by workflow scripts that do not ensure that the random number generator (rng) is in the same state on every run. In particular, ops like minim and MD can use random pressures, and therefore depend on a random seed.

If an op has completed, its output file is proof that it's done, and when the script is run again that output is just read back. If, however, the script is interrupted while waiting for a queued job to finish, the rng state on the rerun might be different, for one of two reasons: the user didn't set a deterministic seed, or an earlier minim/MD actually ran the first time (changing the rng state) but was skipped the second time (leaving the rng state unchanged). If this happens, the identical-job detector will label those jobs as different because their seeds differ, and will not attempt to get results from the existing job stage dir.
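To make the failure concrete, here is a minimal sketch of the second cause. It is not wfl's actual job-identity code: `job_id` and `run_script` are hypothetical names, the hash of op name, parameters, and seed is an assumed stand-in for the real identical-job check, and the standard library `random.Random` stands in for whatever rng the script uses.

```python
import hashlib
import json
import random


def job_id(op_name, params, seed):
    """Hypothetical identity hash over op name, parameters, and the job's seed."""
    payload = json.dumps({"op": op_name, "params": params, "seed": seed}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def run_script(rng, first_op_cached):
    """Two ops in sequence; the first draws from the rng only if it actually runs."""
    if not first_op_cached:
        rng.random()  # op 1 does real work and advances the rng state
    # op 2 is submitted to a queue; its seed is drawn from the current rng state
    seed = rng.randrange(2**31)
    return job_id("md", {"steps": 100}, seed)


# First run: op 1 executes, then the script is interrupted while op 2 is queued.
first = run_script(random.Random(5), first_op_cached=False)
# Rerun: op 1's output file exists, so it is read back and the rng is not advanced.
rerun = run_script(random.Random(5), first_op_cached=True)

print(first == rerun)
```

Even though both runs start from the same seed, the queued job gets a different identity on the rerun, so the existing job stage dir would not be matched.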

Obviously a script that doesn't set a deterministic seed is simply not deterministic, and maybe it's not reasonable to expect that rerunning it will successfully reuse previous runs' results. It might be nice to make this harder to do accidentally, e.g. by requiring the user to pass a numpy Generator object to ops that use random numbers. Maybe those ops should still work without an rng as long as they don't actually need to generate random pressures?
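A rough sketch of what that requirement could look like, assuming a hypothetical helper `sample_pressure` and a made-up `("uniform", low, high)` tuple convention for random pressures (neither is taken from the existing code). A fixed pressure needs no rng, while a random one fails loudly if no `numpy.random.Generator` is passed in.

```python
import numpy as np


def sample_pressure(pressure, rng=None):
    """Return a pressure value.

    `pressure` is either a number (used as-is) or a ("uniform", low, high)
    tuple, which is a hypothetical convention for this sketch.
    """
    if isinstance(pressure, (int, float)):
        return float(pressure)  # no randomness needed, so no rng is required
    if rng is None:
        raise ValueError("a numpy Generator is required to sample a random pressure")
    kind, low, high = pressure
    if kind != "uniform":
        raise ValueError(f"unknown distribution {kind!r}")
    return float(rng.uniform(low, high))


# Fixed pressure: works without any rng.
fixed = sample_pressure(1.5)

# Random pressure: the caller owns the Generator, so the seed is explicit.
rng = np.random.default_rng(5)
sampled = sample_pressure(("uniform", 0.0, 2.0), rng=rng)
```

Because the caller has to construct the Generator, forgetting to choose a seed becomes a visible decision in the script rather than a silent default.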

The second failure mode (an op that did real work on the first run, and so advanced the rng state, is skipped on the rerun and leaves the rng state untouched) is harder to deal with. I've implemented rng state caching functions, but I'm not sure how smoothly those will work in real-world situations. Maybe this needs to be part of a sort of utils package, which helps write more reliable workflows?
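For illustration only, one way rng state caching could work is to save the Generator's state right after an op runs and restore it whenever the op is skipped. The helper `run_or_restore` and the state file name are hypothetical and are not the functions mentioned above; the sketch relies only on numpy's `bit_generator.state` property and `pickle`.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np


def run_or_restore(rng, state_file, op):
    """Run `op(rng)` once; on later calls skip it but restore the rng state it left.

    Hypothetical helper for this sketch. Returns True if the op actually ran.
    """
    state_file = Path(state_file)
    if state_file.exists():
        # Op already done: put the rng where it was after the op originally ran.
        with open(state_file, "rb") as f:
            rng.bit_generator.state = pickle.load(f)
        return False
    op(rng)
    with open(state_file, "wb") as f:
        pickle.dump(rng.bit_generator.state, f)
    return True


with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "op1.rng_state"

    # First run: the op executes and consumes random numbers.
    rng = np.random.default_rng(5)
    run_or_restore(rng, state_file, lambda r: r.normal(size=3))
    after_first = int(rng.integers(2**31))

    # Rerun of the script: the op is skipped, but the rng state is restored.
    rng = np.random.default_rng(5)
    ran = run_or_restore(rng, state_file, lambda r: r.normal(size=3))
    after_rerun = int(rng.integers(2**31))
```

With the state restored, the next draw after the skipped op matches the first run, so a seed derived from it would be the same in both runs. In a real workflow the state file would presumably live next to the op's output file so the two cannot get out of sync.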

bernstei commented 6 months ago

I'm working on a PR to make this easier to do properly.