kassonlab / run_brer

Python package for running bias-resampling ensemble refinement (BRER) simulations
GNU Lesser General Public License v2.1

HPC environment optimizations can interact badly with checkpoint verification. #26

Open eirrgang opened 3 years ago

eirrgang commented 3 years ago

We have encountered some problems with the file existence checks (e.g. https://github.com/kassonlab/run_brer/blob/532e5edf177734733a11c511e75cde922fccaf5a/run_brer/run_config.py#L231): newly created files are sometimes not detected, which prevents the BRER simulation from advancing to the next phase correctly.

On Frontera, the issue was traced all the way down to failures in the OS-level filesystem stat system call. The problem disappeared when we deactivated the Python module caching tool recommended by the HPC admins.

Though the workaround was successful, we don't have a good way to detect the underlying problem to warn users or to make our internal checking more robust.

We need some combination of documentation and Python logic to handle this situation better.
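As a starting point for the "Python logic" half, here is a minimal sketch of a more tolerant existence check. It is not what run_brer currently does. The function name wait_for_file and its timeout and interval parameters are made up for illustration. It polls for a short time and also asks for a directory listing as a second opinion, on the assumption that a stale or intercepted stat result may not affect the listing. That assumption has not been tested against the caching tool, which may well intercept directory reads too.

```python
import os
import time


def wait_for_file(path, timeout=10.0, interval=0.5):
    """Return True once `path` is visible, or False after `timeout` seconds.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    path = os.path.abspath(path)
    parent, name = os.path.split(path)
    deadline = time.monotonic() + timeout
    while True:
        # First try the ordinary stat-based check.
        if os.path.exists(path):
            return True
        # Second opinion: look for the name in the parent directory listing,
        # which does not require a stat call on the file itself.
        try:
            if name in os.listdir(parent):
                return True
        except OSError:
            # The parent directory may itself be missing or unreadable.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller would replace a bare os.path.exists(checkpoint) with wait_for_file(checkpoint) before deciding that a phase did not complete. The cost is a delay of up to timeout seconds whenever the file really is missing.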

eirrgang commented 3 years ago

I have contacted Albert to try to get a clearer picture of what was going on and how big an issue it is.

As I suspected, the caching module code intercepts/overrides several POSIX system calls. The code (see https://github.com/TACC/ooops for a similar example) probably operates at the granularity of the filesystem mount, not the path. It might therefore be sufficient to make sure that we do intra-job I/O exclusively on local filesystems, but this is probably not practical, and it still leaves open the question of how to both cache Python modules and collaborate on shared files.

There is probably not a practical way to check for such overridden system calls from the Python level. At least in the case of the TACC lmod environment module, we could check whether LD_PRELOAD is set. This would be enough to issue a warning, but we would have to actually check the shared object that it is set to if we wanted to know for sure that system calls were being tampered with. It is conceivable that we could do this with the Python ctypes module.
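The LD_PRELOAD check could look roughly like the sketch below. The function names are hypothetical and nothing like this exists in run_brer yet. It only reports that something is preloaded. It does not inspect the shared objects, so it cannot tell whether filesystem calls are actually being overridden.

```python
import os
import warnings


def preloaded_libraries(environ=None):
    """Return the shared objects named in LD_PRELOAD (possibly an empty list)."""
    if environ is None:
        environ = os.environ
    value = environ.get("LD_PRELOAD", "")
    # The dynamic linker accepts both colons and spaces as separators.
    return value.replace(":", " ").split()


def warn_if_preloaded(environ=None):
    """Warn when LD_PRELOAD is set, and return the list of preloaded objects."""
    libs = preloaded_libraries(environ)
    if libs:
        warnings.warn(
            "LD_PRELOAD is set ({}). Preloaded libraries may override "
            "filesystem calls and interfere with checkpoint file "
            "detection.".format(", ".join(libs)),
            RuntimeWarning,
        )
    return libs
```

Calling warn_if_preloaded() once at start-up would be enough to point users at the documentation, at the cost of false positives for harmless preloads.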

We might propose to the caching module authors that the machinery be reimplemented as a Python module. It could be imported early to replace only the Python import machinery, and/or it could provide alternative os module functionality that makes unmodified filesystem calls.

The only course of action I can firmly recommend right now is to experiment a bit with the python_cacher environment module, to make sure we understand and document its effects on the code in run_brer/run_config.py.