jdrusso / msm_we

History-augmented Markov analysis of weighted ensemble trajectories.
https://msm-we.readthedocs.io
MIT License

Handling failed model building #22

Open jdrusso opened 2 years ago

jdrusso commented 2 years ago

The haMSM building + optimization plugins can be fragile to failures during model-building or optimization.

When attempting to restart after a failure, issues arise such as

Need to check whether there's a good way to restore WESTPA state when launching directly into plugin execution.

jdrusso commented 2 years ago

Example:

I initialized this run and ran it up to model building + optimization, where it crashed. When I restarted it with w_run, it completed the model building successfully, performed the optimization, and then failed with:

exception caught; shutting down
-- ERROR    [w_run] -- error message: 'WEDriver' object has no attribute '_parent_map'
-- ERROR    [w_run] -- Traceback (most recent call last):
  File "/home/jd/westpa/src/westpa/cli/core/w_run.py", line 65, in run_simulation
    sim_manager.finalize_run()
  File "/home/jd/westpa/src/westpa/core/sim_manager.py", line 815, in finalize_run
    self.invoke_callbacks(self.finalize_run)
  File "/home/jd/westpa/src/westpa/core/sim_manager.py", line 140, in invoke_callbacks
    fn(*args, **kwargs)
  File "/home/jd/msm_we/msm_we/westpa_plugins/optimization_driver.py", line 143, in do_optimization
    self.update_westpa_pcoord(new_pcoord_map)
  File "/home/jd/msm_we/msm_we/westpa_plugins/optimization_driver.py", line 380, in update_westpa_pcoord
    parent_state_index = get_segment_parent_index(segment)
  File "/home/jd/research/SynD/synd/westpa/propagator.py", line 50, in get_segment_parent_index
    parent_map = sim_manager.we_driver._parent_map
AttributeError: 'WEDriver' object has no attribute '_parent_map'

It seems that when I relaunched the run, _parent_map was never populated on the WEDriver, so the lookup in get_segment_parent_index fails.
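One way to make this failure easier to diagnose is to check for the attribute before using it. The following is only a minimal sketch: `FakeWEDriver` and `get_parent_map` are hypothetical stand-ins written for illustration, not part of WESTPA, SynD, or msm_we. It assumes only what the traceback shows, namely that `_parent_map` may be absent on a freshly restarted driver.

```python
class FakeWEDriver:
    """Hypothetical stand-in for a WE driver; NOT the real westpa WEDriver."""


def get_parent_map(we_driver):
    """Return the driver's _parent_map, or fail with a descriptive error.

    Assumption: after a restart that jumps straight into plugin execution,
    the attribute may never have been set (as in the traceback above).
    """
    parent_map = getattr(we_driver, "_parent_map", None)
    if parent_map is None:
        raise RuntimeError(
            "we_driver._parent_map is not populated; the run was probably "
            "restarted directly into plugin execution without running an "
            "iteration first."
        )
    return parent_map


if __name__ == "__main__":
    driver = FakeWEDriver()
    try:
        get_parent_map(driver)
    except RuntimeError as err:
        print(f"caught: {err}")

    # Once the attribute exists, the lookup succeeds.
    driver._parent_map = {0: 3}
    print(get_parent_map(driver))
```

This does not restore the missing state; it only turns the bare AttributeError into an error that names the likely cause.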

jdrusso commented 2 years ago

Part of this is because of the way I specify restarting iterations.

The max_iterations that WESTPA gets is the number of iterations between optimizations, and the optimization plugin then extends it.

But that means that when I re-initialize the system, WESTPA sees the run as already completed.

Instead, I can run the plugin post-iteration, whenever n_iter % restart_interval == 0.
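The post-iteration check could look roughly like the sketch below. The function name `should_run_optimization` is hypothetical, and the wiring into WESTPA's callback system is deliberately left out; only the `n_iter % restart_interval == 0` condition comes from the comment above. Skipping `n_iter == 0` is an added assumption, so that nothing triggers before any iteration has run.

```python
def should_run_optimization(n_iter: int, restart_interval: int) -> bool:
    """Decide whether the optimization step should run after iteration n_iter.

    Hypothetical helper: returns True on every restart_interval-th iteration.
    """
    if restart_interval <= 0:
        raise ValueError("restart_interval must be a positive integer")
    # Assumption: iteration 0 (nothing propagated yet) never triggers.
    return n_iter > 0 and n_iter % restart_interval == 0


if __name__ == "__main__":
    # With restart_interval = 5, only iterations 5 and 10 trigger.
    triggered = [n for n in range(1, 11) if should_run_optimization(n, 5)]
    print(triggered)  # [5, 10]
```

With a check like this, max_iterations can hold the true total length of the run, so a restarted run would no longer look completed just because it reached an optimization point.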