radical-cybertools / radical.repex.at

This is the github location for RepEx developed by the RADICAL team in conjunction with the York Lab.

entire simulation stalled because of one crashed replica #75

Closed haoyuanchen closed 8 years ago

haoyuanchen commented 8 years ago

My TUU simulation with 128 replicas got stuck at the exchange step in the 8th cycle. I checked and found that one replica had crashed. I remember that we used to have a mechanism that ignored a replica after a certain amount of time and moved on to the next cycle, for example by putting a zero for that replica in the exchange matrix. What is the current mechanism in RepEx for handling a crashed replica?
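For illustration, the "put a zero in the exchange matrix" idea could look roughly like the sketch below. This is not taken from the RepEx code base: the function names, the list-of-lists matrix layout, and the `failed` set are all assumptions made for the example.

```python
def mask_failed_replicas(exchange_matrix, failed):
    """Return a copy of exchange_matrix with failed replicas zeroed out.

    exchange_matrix: square list of lists; entry [i][j] is whatever weight
                     the exchange step assigns to swapping replicas i and j
                     (hypothetical layout, not the RepEx data structure).
    failed:          set of replica indices whose MD step crashed or timed out.
    """
    n = len(exchange_matrix)
    return [
        [0.0 if (i in failed or j in failed) else exchange_matrix[i][j]
         for j in range(n)]
        for i in range(n)
    ]


def eligible_replicas(n_replicas, failed):
    """Indices of replicas that may still take part in the exchange."""
    return [i for i in range(n_replicas) if i not in failed]


if __name__ == "__main__":
    matrix = [[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.7],
              [0.2, 0.7, 1.0]]
    # Replica 1 crashed: its row and column are zeroed, the others are kept.
    print(mask_failed_replicas(matrix, failed={1}))
    print(eligible_replicas(3, failed={1}))
```

With the row and column of a crashed replica set to zero, the exchange step can proceed with the remaining replicas instead of waiting for output that will never arrive.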

antonst commented 8 years ago

I assume you are using the devel branch? Was it the MD step or the exchange step that failed? In which dimension was this observed?

haoyuanchen commented 8 years ago

Yes, it's the devel branch. One replica failed in the MD step, which left the exchange step waiting for it indefinitely. This was in the U dimension.

antonst commented 8 years ago

I know what the problem is. Instead of a print there is a sys.exit() call with a message. Please check the latest devel branch.
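The thread does not show the affected code, so the following is only a guess at the general pattern being described: a helper that calls sys.exit() on a bad replica output stops the whole process, while one that prints a message and returns a sentinel lets the caller skip that replica. The function names and the file format (first whitespace-separated token is an energy) are hypothetical.

```python
import sys


def read_energy_or_exit(path):
    """Strict variant: a missing or malformed file ends the whole process."""
    try:
        with open(path) as handle:
            return float(handle.read().split()[0])
    except (OSError, ValueError, IndexError) as err:
        # sys.exit raises SystemExit, so nothing after this call runs.
        sys.exit(f"could not read energy from {path}: {err}")


def read_energy_or_none(path):
    """Tolerant variant: report the problem and let the caller carry on."""
    try:
        with open(path) as handle:
            return float(handle.read().split()[0])
    except (OSError, ValueError, IndexError) as err:
        print(f"could not read energy from {path}: {err}", file=sys.stderr)
        return None  # caller treats None as "this replica failed"


if __name__ == "__main__":
    # Hypothetical file names; none of them needs to exist for the demo.
    paths = ["replica_0.energy", "replica_1.energy"]
    energies = [read_energy_or_none(p) for p in paths]
    failed = {i for i, e in enumerate(energies) if e is None}
    print("failed replicas:", sorted(failed))
```

The tolerant variant pairs naturally with an exchange step that excludes replicas whose energy is None.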

haoyuanchen commented 8 years ago

I ran the new code, but this time no replica crashed, so I can't tell whether the fix works. I will let you know if I find out. Thanks!

haoyuanchen commented 8 years ago

In a larger-scale simulation with the newest code (devel branch, from the virtual machine), the simulation stalled after 7 cycles and the log file says the CU for the exchange step failed. However, I looked at it and it finished successfully: the pairs_for_exchange file was generated and is not empty.

haoyuanchen commented 8 years ago

In another simulation, in which I intentionally made some replicas prone to crashing, more than half of the replicas crashed in the first MD step. The first exchange step finished and produced the pair_for_exchange file (empty, no one exchanged), but the entire simulation still stalled and did not proceed after that. Maybe because too many replicas crashed?

antonst commented 8 years ago

> In a larger-scale simulation with the newest code (devel branch, from the virtual machine), the simulation stalled after 7 cycles and the log file says the CU for the exchange step failed. However, I looked at it and it finished successfully: the pairs_for_exchange file was generated and is not empty.

Is this for TUU as well? Please try running with the feature/perfopt_gen branch. Do you happen to have the terminal output for this run?

antonst commented 8 years ago

> In another simulation, in which I intentionally made some replicas prone to crashing, more than half of the replicas crashed in the first MD step. The first exchange step finished and produced the pair_for_exchange file (empty, no one exchanged), but the entire simulation still stalled and did not proceed after that. Maybe because too many replicas crashed?

Even if all replicas crash, the simulation should continue. Please try running with the feature/perfopt_gen branch.

antonst commented 8 years ago

Closed due to lack of response.