TUU stuck in first cycle

haoyuanchen commented 8 years ago

In a TUU production run, after all replicas finished their first MD cycle, the simulation just stopped and did nothing. No errors were reported in the log file. It also seems that no CU for the exchange calculation was created.

Input:

{
    "remd.input": {
        "re_pattern": "S",
        "number_of_cycles": "39",
        "input_folder": "tuu_remd_inputs",
        "input_file_basename": "ace_ala_nme_remd",
        "amber_input": "ace_ala_nme.mdin",
        "us_template": "ace_ala_nme_double.RST",
        "amber_parameters": "ace_ala_nme.parm7",
        "amber_coordinates_folder": "ace_ala_nme_coors_8x8",
        "same_coordinates": "False",
        "group_exec": "False",
        "init_temp": "300.0",
        "replica_mpi": "False",
        "replica_cores": "1",
        "steps_per_cycle": "10000",
        "download_mdinfo": "False",
        "download_mdout": "False"
    },
    "dim.input": {
        "d1": {
            "type": "umbrella",
            "number_of_replicas": "8",
            "min_us_param": "45.0",
            "max_us_param": "360.0"
        },
        "d2": {
            "type": "temperature",
            "number_of_replicas": "11",
            "min_temperature": "273.0",
            "max_temperature": "373.0"
        },
        "d3": {
            "type": "umbrella",
            "number_of_replicas": "8",
            "min_us_param": "45.0",
            "max_us_param": "360.0"
        }
    }
}

antonst commented 8 years ago

How many cores are you requesting for pilot? Do you happen to have a terminal output somewhere?

haoyuanchen commented 8 years ago

How many cores are you requesting for pilot?

I have 704 replicas and I requested 768 cores.
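As a quick sanity check on these numbers, the replica count is the product of number_of_replicas over the three dimensions in dim.input (8 x 11 x 8). A minimal sketch, assuming one core per replica as set by replica_cores; the helper name total_replicas is hypothetical and not part of RepEx:

```python
import json
from functools import reduce

# Only the fields needed for the count, copied from the input file above.
DIM_INPUT = json.loads("""
{
  "d1": {"type": "umbrella",    "number_of_replicas": "8"},
  "d2": {"type": "temperature", "number_of_replicas": "11"},
  "d3": {"type": "umbrella",    "number_of_replicas": "8"}
}
""")

def total_replicas(dim_input):
    """Multiply number_of_replicas over all dimensions (hypothetical helper)."""
    counts = [int(d["number_of_replicas"]) for d in dim_input.values()]
    return reduce(lambda a, b: a * b, counts, 1)

replicas = total_replicas(DIM_INPUT)
replica_cores = 1    # "replica_cores": "1" in remd.input
pilot_cores = 768    # cores requested for the pilot, as stated in this thread

# True means every replica can run at the same time within the pilot.
print(replicas, replicas * replica_cores <= pilot_cores)
```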

Do you happen to have a terminal output somewhere?

For the first MD cycle, all the CUs finished normally, the outputs are normal, such as

2016-04-19 07:20:00,474: radical.repex : MainProcess : Thread-3 : INFO : ComputeUnit 'unit.000132' state changed to AgentStagingOutput.
2016-04-19 07:20:03,980: radical.repex : MainProcess : Thread-3 : INFO : ComputeUnit 'unit.000132' state changed to Done.

After the first MD cycle (all the CUs are in the "Done" state), no additional output was generated (I didn't turn on VERBOSE=debug, but I guess that if an error had occurred, something would have been printed).
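One way to confirm that every CU really reached "Done" is to scan the terminal log for state-change lines in the format shown above. This is only a sketch: the helper names are hypothetical, the set of final state names ("Done", "Failed", "Canceled") is an assumption, and the third sample line is made up for illustration.

```python
import re

# Matches lines such as:
# ... INFO : ComputeUnit 'unit.000132' state changed to Done.
STATE_RE = re.compile(
    r"ComputeUnit '(?P<uid>unit\.\d+)' state changed to (?P<state>\w+)\."
)

def last_states(log_text):
    """Return {unit id: last state seen in the log} (hypothetical helper)."""
    states = {}
    for match in STATE_RE.finditer(log_text):
        states[match.group("uid")] = match.group("state")
    return states

def unfinished(log_text, final_states=("Done", "Failed", "Canceled")):
    """List units whose last reported state is not final (assumed names)."""
    return sorted(uid for uid, state in last_states(log_text).items()
                  if state not in final_states)

sample = (
    "2016-04-19 07:20:00,474: radical.repex : MainProcess : Thread-3 : INFO : "
    "ComputeUnit 'unit.000132' state changed to AgentStagingOutput.\n"
    "2016-04-19 07:20:03,980: radical.repex : MainProcess : Thread-3 : INFO : "
    "ComputeUnit 'unit.000132' state changed to Done.\n"
    # Hypothetical line, not from the real log, to show a straggler.
    "2016-04-19 07:20:04,100: radical.repex : MainProcess : Thread-3 : INFO : "
    "ComputeUnit 'unit.000133' state changed to AgentStagingOutput.\n"
)
print(unfinished(sample))
```

An empty list would mean that all units seen in the log reported a final state.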

antonst commented 8 years ago

I can't reproduce this problem. I have added a couple of print statements in the code and it seems to work correctly. I don't really understand what caused this problem.

antonst commented 8 years ago

I am using exactly your input file, except for the number of time steps, which is set to 100.

haoyuanchen commented 8 years ago

I can't reproduce this problem. I have added a couple of print statements in the code and it seems to work correctly. I don't really understand what caused this problem.

It is strange. I'll try to re-run it too. Did you see any exchanges in that simulation, by the way?

antonst commented 8 years ago

On which machine are you running?

haoyuanchen commented 8 years ago

stampede

antonst commented 8 years ago

That is what I thought. OK, I have two possible explanations for this. First, some of the replicas never actually reach the 'Done' state, that is, they never finish execution. Can this happen for this type of simulation? Second, there is a problem with a wait call: it does not return even though all CUs have finished.

antonst commented 8 years ago

I also see the simulation getting stuck now, but this does not happen during the first cycle.

antonst commented 8 years ago

Speaking of exchanges, I see exchanges in the temperature dimension, but not in the umbrella dimension. This might be attributed to the small number of time steps, though.

antonst commented 8 years ago

By the way, I tried to run at a smaller scale, and with up to 256 replicas I don't see this error happening. Will investigate more.

antonst commented 8 years ago

I have double checked: all CUs have finished their execution and generated their matrix_column file, but RP's um.wait_units() call does not return. In the terminal log, about half of the CUs are not reported as finished by RP. Nothing fails and no errors are reported. Also, according to the callbacks, only about half of the replicas got to the AgentStagingOutput state.
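A wait that never returns can be made easier to diagnose by bounding it with a timeout and reporting which units are still not final. The sketch below is generic polling code, not RP's implementation: get_states is a hypothetical callable that returns the current state of every unit, and the final state names are assumptions.

```python
import time

FINAL_STATES = {"Done", "Failed", "Canceled"}  # assumed final state names

def wait_for_units(get_states, timeout, poll_interval=1.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll until every unit is final or the timeout expires.

    get_states: hypothetical callable returning {unit id: state}.
    Returns the sorted list of units still not final (empty on success).
    """
    deadline = clock() + timeout
    while True:
        pending = sorted(uid for uid, state in get_states().items()
                         if state not in FINAL_STATES)
        if not pending or clock() >= deadline:
            return pending
        sleep(poll_interval)

# Toy usage with an in-memory state table instead of a real unit manager.
states = {"unit.000001": "Done", "unit.000002": "AgentStagingOutput"}
stragglers = wait_for_units(lambda: states, timeout=0.0)
print(stragglers)
```

With a real backend, the returned list would name the units to inspect instead of leaving the run blocked with no output.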

antonst commented 8 years ago

I think this error was caused by MongoDB. I have cleaned and repaired the DB now. Using your input file I was able to perform 10 cycles with 704 replicas on Stampede without any issues. Please try to re-run. I would also strongly encourage you to create your own mongo-lab account.

haoyuanchen commented 8 years ago

I think this error was caused by MongoDB. I have cleaned and repaired the DB now. Using your input file I was able to perform 10 cycles with 704 replicas on Stampede without any issues. Please try to re-run. I would also strongly encourage you to create your own mongo-lab account.

Thanks! I'll try to create my own mongo-lab account.

haoyuanchen commented 8 years ago

I was able to perform 10 cycles with 704 replicas on Stampede without any issues.

How long was the simulation? In my re-run, everything was fine until the 5th cycle, where I got the following error, which also looks related to MongoDB:

Exception in thread OutputFileTransferWorker-1:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/home/haoyuan/myenv1/lib/python2.7/site-packages/radical.pilot-0.40.1-py2.7.egg/radical/pilot/controller/output_file_transfer_worker.py", line 90, in run
    "timestamp": ts}
  File "build/bdist.linux-x86_64/egg/pymongo/collection.py", line 1738, in find_and_modify
    _kwargs)
  File "build/bdist.linux-x86_64/egg/pymongo/database.py", line 439, in command
    uuid_subtype, compile_re, _kwargs)[0]
  File "build/bdist.linux-x86_64/egg/pymongo/database.py", line 345, in _command
    msg, allowable_errors)
  File "build/bdist.linux-x86_64/egg/pymongo/helpers.py", line 182, in _check_command_response
    raise OperationFailure(msg % errmsg, code, response)
OperationFailure: command SON([('findAndModify', u'rp.session.prot-55-247.rutgers.edu.haoyuan.016910.0000.cu'), ('query', {'control': 'agent', 'state': 'PendingOutputStaging', 'unitmanager': 'umgr.0000'}), ('update', {'$set': {'control': 'umgr', 'state': 'StagingOutput'}, '$push': {'statehistory': {'timestamp': 1461206133.492433, 'state': 'StagingOutput'}}})]) on namespace cdi-testing.$cmd failed: exception: executor returned DEAD while finding document to update

Exception in thread OutputFileTransferWorker-2:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/home/haoyuan/myenv1/lib/python2.7/site-packages/radical.pilot-0.40.1-py2.7.egg/radical/pilot/controller/output_file_transfer_worker.py", line 90, in run
    "timestamp": ts}
  File "build/bdist.linux-x86_64/egg/pymongo/collection.py", line 1738, in find_and_modify
    _kwargs)
  File "build/bdist.linux-x86_64/egg/pymongo/database.py", line 439, in command
    uuid_subtype, compile_re, _kwargs)[0]
  File "build/bdist.linux-x86_64/egg/pymongo/database.py", line 345, in _command
    msg, allowable_errors)
  File "build/bdist.linux-x86_64/egg/pymongo/helpers.py", line 182, in _check_command_response
    raise OperationFailure(msg % errmsg, code, response)
OperationFailure: command SON([('findAndModify', u'rp.session.prot-55-247.rutgers.edu.haoyuan.016910.0000.cu'), ('query', {'control': 'agent', 'state': 'PendingOutputStaging', 'unitmanager': 'umgr.0000'}), ('update', {'$set': {'control': 'umgr', 'state': 'StagingOutput'}, '$push': {'statehistory': {'timestamp': 1461206133.458572, 'state': 'StagingOutput'}}})]) on namespace cdi-testing.$cmd failed: exception: executor returned DEAD while finding document to update

Exception in thread InputFileTransferWorker-1:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/home/haoyuan/myenv1/lib/python2.7/site-packages/radical.pilot-0.40.1-py2.7.egg/radical/pilot/controller/input_file_transfer_worker.py", line 92, in run
    "$push": {"statehistory": {"state": STAGING_INPUT, "timestamp": ts}}}
  File "build/bdist.linux-x86_64/egg/pymongo/collection.py", line 1738, in find_and_modify
    _kwargs)
  File "build/bdist.linux-x86_64/egg/pymongo/database.py", line 439, in command
    uuid_subtype, compile_re, _kwargs)[0]
  File "build/bdist.linux-x86_64/egg/pymongo/database.py", line 345, in _command
    msg, allowable_errors)
  File "build/bdist.linux-x86_64/egg/pymongo/helpers.py", line 182, in _check_command_response
    raise OperationFailure(msg % errmsg, code, response)
OperationFailure: command SON([('findAndModify', u'rp.session.prot-55-247.rutgers.edu.haoyuan.016910.0000.cu'), ('query', {'state': 'PendingInputStaging', 'unitmanager': 'umgr.0000'}), ('update', {'$set': {'state': 'StagingInput'}, '$push': {'statehistory': {'timestamp': 1461206133.458965, 'state': 'StagingInput'}}})]) on namespace cdi-testing.$cmd failed: exception: executor returned DEAD while finding document to update

Exception in thread InputFileTransferWorker-2:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/home/haoyuan/myenv1/lib/python2.7/site-packages/radical.pilot-0.40.1-py2.7.egg/radical/pilot/controller/input_file_transfer_worker.py", line 92, in run
    "$push": {"statehistory": {"state": STAGING_INPUT, "timestamp": ts}}}
  File "build/bdist.linux-x86_64/egg/pymongo/collection.py", line 1738, in find_and_modify
    _kwargs)
  File "build/bdist.linux-x86_64/egg/pymongo/database.py", line 439, in command
    uuid_subtype, compile_re, _kwargs)[0]
  File "build/bdist.linux-x86_64/egg/pymongo/database.py", line 345, in _command
    msg, allowable_errors)
  File "build/bdist.linux-x86_64/egg/pymongo/helpers.py", line 182, in _check_command_response
    raise OperationFailure(msg % errmsg, code, response)
OperationFailure: command SON([('findAndModify', u'rp.session.prot-55-247.rutgers.edu.haoyuan.016910.0000.cu'), ('query', {'state': 'PendingInputStaging', 'unitmanager': 'umgr.0000'}), ('update', {'$set': {'state': 'StagingInput'}, '$push': {'statehistory': {'timestamp': 1461206133.458308, 'state': 'StagingInput'}}})]) on namespace cdi-testing.$cmd failed: exception: executor returned DEAD while finding document to update

Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/home/haoyuan/myenv1/lib/python2.7/site-packages/radical.pilot-0.40.1-py2.7.egg/radical/pilot/controller/unit_manager_controller.py", line 236, in run
    unit_list = self._dbs.get_compute_units(unit_manager_id=self.uid)
  File "/home/haoyuan/myenv1/lib/python2.7/site-packages/radical.pilot-0.40.1-py2.7.egg/radical/pilot/db/database.py", line 428, in get_compute_units
    for obj in cursor:
  File "build/bdist.linux-x86_64/egg/pymongo/cursor.py", line 1076, in next
    if len(self.data) or self._refresh():
  File "build/bdist.linux-x86_64/egg/pymongo/cursor.py", line 1037, in _refresh
    limit, self.id))
  File "build/bdist.linux-x86_64/egg/pymongo/cursor.py", line 958, in __send_message
    self.compile_re)
  File "build/bdist.linux-x86_64/egg/pymongo/helpers.py", line 113, in _unpack_response
    error_object)
OperationFailure: database error: collection dropped between getMore calls

antonst commented 8 years ago

From where are you submitting your run? Are you using your own db? I checked mine and it is still empty after a clean-up I made yesterday.

antonst commented 8 years ago

Also, a simulation restart feature is now available in devel. If you want to start from where a previous run stopped, add:

"restart": "True",
"restart_file": "simulation_objects_1_3.pkl"

to the simulation input file. Here restart_file points to the latest file with simulation info that was generated during the failed run. The first index is the cycle and the second index is the dimension.

haoyuanchen commented 8 years ago

From where are you submitting your run? Are you using your own db? I checked mine and it is still empty after a clean-up I made yesterday.

I was submitting from my desktop using your mongo db (I just applied for my own and I'm still learning how to use it). The error actually occurred around 10 pm yesterday.

haoyuanchen commented 8 years ago

Here restart_file points to the latest file with simulation info that was generated during the failed run. The first index is the cycle and the second index is the dimension.

Which exact file do you mean?

antonst commented 8 years ago

Which exact file do you mean?

During the simulation, simulation_objects_x_x.pkl files are now generated in the directory from which you submit your run.

antonst commented 8 years ago

After the simulation is done, these, together with the pairs_for_exchange files, are moved to the simulation_output directory.
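Picking the restart file by hand is easy to get wrong when many simulation_objects_x_x.pkl files exist. A minimal sketch of one way to select it, assuming that "latest" means the highest cycle index and, within that cycle, the highest dimension index; the function names are hypothetical and not part of RepEx:

```python
import re
from pathlib import Path

# simulation_objects_<cycle>_<dimension>.pkl, as described in this thread.
NAME_RE = re.compile(r"^simulation_objects_(\d+)_(\d+)\.pkl$")

def latest_restart_file(names):
    """Pick the name with the highest (cycle, dimension) pair, or None."""
    best = None
    for name in names:
        match = NAME_RE.match(name)
        if match:
            key = (int(match.group(1)), int(match.group(2)))
            if best is None or key > best[0]:
                best = (key, name)
    return best[1] if best else None

def latest_in_dir(directory="."):
    """Apply the same rule to the files in the submission directory."""
    return latest_restart_file(p.name for p in Path(directory).iterdir())

names = [
    "simulation_objects_0_3.pkl",
    "simulation_objects_1_1.pkl",
    "simulation_objects_1_3.pkl",
    "README.txt",  # unrelated file, ignored by the pattern
]
print(latest_restart_file(names))
```

The indices are compared as integers, so cycle 10 sorts after cycle 9.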

haoyuanchen commented 8 years ago

I'm now using my own MongoDB and the job seems to be running normally (it has not crashed so far). However, I'm still not seeing exchanges in the umbrella sampling dimension in either TSU or TUU (unless an extremely small force constant is used).

haoyuanchen commented 8 years ago

To test the exchange codes, I'm currently trying to run replica exchange with Amber using different force constants and see the exchange rates.

haoyuanchen commented 8 years ago

To test the exchange codes, I'm currently trying to run replica exchange with Amber using different force constants and see the exchange rates.

For a 2D umbrella sampling run with 8*8 replicas and a force constant of 6.56, there are only a few exchanges. With 0.656, there are lots of exchanges. @antonst In the large-scale simulation you've done, what force constant were you using? And how many exchanges did you observe?
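The dependence on the force constant follows from the Metropolis criterion for swapping two umbrella windows at the same temperature: only the restraint energies change, so the acceptance probability is min(1, exp(-beta * delta)), where delta is the change in total restraint energy after the swap. The sketch below is not the RepEx exchange code. It assumes a harmonic restraint of the form k * d^2 with d in radians and k in kcal/(mol rad^2), two neighboring windows 45 degrees apart, and each replica sitting exactly at its own window center.

```python
import math

KB = 0.0019872041  # Boltzmann constant in kcal/(mol K)

def wrap_deg(delta):
    """Wrap an angle difference in degrees into [-180, 180)."""
    return (delta + 180.0) % 360.0 - 180.0

def restraint_energy(k, center_deg, x_deg):
    """Harmonic restraint k * d^2 with d in radians (assumed form)."""
    d = math.radians(wrap_deg(x_deg - center_deg))
    return k * d * d

def acceptance(k, temp, center_i, center_j, x_i, x_j):
    """Metropolis probability of swapping two windows at one temperature."""
    beta = 1.0 / (KB * temp)
    before = (restraint_energy(k, center_i, x_i)
              + restraint_energy(k, center_j, x_j))
    after = (restraint_energy(k, center_i, x_j)
             + restraint_energy(k, center_j, x_i))
    return min(1.0, math.exp(-beta * (after - before)))

# Compare the two force constants mentioned in this thread at 300 K.
for k in (6.56, 0.656):
    print(k, acceptance(k, 300.0, 45.0, 90.0, x_i=45.0, x_j=90.0))
```

Under these assumptions the stiffer restraint gives a much smaller acceptance probability, which matches the qualitative trend reported above; real rates also depend on how far the sampled coordinates drift from the window centers.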

antonst commented 8 years ago

what's the force constant you were using?

I was using 0.656

how many exchanges did you observe?

I don't remember the exact rate, but definitely more than 60%

haoyuanchen commented 8 years ago

I was using 0.656

I don't remember the exact rate, but definitely more than 60%

Good. I'll try the same thing for TSU and see what happens.

haoyuanchen commented 8 years ago

In the TSU test, I'm not seeing exchanges using either 0.2 or 0.02 as force constants. However, in the Amber run, there are some exchanges with 0.2 and a lot of exchanges with 0.02. This suggests that there are some problems with the exchange in TSU. Since the case is kind of complicated now, I'll briefly summarize it here (correct/incorrect means the exchange rate is/isn't similar to that of the Amber run):

Type	Exchange with old codes	Exchange with new codes
TUU	Incorrect	Correct
TSU	Correct	Incorrect

If the same exchange codes were used for TSU and TUU, then there might be problems elsewhere. As a temporary solution and test, I think we might want to use the simple neighbor exchange scheme for now, which shouldn't be too hard to implement. @taisung @antonst
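For reference, a simple neighbor exchange scheme usually alternates between two sets of adjacent pairs from one cycle to the next. The sketch below shows only the pairing step; the function name is hypothetical, replicas are assumed to be indexed in order of their window parameter within one dimension, and no periodic wrap-around between the first and last window is attempted.

```python
def neighbor_pairs(n_replicas, cycle):
    """Pairs (i, i+1) of adjacent windows to attempt in this cycle.

    Even cycles try (0,1), (2,3), ...; odd cycles try (1,2), (3,4), ...
    """
    start = cycle % 2
    return [(i, i + 1) for i in range(start, n_replicas - 1, 2)]

print(neighbor_pairs(8, 0))  # [(0, 1), (2, 3), (4, 5), (6, 7)]
print(neighbor_pairs(8, 1))  # [(1, 2), (3, 4), (5, 6)]
```

Each pair would then be accepted or rejected independently with the usual Metropolis test, so no exchange matrix over all replicas is needed.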

antonst commented 8 years ago

I assume "new code" is the devel branch. Which branch is "old code"?

haoyuanchen commented 8 years ago

I assume "new code" is the devel branch. Which branch is "old code"?

New code is perfopt_gen, and old code is the code from last year that I used to generate the data for the first version of my manuscript (I believe the name was tuu_opt5). I've actually tried to revert to tuu_opt5, but it doesn't work because RP has also changed quite a lot since then.

antonst commented 8 years ago

Can you also share simulation input files (together with expected exchange rates) you have used to compose that table?

antonst commented 8 years ago

I believe the name was tuu_opt5

If you run git branch, it will give you the name of the branch you are using.

haoyuanchen commented 8 years ago

If you run git branch, it will give you the name of the branch you are using.

Yes, but that was what I used to generate all the data for the first version of my manuscript last year, and I'm not using it now.

haoyuanchen commented 8 years ago

Can you also share simulation input files (together with expected exchange rates) you have used to compose that table?

I'll do that and send them to you by email.

antonst commented 8 years ago

I've actually tried to revert to tuu_opt5, but it doesn't work because RP has also changed quite a lot since then

You can always install an older version of RP, so this should not be a problem:

virtualenv $HOME/ve; source $HOME/ve/bin/activate
mkdir repex-exp
cd repex-exp 
wget https://pypi.python.org/packages/source/r/radical.pilot/radical.pilot-0.35.tar.gz 
tar -zxvf radical.pilot-0.35.tar.gz 
cd radical.pilot-0.35 
pip install . 
cd .. 
git clone https://github.com/radical-cybertools/radical.repex.git 
cd radical.repex
git checkout feature/tuu_opt5
python setup.py install 
cd examples/amber

antonst commented 8 years ago

I have compared the code for U exchange in the devel and feature/perf_opt5 branches (for both TUU and TSU). The changes I have made are cosmetic and have no effect on the result of the exchange algorithm. This means that the exchange rate problem was always there and was not introduced during refactoring. I guess that to fix this, the algorithm should be modified, but I am not sure how I can help here.

antonst commented 8 years ago

I have also double checked the grouping of replicas and the group structure and did not find any anomalies.

antonst commented 8 years ago

Considering the title of this ticket, would it be fair to assume that this issue was resolved?

haoyuanchen commented 8 years ago

Considering the title of this ticket, would it be fair to assume that this issue was resolved?

Yes.

radical-cybertools / radical.repex.at

TUU stuck in first cycle #79