radical-cybertools / radical.repex.at

This is the GitHub location for RepEx, developed by the RADICAL team in conjunction with the York Lab.

Repex-Amber fails on Comet #58

Closed iparask closed 8 years ago

iparask commented 8 years ago

I get this error in the debug file:

2015-10-18 18:21:53,517: radical.repex       : MainProcess                     : Thread-3       : ERROR   : Log: {'log': [<radical.pilot.logentry.Logentry object at 0x7f34894bafd0>, <radical.pilot.logentry.Logentry object at 0x7f34894ba610>], 'state': u'Failed', 'working_directory': u'sftp://comet.sdsc.xsede.org/home/iparask/radical.pilot.sandbox/rp.session.iparask-virtual-machine.iparask.016726.0000-pilot.0000//unit.000012', 'uid': 'unit.000012', 'submission_time': 1445206909.81107, 'execution_details': {u'control': u'agent', u'stdout': u'', u'Agent_Output_Directives': [], u'Agent_Output_Status': None, u'exec_locs': None, u'FTW_Input_Directives': [], u'log': [{u'timestamp': 1445206910.02961, u'message': u'Scheduled for data transfer to ComputePilot pilot.0000.'}, {u'timestamp': 1445206910.903352, u'message': u'push unit to agent after ftw staging'}], u'exit_code': 1, u'FTW_Input_Status': None, u'state': u'Failed', u'unitmanager': u'umgr.0000', u'statehistory': [{u'timestamp': 1445206909.81107, u'state': u'Scheduling'}, {u'timestamp': 1445206910.867602, u'state': u'StagingInput'}, {u'timestamp': 1445206910.867602, u'state': u'AgentStagingInputPending'}, {u'timestamp': 1445206911.066092, u'state': u'AgentStagingInputPending'}, {u'timestamp': 1445206911.138986, u'state': u'AgentStagingInput'}, {u'timestamp': 1445206911.140634, u'state': u'AllocatingPending'}, {u'timestamp': 1445206911.142342, u'state': u'Allocating'}, {u'timestamp': 1445206911.144213, u'state': u'ExecutingPending'}, {u'timestamp': 1445206911.14602, u'state': u'Executing'}, {u'timestamp': 1445206912.616244, u'state': u'AgentStagingOutputPending'}, {u'timestamp': 1445206912.618308, u'state': u'AgentStagingOutput'}, {u'timestamp': 1445206912.620143, u'state': u'Failed'}], u'pilot': u'pilot.0000', u'FTW_Output_Directives': [{u'target': u'pairs_for_exchange_0.dat', u'priority': 0, u'source': u'pairs_for_exchange_0.dat', u'state': u'Pending', u'flags': [u'CreateParents', u'SkipFailed'], u'action': 
u'Transfer'}], u'pilot_sandbox': u'sftp://comet.sdsc.xsede.org/home/iparask/radical.pilot.sandbox/rp.session.iparask-virtual-machine.iparask.016726.0000-pilot.0000/', u'description': {u'kernel': None, u'executable': u'python', u'name': None, u'restartable': False, u'stdout': None, u'output_staging': [{u'action': u'Transfer', u'source': u'pairs_for_exchange_0.dat', u'flags': [u'CreateParents', u'SkipFailed'], u'target': u'pairs_for_exchange_0.dat', u'priority': 0}], u'pre_exec': [u'module load python', u'module load mpi4py/1.3.1', u'module load amber'], u'mpi': True, u'environment': None, u'cleanup': False, u'arguments': [u'global_ex_calculator_mpi.py', u'0', u'12', u'ala10_remd'], u'stderr': None, u'cores': 12, u'post_exec': None, u'input_staging': [{u'action': u'Copy', u'source': u'staging:///global_ex_calculator.py', u'flags': [u'CreateParents', u'SkipFailed'], u'target': u'global_ex_calculator.py', u'priority': 0}]}, u'restartable': False, u'started': None, u'FTW_Output_Status': u'New', u'finished': None, u'Agent_Input_Directives': [{u'target': u'global_ex_calculator.py', u'priority': 0, u'source': u'staging:///global_ex_calculator.py', u'state': u'Pending', u'flags': [u'CreateParents', u'SkipFailed'], u'action': u'Copy'}], u'Agent_Input_Status': u'Pending', u'submitted': 1445206909.81107, u'sandbox': u'sftp://comet.sdsc.xsede.org/home/iparask/radical.pilot.sandbox/rp.session.iparask-virtual-machine.iparask.016726.0000-pilot.0000//unit.000012', u'stderr': u"[... 
CONTENT SHORTENED ...]\ness (rank: 8, pid: 19777) exited with status 2\npython: can't open file 'global_ex_calculator_mpi.py': [Errno 2] No such file or directory\npython: can't open file 'global_ex_calculator_mpi.py': [Errno 2] No such file or directory\npython: can't open file 'global_ex_calculator_mpi.py': [Errno 2] No such file or directory\npython: can't open file 'global_ex_calculator_mpi.py': [Errno 2] No such file or directory\npython: can't open file 'global_ex_calculator_mpi.py': [Errno 2] No such file or directory\n[comet-10-08.sdsc.edu:mpispawn_0][child_handler] MPI process (rank: 7, pid: 19776) exited with status 2\n[comet-10-08.sdsc.edu:mpispawn_0][child_handler] MPI process (rank: 5, pid: 19774) exited with status 2\n[comet-10-08.sdsc.edu:mpispawn_0][child_handler] MPI process (rank: 9, pid: 19778) exited with status 2\n[comet-10-08.sdsc.edu:mpispawn_0][child_handler] MPI process (rank: 11, pid: 19780) exited with status 2\n[comet-10-08.sdsc.edu:mpispawn_0][child_handler] MPI process (rank: 10, pid: 19779) exited with status 2\n", u'_id': u'unit.000012'}, 'stop_time': None, 'start_time': None, 'exit_code': 1, 'name': None}
2015-10-18 18:21:53,690: radical.repex       : MainProcess                     : MainThread     : ERROR   : Exchange step failed for unit:  unit.000012
2015-10-18 18:21:53,691: radical.repex       : MainProcess                     : MainThread     : INFO    : Exchanging replica configurations. cycle 1
Traceback (most recent call last):
  File "/home/iparask/TestRepo/RepexTesting/bin/repex-amber", line 58, in <module>
    pilot_kernel.run_simulation( replicas, pilot_object, session, md_kernel )
  File "/home/iparask/TestRepo/RepexTesting/local/lib/python2.7/site-packages/pilot_kernels/pilot_kernel_pattern_s.py", line 296, in run_simulation
    md_kernel.do_exchange(current_cycle, replicas)
  File "/home/iparask/TestRepo/RepexTesting/local/lib/python2.7/site-packages/amber_kernels_tex/kernel_pattern_s_tex.py", line 468, in do_exchange
    f = open(infile)
IOError: [Errno 2] No such file or directory: 'pairs_for_exchange_0.dat'

What else do you need?
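
One detail in the unit description above may be worth checking: the unit runs `global_ex_calculator_mpi.py`, while the only input-staging directive copies `global_ex_calculator.py`. The sketch below is a hypothetical helper (`missing_scripts` is not part of RepEx or RADICAL-Pilot) that compares the two lists. The dictionary is a hand-copied subset of the logged `description` entry, and treating every `.py` argument as a script that must be staged is an assumption.

```python
import os


def missing_scripts(description):
    """Return .py arguments that no input-staging directive delivers.

    Hypothetical helper: it only compares file names and knows nothing
    about files that may already exist in the unit sandbox.
    """
    staged = {
        os.path.basename(directive["target"])
        for directive in description.get("input_staging") or []
    }
    scripts = [
        arg for arg in description.get("arguments") or []
        if arg.endswith(".py")
    ]
    return [s for s in scripts if os.path.basename(s) not in staged]


# Values copied by hand from the 'description' entry of unit.000012.
description = {
    "executable": "python",
    "arguments": ["global_ex_calculator_mpi.py", "0", "12", "ala10_remd"],
    "input_staging": [
        {
            "action": "Copy",
            "source": "staging:///global_ex_calculator.py",
            "target": "global_ex_calculator.py",
        }
    ],
}

print(missing_scripts(description))
```

If the list is non-empty, the `python: can't open file` messages in stderr and the missing `pairs_for_exchange_0.dat` would both follow from the script never reaching the sandbox. This is one reading of the log, not a confirmed diagnosis.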

antonst commented 8 years ago

Can you please re-run this example? Also, are you using the input file from the repex.examples repo?

iparask commented 8 years ago

Re-install and re-run?

I am using the command you provided in the testing protocol.

antonst commented 8 years ago

Which RepEx version does this report?

antonst commented 8 years ago

And is this with exchange_mpi = False?

iparask commented 8 years ago

Sorry, I do not remember. Whatever it was on Sunday. I have now updated it, so it is 0.2.6.

iparask commented 8 years ago

This is my configuration:

{
    "remd.input": {
        "re_pattern": "S",
        "exchange": "T-REMD",
        "number_of_cycles": "3",
        "number_of_replicas": "12",
        "input_folder": "t_remd_inputs",
        "input_file_basename": "ala10_remd",
        "amber_input": "ala10.mdin",
        "amber_parameters": "ala10.prmtop",
        "amber_coordinates": "ala10_minimized.inpcrd",
        "replica_mpi": "False",
        "replica_cores": "1",
        "exchange_mpi": "False",
        "min_temperature": "300",
        "max_temperature": "600",
        "steps_per_cycle": "4000",
        "download_mdinfo": "True",
        "download_mdout" : "True"
    }
}

and

{
    "target": {
        "resource": "xsede.comet",
        "username" : "iparask",
        "project" : "<correct_number>",
        "runtime" : "60",
        "cleanup" : "False",
        "cores" : "12"
    }
}
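
Both files above store every value as a string, including booleans and integers. As a minimal sketch, assuming only the `target` layout shown in this comment, the following Python reads such a file into typed values and flags a project entry that still looks like a placeholder. The function names are hypothetical, and nothing here reflects how RepEx itself parses its input.

```python
import json

# Resource file from the comment above, with the project left as a placeholder.
RESOURCE_JSON = """
{
    "target": {
        "resource": "xsede.comet",
        "username": "iparask",
        "project": "<correct_number>",
        "runtime": "60",
        "cleanup": "False",
        "cores": "12"
    }
}
"""


def to_bool(text):
    """Convert the strings 'True' / 'False' to bool; reject anything else."""
    lowered = text.strip().lower()
    if lowered not in ("true", "false"):
        raise ValueError("expected 'True' or 'False', got %r" % text)
    return lowered == "true"


def load_target(raw):
    """Parse the 'target' section into typed values (hypothetical helper)."""
    target = json.loads(raw)["target"]
    return {
        "resource": target["resource"],
        "username": target["username"],
        "project": target["project"],
        "runtime": int(target["runtime"]),
        "cleanup": to_bool(target["cleanup"]),
        "cores": int(target["cores"]),
    }


def looks_like_placeholder(project):
    """True if the project is empty or still an angle-bracket placeholder."""
    return not project.strip() or project.strip().startswith("<")


target = load_target(RESOURCE_JSON)
print(target["cores"], target["cleanup"], looks_like_placeholder(target["project"]))
```

A check of this kind makes it easy to confirm that a redacted file shared in an issue is not the one actually submitted.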

antonst commented 8 years ago

If you specify the project, it should be fine.

iparask commented 8 years ago

The project is specified. I will not put the project number on GitHub.

iparask commented 8 years ago

And it failed again with the new run. Here is the debug file: https://gist.github.com/iparask/19ab5ed60cc5233472af

iparask commented 8 years ago

And just in case, here are the debug messages from the first run: https://gist.github.com/iparask/ceea47e4e2daac7c4806

antonst commented 8 years ago

Can you please try to re-run using the devel branch and this resource input file:

{
    "target": {
        "resource": "xsede.comet",
        "username" : "octocat",
        "project" : "toxic-crusaders",
        "queue" : "compute",
        "runtime" : "60",
        "cleanup" : "False",
        "cores" : "48"
    }
}

and this simulation input file:

{
    "remd.input": {
        "re_pattern": "S",
        "exchange": "T-REMD",
        "number_of_cycles": "3",
        "input_folder": "t_remd_inputs",
        "input_file_basename": "ala10_remd",
        "amber_input": "ala10.mdin",
        "amber_parameters": "ala10.prmtop",
        "amber_coordinates_folder": "ala10_coors",
        "same_coordinates": "True",
        "replica_mpi": "False",
        "replica_cores": "1",
        "min_temperature": "300",
        "max_temperature": "600",
        "steps_per_cycle": "4000",
        "download_mdinfo": "True",
        "download_mdout" : "True"
    },
    "dim.input": {
        "d1": {
            "type" : "temperature",
            "number_of_replicas": "8",
            "min_temperature": "300.0",
            "max_temperature": "302.0"
        }
    }
}

P.S. It worked for me.

antonst commented 8 years ago

Closed due to lack of response.

iparask commented 8 years ago

I do not even remember what to run. Please specify the test or give me a link to the testing protocol.

antonst commented 8 years ago

export RADICAL_REPEX_VERBOSE=info
export RADICAL_PILOT_VERBOSE=info
git clone https://github.com/radical-cybertools/radical.repex.git
cd radical.repex
git checkout devel
python setup.py install
cd examples/amber
repex-amber --input='tuu_remd_ace_ala_nme.json' --rconfig='comet.json'

You only need to modify comet.json.
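
For anyone scripting that last step, here is a minimal sketch of filling in the account fields. It assumes comet.json follows the same `target` layout as the resource file posted earlier in this thread; the real file in `examples/amber` may differ. `set_account`, `my_user`, and `my_allocation` are hypothetical names used only for illustration.

```python
import copy
import json


def set_account(config, username, project):
    """Return a copy of a resource config with the account fields replaced."""
    updated = copy.deepcopy(config)
    updated["target"]["username"] = username
    updated["target"]["project"] = project
    return updated


# Stand-in for the contents of comet.json, taken from the resource file
# posted earlier in this thread; the shipped file may not match it.
template = {
    "target": {
        "resource": "xsede.comet",
        "username": "octocat",
        "project": "toxic-crusaders",
        "queue": "compute",
        "runtime": "60",
        "cleanup": "False",
        "cores": "48",
    }
}

mine = set_account(template, "my_user", "my_allocation")
print(json.dumps(mine, indent=4))
```

Writing the result back with `json.dump(mine, fh, indent=4)` on an open file handle would replace the file on disk. The sketch prints instead so that no allocation number is written anywhere by accident.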