radical-cybertools / radical.repex.at

This is the github location for RepEx developed by the RADICAL team in conjunction with the York Lab.

Input transfer failed error in RepEx #31

Closed taisung closed 8 years ago

taisung commented 9 years ago
2015:06:10 10:53:09 18579  Thread-3     radical.pilot         : [INFO    ] RUN ComputeUnit 'unit.000342' state changed from 'StagingOutput' to 'Done'.
2015:06:10 10:53:09 18579  Thread-3     radical.repex.pk-patternB-multiD: [INFO    ] ComputeUnit 'unit.000342' state changed to Done.
2015:06:10 10:53:09 18579  Thread-3     radical.pilot         : [INFO    ] RUN ComputeUnit 'unit.000343' state changed from 'StagingOutput' to 'Done'.
2015:06:10 10:53:09 18579  Thread-3     radical.repex.pk-patternB-multiD: [INFO    ] ComputeUnit 'unit.000343' state changed to Done.
2015:06:10 10:53:09 18579  Thread-3     radical.pilot         : [INFO    ] RUN ComputeUnit 'unit.000348' state changed from 'StagingInput' to 'Allocating'.
2015:06:10 10:53:09 18579  Thread-3     radical.repex.pk-patternB-multiD: [INFO    ] ComputeUnit 'unit.000348' state changed to Allocating.
2015:06:10 10:53:09 18579  Thread-3     radical.pilot         : [INFO    ] RUN ComputeUnit 'unit.000349' state changed from 'StagingInput' to 'PendingExecution'.
2015:06:10 10:53:09 18579  Thread-3     radical.repex.pk-patternB-multiD: [INFO    ] ComputeUnit 'unit.000349' state changed to PendingExecution.
2015:06:10 10:53:09 18579  InputFileTransferWorker-2 radical.pilot         : [DEBUG   ] read : [  100] [   81] (//media/sf_USB/Research_work/ActiveProjects/C 100%  266     0.3KB/s   00:00    \n)
2015:06:10 10:53:09 18579  InputFileTransferWorker-1 radical.pilot         : [DEBUG   ] write: [    9] [  122] (mkdir -p /work/00661/tg458185/radical.pilot.sandbox/rp.session.taisung-fedora.taisung.016596.0003-pilot.0000/unit.000378/\n)
2015:06:10 10:53:09 18579  InputFileTransferWorker-2 radical.pilot         : [DEBUG   ] read : [  100] [    6] (sftp> )
2015:06:10 10:53:09 18579  InputFileTransferWorker-2 radical.pilot         : [DEBUG   ] copy done: ['mput', 'Uploading', '//media/sf_USB/Research_work/ActiveProjects/C', 'sftp>']
2015:06:10 10:53:09 18579  InputFileTransferWorker-1 radical.pilot         : [DEBUG   ] read : [    9] [   10] (PROMPT-0->)
2015:06:10 10:53:09 18579  InputFileTransferWorker-1 radical.pilot         : [DEBUG   ] flush: [    9] [     ] (flush pty read cache)
2015:06:10 10:53:09 18579  InputFileTransferWorker-2 radical.pilot         : [ERROR   ] {'timestamp': datetime.datetime(2015, 6, 10, 14, 53, 9, 194275), 'message': 'Input transfer failed: cannot release object -- not managed'}
Traceback (most recent call last):
  File "/usr/people/taisung/myenv/lib/python2.7/site-packages/radical/pilot/controller/input_file_transfer_worker.py", line 188, in run
  File "/usr/people/taisung/myenv/lib/python2.7/site-packages/saga/filesystem/file.py", line 178, in close
    return self._adaptor.close ()
  File "/usr/people/taisung/myenv/lib/python2.7/site-packages/saga/adaptors/cpi/decorators.py", line 57, in wrap_function
    return sync_function (self, *args, **kwargs)
  File "/usr/people/taisung/myenv/lib/python2.7/site-packages/saga/adaptors/shell/shell_file.py", line 1079, in close
    self.finalize (kill=True)
  File "/usr/people/taisung/myenv/lib/python2.7/site-packages/saga/adaptors/shell/shell_file.py", line 1063, in finalize
    self.lm.release (self.local)
  File "/usr/people/taisung/myenv/lib/python2.7/site-packages/radical/utils/lease_manager.py", line 416, in release
    raise RuntimeError ("cannot release object -- not managed")
RuntimeError: cannot release object -- not managed
2015:06:10 10:53:09 18579  InputFileTransferWorker-2 radical.pilot         : [INFO    ] Creating ComputeUnit sandbox directory sftp://stampede.tacc.utexas.edu/work/00661/tg458185/radical.pilot.sandbox/rp.session.taisung-fedora.taisung.016596.0003-pilot.0000//unit.000379.
2015:06:10 10:53:09 18579  InputFileTransferWorker-2 radical.pilot         : [DEBUG   ] saga.fs.Directory ('sftp://stampede.tacc.utexas.edu/work/00661/tg458185/radical.pilot.sandbox/rp.session.taisung-fedora.taisung.016596.0003-pilot.0000//unit.000379')
2015:06:10 10:53:09 18579  InputFileTransferWorker-2 radical.pilot         : [DEBUG   ] flush: [   10] [     ] (flush pty read cache)
2015:06:10 10:53:09 18579  InputFileTransferWorker-1 radical.pilot         : [DEBUG   ] flush: [   92] [     ] (flush pty read cache)
2015:06:10 10:53:09 18579  InputFileTransferWorker-2 radical.pilot         : [DEBUG   ] write: [   10] [  146] (mkdir -p / && cd / &&  mkdir -p '/work/00661/tg458185/radical.pilot.sandbox/rp.session.taisung-fedora.taisung.016596.0003-pilot.0000/unit.000379'\n)
2015:06:10 10:53:09 18579  InputFileTransferWorker-2 radical.pilot         : [DEBUG   ] read : [   10] [   10] (PROMPT-0->)
2015:06:10 10:53:09 18579  InputFileTransferWorker-2 radical.pilot         : [INFO    ] Processing input file transfers for ComputeUnit unit.000379
2015:06:10 10:53:09 18579  Thread-3     radical.pilot         : [INFO    ] RUN ComputeUnit 'unit.000348' state changed from 'Allocating' to 'Executing'.
2015:06:10 10:53:09 18579  Thread-3     radical.repex.pk-patternB-multiD: [INFO    ] ComputeUnit 'unit.000348' state changed to Executing.
2015:06:10 10:53:09 18579  Thread-3     radical.pilot         : [INFO    ] RUN ComputeUnit 'unit.000349' state changed from 'PendingExecution' to 'Executing'.
2015:06:10 10:53:09 18579  Thread-3     radical.repex.pk-patternB-multiD: [INFO    ] ComputeUnit 'unit.000349' state changed to Executing.
2015:06:10 10:53:09 18579  Thread-3     radical.pilot         : [INFO    ] RUN ComputeUnit 'unit.000375' state changed from 'StagingInput' to 'Failed'.
2015:06:10 10:53:09 18579  Thread-3     radical.repex.pk-patternB-multiD: [INFO    ] ComputeUnit 'unit.000375' state changed to Failed.
2015:06:10 10:53:09 18579  Thread-3     radical.repex.pk-patternB-multiD: [ERROR   ] Log: {'log': [<radical.pilot.logentry.Logentry object at 0x7fde8f872890>, <radical.pilot.logentry.Logentry object at 0x7fde8f872d50>, <radical.pilot.logentry.Logentry object at 0x7fde8f872810>, <radical.pilot.logentry.Logentry object at 0x7fde8f872e50>], 'state': u'Failed', 'working_directory': u'sftp://stampede.tacc.utexas.edu/work/00661/tg458185/radical.pilot.sandbox/rp.session.taisung-fedora.taisung.016596.0003-pilot.0000//unit.000375', 'uid': 'unit.000375', 'submission_time': datetime.datetime(2015, 6, 10, 14, 52, 29, 168000), 'execution_details': {u'stdout': None, u'Agent_Output_Directives': [], u'Agent_Output_Status': None, u'exec_locs': None, u'FTW_Input_Directives': [{u'target': u'ace_ala_nme.mdin', u'priority': 0, u'source': u'/media/sf_USB/Research_work/ActiveProjects/CDI-SAGA/workspace/RepEx/examples/amber_pattern_b_3d_tuu/amber_inp/ace_ala_nme.mdin', u'state': u'Pending', u'flags': [u'CreateParents', u'SkipFailed'], u'action': u'Transfer'}], u'log': [{u'timestamp': datetime.datetime(2015, 6, 10, 14, 52, 29, 260000), u'message': u'Scheduled for data transfer to ComputePilot pilot.0000.'}, {u'timestamp': datetime.datetime(2015, 6, 10, 14, 52, 29, 684000), u'message': u'unit needs input staging'}, {u'timestamp': datetime.datetime(2015, 6, 10, 14, 52, 29, 715000), u'message': u"Copy'ed /work/00661/tg458185/radical.pilot.sandbox/rp.session.taisung-fedora.taisung.016596.0003-pilot.0000/staging_area/matrix_calculator_us_ex.py to /work/00661/tg458185/radical.pilot.sandbox/rp.session.taisung-fedora.taisung.016596.0003-pilot.0000/unit.000375/matrix_calculator_us_ex.py - success"}, {u'timestamp': datetime.datetime(2015, 6, 10, 14, 53, 9, 194000), u'message': u'Input transfer failed: cannot release object -- not managed'}], u'exit_code': None, u'FTW_Input_Status': u'Executing', u'state': u'Failed', u'unitmanager': u'55784b7bbc3ea948935fb5a8', u'statehistory': [{u'timestamp': 
datetime.datetime(2015, 6, 10, 14, 52, 29, 167000), u'state': u'Scheduling'}, {u'timestamp': datetime.datetime(2015, 6, 10, 14, 52, 29, 591000), u'state': u'StagingInput'}, {u'timestamp': datetime.datetime(2015, 6, 10, 14, 52, 29, 684000), u'state': u'StagingInput'}, {u'timestamp': datetime.datetime(2015, 6, 10, 14, 52, 59, 957000), u'state': u'StagingInput'}, {u'timestamp': datetime.datetime(2015, 6, 10, 14, 53, 9, 194000), u'state': u'Failed'}], u'pilot': u'pilot.0000', u'FTW_Output_Directives': [{u'target': u'matrix_column_55_2.dat', u'priority': 0, u'source': u'matrix_column_55_2.dat', u'state': u'Pending', u'flags': [u'CreateParents', u'SkipFailed'], u'action': u'Transfer'}], u'pilot_sandbox': u'sftp://stampede.tacc.utexas.edu/work/00661/tg458185/radical.pilot.sandbox/rp.session.taisung-fedora.taisung.016596.0003-pilot.0000/', u'description': {u'kernel': None, u'executable': u'python', u'name': None, u'restartable': False, u'stdout': None, u'output_staging': [{u'action': u'Transfer', u'source': u'matrix_column_55_2.dat', u'flags': [u'CreateParents', u'SkipFailed'], u'target': u'matrix_column_55_2.dat', u'priority': 0}], u'pre_exec': [u'module load TACC', u'module load amber/12.0'], u'mpi': False, u'environment': None, u'cleanup': False, u'arguments': [u'matrix_calculator_us_ex.py', u'{"replica_cycle": "2", "current_group_rst": {"55": "ace_ala_nme_us.RST.55", "54": "ace_ala_nme_us.RST.54", "53": "ace_ala_nme_us.RST.53", "52": "ace_ala_nme_us.RST.52"}, "base_name": "ace_ala_nme_remd", "replicas": "64", "amber_input": "ace_ala_nme.mdin", "amber_parameters": "ace_ala_nme.parm7", "init_temp": "377.976314968", "replica_id": "55"}'], u'stderr': None, u'cores': 1, u'post_exec': None, u'input_staging': [{u'action': u'Transfer', u'source': u'/media/sf_USB/Research_work/ActiveProjects/CDI-SAGA/workspace/RepEx/examples/amber_pattern_b_3d_tuu/amber_inp/ace_ala_nme.mdin', u'flags': [u'CreateParents', u'SkipFailed'], u'target': u'ace_ala_nme.mdin', u'priority': 0}, 
{u'action': u'Copy', u'source': u'staging:///matrix_calculator_us_ex.py', u'flags': [u'CreateParents', u'SkipFailed'], u'target': u'matrix_calculator_us_ex.py', u'priority': 0}, {u'action': u'Copy', u'source': u'staging:///ace_ala_nme_us.RST.52', u'flags': [u'CreateParents', u'SkipFailed'], u'target': u'ace_ala_nme_us.RST.52', u'priority': 0}, {u'action': u'Copy', u'source': u'staging:///ace_ala_nme_us.RST.53', u'flags': [u'CreateParents', u'SkipFailed'], u'target': u'ace_ala_nme_us.RST.53', u'priority': 0}, {u'action': u'Copy', u'source': u'staging:///ace_ala_nme_us.RST.54', u'flags': [u'CreateParents', u'SkipFailed'], u'target': u'ace_ala_nme_us.RST.54', u'priority': 0}, {u'action': u'Copy', u'source': u'staging:///ace_ala_nme_us.RST.55', u'flags': [u'CreateParents', u'SkipFailed'], u'target': u'ace_ala_nme_us.RST.55', u'priority': 0}, {u'action': u'Copy', u'source': u'staging:///ace_ala_nme_remd_55_2.rst', u'flags': [u'CreateParents', u'SkipFailed'], u'target': u'ace_ala_nme_remd_55_2.rst', u'priority': 0}]}, u'restartable': False, u'started': None, u'FTW_Output_Status': u'New', u'finished': None, u'Agent_Input_Directives': [{u'target': u'matrix_calculator_us_ex.py', u'priority': 0, u'source': u'staging:///matrix_calculator_us_ex.py', u'state': u'Done', u'flags': [u'CreateParents', u'SkipFailed'], u'action': u'Copy'}, {u'target': u'ace_ala_nme_us.RST.52', u'priority': 0, u'source': u'staging:///ace_ala_nme_us.RST.52', u'state': u'Pending', u'flags': [u'CreateParents', u'SkipFailed'], u'action': u'Copy'}, {u'target': u'ace_ala_nme_us.RST.53', u'priority': 0, u'source': u'staging:///ace_ala_nme_us.RST.53', u'state': u'Pending', u'flags': [u'CreateParents', u'SkipFailed'], u'action': u'Copy'}, {u'target': u'ace_ala_nme_us.RST.54', u'priority': 0, u'source': u'staging:///ace_ala_nme_us.RST.54', u'state': u'Pending', u'flags': [u'CreateParents', u'SkipFailed'], u'action': u'Copy'}, {u'target': u'ace_ala_nme_us.RST.55', u'priority': 0, u'source': 
u'staging:///ace_ala_nme_us.RST.55', u'state': u'Pending', u'flags': [u'CreateParents', u'SkipFailed'], u'action': u'Copy'}, {u'target': u'ace_ala_nme_remd_55_2.rst', u'priority': 0, u'source': u'staging:///ace_ala_nme_remd_55_2.rst', u'state': u'Pending', u'flags': [u'CreateParents', u'SkipFailed'], u'action': u'Copy'}], u'Agent_Input_Status': u'Done', u'submitted': datetime.datetime(2015, 6, 10, 14, 52, 29, 168000), u'sandbox': u'sftp://stampede.tacc.utexas.edu/work/00661/tg458185/radical.pilot.sandbox/rp.session.taisung-fedora.taisung.016596.0003-pilot.0000//unit.000375', u'stderr': None, u'_id': u'unit.000375'}, 'stop_time': None, 'start_time': None, 'exit_code': None, 'name': None}
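
The traceback above ends in `lease_manager.release()` raising because the object handed back is not one the pool currently tracks as leased. As a rough illustration of that failure mode only, here is a toy pool written for this note. It is not the `radical.utils` implementation; the class name, its methods, and the `max_size` parameter are all hypothetical, and the sketch is written for Python 3, whereas the log comes from Python 2.7.

```python
import threading


class ToyLeaseManager:
    """Hypothetical, simplified pool; NOT the radical.utils lease manager."""

    def __init__(self, factory, max_size=2):
        self._factory = factory      # callable that creates a new pooled object
        self._max_size = max_size    # upper bound on simultaneously leased objects
        self._lock = threading.Lock()
        self._free = []              # objects available for leasing
        self._leased = {}            # id(obj) -> obj for objects currently leased

    def lease(self):
        with self._lock:
            if self._free:
                obj = self._free.pop()
            elif len(self._leased) < self._max_size:
                obj = self._factory()
            else:
                raise RuntimeError("pool exhausted")
            self._leased[id(obj)] = obj
            return obj

    def release(self, obj):
        with self._lock:
            # Releasing something the pool does not track as leased
            # (for example a second release of the same object) fails.
            if id(obj) not in self._leased:
                raise RuntimeError("cannot release object -- not managed")
            del self._leased[id(obj)]
            self._free.append(obj)


if __name__ == "__main__":
    pool = ToyLeaseManager(factory=object)
    shell = pool.lease()
    pool.release(shell)          # first release succeeds
    try:
        pool.release(shell)      # second release is rejected
    except RuntimeError as err:
        print(err)
```

In this toy version a double release is enough to trigger the message. Whether the real failure comes from a double release, from an expired pool entry, or from something else cannot be determined from the log alone.
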
antonst commented 9 years ago

Same as #620 of RP?

antonst commented 9 years ago

Taisung, could you please add the following entries to your ~/.saga.cfg:

connection_pool_size = 20
connection_pool_ttl = 1200
connection_pool_wait = 1200

Does this solve the problem?
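
For anyone creating the file from scratch, a minimal Python 3 sketch of writing these three entries into an INI-style file follows. The section name `saga.utils.pty` is an assumption, inferred from the option path `saga.utils.pty.connection_pool_ttl` in the error message quoted further down this thread. The helper name `write_pool_settings` is made up for this example, and the demo writes to a temporary path rather than to `~/.saga.cfg`.

```python
import configparser
import os
import tempfile

# Assumption: the section name is inferred from the option path
# 'saga.utils.pty.connection_pool_ttl' seen in the error message.
SECTION = "saga.utils.pty"

# The three values proposed in the comment above.
POOL_SETTINGS = {
    "connection_pool_size": 20,
    "connection_pool_ttl": 1200,
    "connection_pool_wait": 1200,
}


def write_pool_settings(path, settings=POOL_SETTINGS, section=SECTION):
    """Add or update the pool settings in an INI file, keeping other sections."""
    parser = configparser.ConfigParser()
    parser.read(path)  # a missing file is silently ignored
    if not parser.has_section(section):
        parser.add_section(section)
    for key, value in settings.items():
        parser.set(section, key, str(value))  # configparser stores strings
    with open(path, "w") as handle:
        parser.write(handle)
    return path


if __name__ == "__main__":
    # Write to a temporary location so the demo never touches a real config.
    demo_path = os.path.join(tempfile.mkdtemp(), "saga.cfg")
    write_pool_settings(demo_path)
    with open(demo_path) as handle:
        print(handle.read())
```
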

taisung commented 9 years ago

Actually, I found that this is not reproducible. Sometimes it happens, sometimes it doesn't.

This is not a good thing. I will see if I can find a way to reproduce it.


andre-merzky commented 9 years ago

Indeed, we would very much prefer this to be reproducible :( How often does the problem occur, though? Do the settings proposed by Antons make any difference at all?


antonst commented 9 years ago

By the way, there is no such file as .saga.cfg on my system. Should it be created from scratch, Andre?

andre-merzky commented 9 years ago

Yes, you can create it. It is not mandatory, so if the file is not present in $HOME, SAGA will use default settings.

antonst commented 9 years ago

Thanks Andre, now I understand. I got confused by "your", which implies that it is already present.

antonst commented 9 years ago

Which results in:

radical.utils.config.config.ValueTypeError: Option saga.utils.pty.connection_pool_ttl requires value of type '<type 'int'>' but got '<type 'str'>'.
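
The message says an integer option received a string. Values read from an INI-style file start out as strings, so one plausible reading is that the value was never converted before type checking. The Python 3 sketch below illustrates that general failure mode and is not the `radical.utils` code: `ValueTypeError` here is a local stand-in class, `check_option` is a hypothetical helper, and the `[saga.utils.pty]` section name is assumed from the option path in the message.

```python
import configparser

# Assumed layout; the section name is inferred from the error message.
CONFIG_TEXT = """
[saga.utils.pty]
connection_pool_ttl = 1200
"""


class ValueTypeError(TypeError):
    """Local stand-in for the exception named in the error message."""


def check_option(name, value, expected_type):
    """Hypothetical validator: reject values that are not of the expected type."""
    if not isinstance(value, expected_type):
        raise ValueTypeError(
            "Option %s requires value of type %r but got %r"
            % (name, expected_type, type(value))
        )
    return value


parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

# get() returns the raw text, getint() converts it to an integer.
raw_value = parser.get("saga.utils.pty", "connection_pool_ttl")
int_value = parser.getint("saga.utils.pty", "connection_pool_ttl")

if __name__ == "__main__":
    name = "saga.utils.pty.connection_pool_ttl"
    print(check_option(name, int_value, int))   # passes the check
    try:
        check_option(name, raw_value, int)      # the string is rejected
    except ValueTypeError as err:
        print(err)
```
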
antonst commented 8 years ago

I haven't seen this one in a while. Is anyone still encountering this issue? If not, I will close this ticket.

antonst commented 8 years ago

Closing due to lack of response.