radical-cybertools / radical.repex.at

This is the github location for RepEx developed by the RADICAL team in conjunction with the York Lab.

Amber 2d example hangs #14

Closed antonst closed 9 years ago

antonst commented 9 years ago

With 32 replicas on Stampede, execution stops and the following error is generated:

2014:12:16 00:00:08 radical.pilot.MainProcess: [ERROR ] Input transfer failed: read from process failed '[Errno 5] Input/output error' : (t "//home/ubuntu/repex10/RepEx/examples/amber/amber_pattern_b_2d/ala10_remd_896_0.mdin" "/work/02457/antontre/radical.pilot.sandbox/pilot-548f74b619822070e72ccf36/unit-548f75de19822070e72cd2b9/ala10_remd_896_0.mdin"
Couldn't send packet: Broken pipe
) (/home/ubuntu/222env/local/lib/python2.7/site-packages/saga_python-0.22-py2.7.egg/saga/utils/pty_process.py +643 (read) : % (e, self.tail)))
Traceback (most recent call last):
  File "/home/ubuntu/222env/lib/python2.7/site-packages/radical.pilot-0.21-py2.7.egg/radical/pilot/controller/input_file_transfer_worker.py", line 162, in run
    input_file.copy(target)
  File "/home/ubuntu/222env/local/lib/python2.7/site-packages/saga_python-0.22-py2.7.egg/saga/namespace/entry.py", line 276, in copy
    ret = self._adaptor.copy_self (tgt_url, flags, ttype=ttype)
  File "/home/ubuntu/222env/local/lib/python2.7/site-packages/saga_python-0.22-py2.7.egg/saga/adaptors/cpi/decorators.py", line 51, in wrap_function
    return sync_function (self, _args, *_kwargs)
  File "/home/ubuntu/222env/local/lib/python2.7/site-packages/saga_python-0.22-py2.7.egg/saga/adaptors/shell/shell_file.py", line 1172, in copy_self
    files_copied = copy_shell.stage_to_remote (src.path, tgt.path, rec_flag)
  File "/home/ubuntu/222env/local/lib/python2.7/site-packages/saga_python-0.22-py2.7.egg/saga/utils/pty_shell.py", line 900, in stage_to_remote
    raise ptye.translate_exception (e)
NoSuccess: read from process failed '[Errno 5] Input/output error' : (t "//home/ubuntu/repex10/RepEx/examples/amber/amber_pattern_b_2d/ala10_remd_896_0.mdin" "/work/02457/antontre/radical.pilot.sandbox/pilot-548f74b619822070e72ccf36/unit-548f75de19822070e72cd2b9/ala10_remd_896_0.mdin"
Couldn't send packet: Broken pipe
) (/home/ubuntu/222env/local/lib/python2.7/site-packages/saga_python-0.22-py2.7.egg/saga/utils/pty_process.py +643 (read) : % (e, self.tail)))

2014:12:16 00:00:08 28903 InputFileTransferWorker-2 radical.utils : [DEBUG ] lm release object

It worked fine with a smaller number of replicas.

andre-merzky commented 9 years ago

Antons, could you please provide a full log for the failing run? Thanks!

antonst commented 9 years ago

I will, Andre, assuming the run I am currently performing fails with the same error :-)

haoyuanchen commented 9 years ago

Brian told me before that there seems to be a limit on total file counts on some XSEDE clusters. Since it fails with 32 replicas (actually it's 32x32=1024, right?) but works with a smaller system, could this be a possible reason? There might be quite a lot of files produced in a 1024 replica simulation.
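For reference, the replica count mentioned above can be checked with a short sketch. It assumes the total for a multi-dimensional exchange is the product of the per-dimension replica counts, which is how the 32x32 figure is read here; the function name `total_replicas` is hypothetical and not part of RepEx.

```python
def total_replicas(replicas_per_dimension):
    """Return the product of the per-dimension replica counts.

    Assumption: every combination of per-dimension values gets its own
    replica, as in the 32x32 reading of the 2D example.
    """
    total = 1
    for count in replicas_per_dimension:
        total *= count
    return total


# Two dimensions with 32 replicas each.
print(total_replicas([32, 32]))  # 1024
```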

antonst commented 9 years ago

You are correct that we might eventually hit the file count limit, but for now this appears to be a file transfer issue. My current file usage on the Stampede work file system is 167549, while the limit is 3000000.

Thanks, Antons
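To put the numbers quoted above in perspective, here is a minimal sketch of the headroom calculation. The usage and limit values come from the comment; the 1024-replica figure is the estimate raised earlier in the thread, and the per-replica budget is only an illustrative even split, not something RepEx or Stampede enforces. The helper name `quota_headroom` is hypothetical.

```python
def quota_headroom(used, limit):
    """Return (fraction of quota used, number of files still available)."""
    return used / limit, limit - used


used_files = 167549    # reported usage on the Stampede work file system
file_limit = 3000000   # reported file count limit
replicas = 1024        # assumed 32x32 replica count from the discussion

fraction, remaining = quota_headroom(used_files, file_limit)
print(f"{fraction:.1%} of the quota used, {remaining} files remaining")

# Illustrative even split of the remaining files across replicas.
print(f"about {remaining // replicas} files per replica before the limit")
```

With usage this far below the limit, the file count quota is unlikely to explain a failure during input staging, which is consistent with treating this as a transfer problem.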

andre-merzky commented 9 years ago

Antons, can you please retry with the current SAGA-Python release? Thanks!