radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

BadParameter("Invalid dir ...") #1823

Closed: YHRen closed this issue 4 years ago

YHRen commented 5 years ago

This might relate to issue #1348. My radical-stack:

  python               : 2.7.15
  pythonpath           : 
  virtualenv           : rad

  radical.pilot        : 0.50.22
  radical.utils        : 0.50.3
  saga                 : 0.50.5
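
For reference, a listing like the one above is produced by the radical-stack utility that ships with radical.utils:

    # print the versions of the installed RADICAL components
    radical-stack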

The error complains about BadParameter and "No such file or directory". The exact traceback is attached at the end.

I checked the directory in question on stampede2. The parent directory /work/06078/<username>/stampede2/radical.pilot.sandbox/rp.session.js<some-text>/ exists, but the child directory pilot.0000/ does not.

Please let me know if you need more information.

Thank you. Best wishes, Ray

Traceback (most recent call last):
  File "/home/yren/anaconda2/envs/rad/lib/python2.7/site-packages/radical/pilot/utils/session.py", line 113, in fetch_profiles
    sandbox = rs.fs.Directory (sandbox_url, session=session)
  File "/home/yren/anaconda2/envs/rad/lib/python2.7/site-packages/saga/filesystem/directory.py", line 95, in __init__
    _adaptor, _adaptor_state, _ttype=_ttype)
  File "/home/yren/anaconda2/envs/rad/lib/python2.7/site-packages/saga/namespace/directory.py", line 95, in __init__
    _adaptor, _adaptor_state, _ttype=_ttype)
  File "/home/yren/anaconda2/envs/rad/lib/python2.7/site-packages/saga/namespace/entry.py", line 89, in __init__
    url, flags, session, ttype=_ttype)
  File "/home/yren/anaconda2/envs/rad/lib/python2.7/site-packages/saga/base.py", line 104, in __init__
    self._init_task = self._adaptor.init_instance (adaptor_state, *args, **kwargs)
  File "/home/yren/anaconda2/envs/rad/lib/python2.7/site-packages/saga/adaptors/cpi/decorators.py", line 57, in wrap_function
    return sync_function (self, *args, **kwargs)
  File "/home/yren/anaconda2/envs/rad/lib/python2.7/site-packages/saga/adaptors/shell/shell_file.py", line 292, in init_instance
    self.initialize()
  File "/home/yren/anaconda2/envs/rad/lib/python2.7/site-packages/saga/adaptors/shell/shell_file.py", line 350, in initialize
    raise saga.BadParameter("invalid dir '%s': %s" % (path, out))
BadParameter: invalid dir '/work/06078/tg853774/stampede2/radical.pilot.sandbox/rp.session.js-169-25.jetstream-cloud.org.yren.017949.0003/pilot.0000/': sh: cd: /work/06078/tg853774/stampede2/radical.pilot.sandbox/rp.session.js-169-25.jetstream-cloud.org.yren.017949.0003/pilot.0000/: No such file or directory
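
For context, the call that fails here boils down to opening the remote pilot sandbox as a SAGA filesystem directory and listing profiles in it. A minimal sketch, assuming the saga-python 0.50.x API shown in the traceback; the URL is illustrative, with placeholders for username and session ID:

    # Sketch of what fetch_profiles attempts, per the traceback above.
    from __future__ import print_function

    import saga
    import saga.filesystem as rsfs

    sandbox_url = ('sftp://stampede2.tacc.utexas.edu/work/06078/<username>/'
                   'stampede2/radical.pilot.sandbox/<session-id>/pilot.0000/')

    try:
        sandbox  = rsfs.Directory(sandbox_url)   # fails if the dir is missing
        profiles = sandbox.list('*.prof')        # the later fetch step
        print(profiles)
    except saga.BadParameter as e:
        print('pilot sandbox does not exist:', e)
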
andre-merzky commented 5 years ago

Hi Ray,

thanks for the report! Can you remind me of the cluster you are using for this?

The error you quote above is not the original error - it occurs during termination, when the session tries to fetch profiles back to the local machine, and that fails because, as you rightly observed, there is no pilot sandbox on the remote machine. I would thus expect you to find an error message in your local client sandbox, in

$PWD/radical.pilot.sandbox/rp.session.js<some-text>/pmgr.0000.launching.0.child.log

Could you please check that file, look for an ERROR log entry, and paste that entry and the exception trace next to it, if there is one?
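
For example, something like this should surface the relevant entry (a sketch; adjust the session directory to yours):

    # print ERROR entries plus the traceback lines that follow them
    grep -n -A 20 'ERROR' \
        $PWD/radical.pilot.sandbox/rp.session.js*/pmgr.0000.launching.0.child.log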

Thanks!

YHRen commented 5 years ago

Hi @andre-merzky

Thank you so much for getting back to me so quickly.

thanks for the report! Can you remind me of the cluster you are using for this?

I'm using XSEDE Stampede2.

Could you please check that file, look for an ERROR log entry, and paste that entry and the exception trace next to it, if there is one?

Here is the trace in pmgr.0000.launching.0.child.log:

2019-02-22 13:02:39,714: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : MainThread     : ERROR   : bulk launch failed
Traceback (most recent call last):
  File "/home/yren/anaconda2/envs/rad/lib/python2.7/site-packages/radical/pilot/pmgr/launching/default.py", line 492, in work
    self._start_pilot_bulk(resource, schema, pilots)
  File "/home/yren/anaconda2/envs/rad/lib/python2.7/site-packages/radical/pilot/pmgr/launching/default.py", line 658, in _start_pilot_bulk
    js_tmp  = rs.job.Service(js_url, session=self._session)
  File "/home/yren/anaconda2/envs/rad/lib/python2.7/site-packages/saga/job/service.py", line 115, in __init__
    url, session, ttype=_ttype)
  File "/home/yren/anaconda2/envs/rad/lib/python2.7/site-packages/saga/base.py", line 104, in __init__
    self._init_task = self._adaptor.init_instance (adaptor_state, *args, **kwargs)
  File "/home/yren/anaconda2/envs/rad/lib/python2.7/site-packages/saga/adaptors/cpi/decorators.py", line 57, in wrap_function
    return sync_function (self, *args, **kwargs)
  File "/home/yren/anaconda2/envs/rad/lib/python2.7/site-packages/saga/adaptors/shell/shell_job.py", line 516, in init_instance
    self.initialize ()
  File "/home/yren/anaconda2/envs/rad/lib/python2.7/site-packages/saga/adaptors/shell/shell_job.py", line 633, in initialize
    raise saga.NoSuccess ("failed to run bootstrap: (%s)(%s)" % (ret, out))
NoSuccess: failed to run bootstrap: (127)(/bin/sh: .saga/adaptors/shell_job//wrapper.sh: No such file or directory
) (/home/yren/anaconda2/envs/rad/lib/python2.7/site-packages/saga/adaptors/shell/shell_job.py +633 (initialize)  :  raise saga.NoSuccess ("failed to run bootstrap: (%s)(%s)" % (ret, out)))
2019-02-22 13:02:39,714: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : MainThread     : DEBUG   : advance bulk size: 1 [False, True]
andre-merzky commented 5 years ago

Ah, you found it, great!

But I'll be damned - that is an error which pops up now and then, and which I never manage to reproduce and track down :( Apologies. The good news is that there is an easy workaround: on Stampede2, please run rm -r ~/.saga/. Your application should get beyond this problem then, and the error will not reappear anytime soon (it seems to be triggered by some Python module update).
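
In shell terms the workaround is simply this (the wrapper path in the comment is the one from the traceback above):

    # On a Stampede2 login node: wipe the cached SAGA shell wrappers
    # (e.g. ~/.saga/adaptors/shell_job/wrapper.sh, which the traceback
    # showed as missing); they are regenerated on the next connection.
    rm -r ~/.saga/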

YHRen commented 5 years ago

Hi @andre-merzky

Thanks a lot! Yes, removing ~/.saga/ solves the problem :D The MPI code runs and terminates (successfully, I assume).

But here is the ERROR in the local rp.session.js-169-25.jetstream-cloud.org.yren.017949.0004.log file. (I feel the title is not capturing the issue properly; let me know if I should start a new issue or change the title.)

2019-02-22 15:47:55,512: rp.session.js-169-25.jetstream-cloud.org.yren.017949.0004: MainProcess                     : MainThread     : ERROR   : failed to fet profile for pilot.0000
Traceback (most recent call last):
  File "/home/yren/anaconda2/envs/rad/lib/python2.7/site-packages/radical/pilot/utils/session.py", line 172, in fetch_profiles
    profiles = sandbox.list('*.prof')
  File "/home/yren/anaconda2/envs/rad/lib/python2.7/site-packages/saga/namespace/directory.py", line 243, in list
    return self._adaptor.list (pattern, flags, ttype=ttype)
  File "/home/yren/anaconda2/envs/rad/lib/python2.7/site-packages/saga/adaptors/cpi/decorators.py", line 57, in wrap_function
    return sync_function (self, *args, **kwargs)
  File "/home/yren/anaconda2/envs/rad/lib/python2.7/site-packages/saga/adaptors/shell/shell_file.py", line 473, in list
    % (ret, out))
NoSuccess: failed to list(): (2)(/bin/ls: cannot access *.prof: No such file or directory
) (/home/yren/anaconda2/envs/rad/lib/python2.7/site-packages/saga/adaptors/shell/shell_file.py +473 (list)  :  % (ret, out)))
YHRen commented 5 years ago

I tried the solution mentioned in #1737. Nope... the error remains...

andre-merzky commented 5 years ago

Hi Ray,

cannot access *.prof is a red herring, I'm afraid: the RP examples attempt to download session profiles after completion, but that can fail if the pilot did not manage to run and thus did not produce any profiles in the first place.

Can you please check if you see any exception logged in rp.session.js-169-25.jetstream-cloud.org.yren.017949.0004/pmgr.0000.launching.0.child.log? If not, can you please check if the pilot sandbox exists on the remote host, and send a tarball of that sandbox? Thanks.
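
For instance, along these lines (a sketch; substitute your username and session ID in the sandbox path pattern from the earlier traceback):

    # On Stampede2: verify the pilot sandbox exists, then pack it up.
    cd /work/06078/<username>/stampede2/radical.pilot.sandbox/
    ls <session-id>/pilot.0000/
    tar czf sandbox.tgz <session-id>/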

YHRen commented 5 years ago

Hi @andre-merzky

Thank you so much for the feedback.

but that can fail if the pilot did not manage to run

I think the pilot has launched the job successfully, since squeue -u $(whoami) has listed the program as running for about 15 mins.

Can you please check if you see any exception logged in rp.session.js-169-25.jetstream-cloud.org.yren.017949.0004/pmgr.0000.launching.0.child.log

There is no obvious error in rp.session.js-169-25.jetstream-cloud.org.yren.017949.0004/pmgr.0000.launching.0.child.log (it is 0006 now; I tried setting RADICAL_PROFILE=TRUE). It only contains one WARNING. I attached the last several lines at the end.

If not, can you please check if the pilot sandbox exists on the remote host, and send a tarball of that sandbox? Thanks.

I put the tarball here.

Thanks a ton!

2019-02-22 17:39:20,753: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : pmgr.0000.launching.0.child.idler._pilot_watcher_cb: DEBUG   : saga job state: pilot.0000 Running
2019-02-22 17:39:31,015: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : pmgr.0000.launching.0.child.idler._pilot_watcher_cb: DEBUG   : bulk states: [['Running']]
2019-02-22 17:39:31,246: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : pmgr.0000.launching.0.child.idler._pilot_watcher_cb: DEBUG   : saga job state: pilot.0000 Running
2019-02-22 17:39:41,529: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : pmgr.0000.launching.0.child.idler._pilot_watcher_cb: DEBUG   : bulk states: [['Running']]
2019-02-22 17:39:41,765: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : pmgr.0000.launching.0.child.idler._pilot_watcher_cb: DEBUG   : saga job state: pilot.0000 Running
2019-02-22 17:39:54,024: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : pmgr.0000.launching.0.child.idler._pilot_watcher_cb: DEBUG   : bulk states: [['Running']]
2019-02-22 17:39:54,258: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : pmgr.0000.launching.0.child.idler._pilot_watcher_cb: DEBUG   : saga job state: pilot.0000 Running
2019-02-22 17:40:01,223: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : pmgr.0000.launching.0.child.subscriber._cancel_monitor_cb: DEBUG   : command ignored: cancel_pilots
2019-02-22 17:40:01,223: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : pmgr.0000.launching.0.child.subscriber._pmgr_control_cb: DEBUG   : launcher got {'cmd': 'cancel_pilots', 'arg': {'uids': ['pilot.0000'], 'pmgr': 'pmgr.0000'}}
2019-02-22 17:40:01,223: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : pmgr.0000.launching.0.child.subscriber._pmgr_control_cb: INFO    : received pilot_cancel command (['pilot.0000'])
2019-02-22 17:40:01,235: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : pmgr.0000.launching.0.child.watch: INFO    : recv message: STOP
2019-02-22 17:40:01,235: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : pmgr.0000.launching.0.child.watch: INFO    : message received: STOP
2019-02-22 17:40:01,235: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : pmgr.0000.launching.0.child.watch: INFO    : STOP received: STOP
2019-02-22 17:40:01,235: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : pmgr.0000.launching.0.child.watch: INFO    : watcher closes
2019-02-22 17:40:01,235: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : pmgr.0000.launching.0.child.watch: INFO    : put message: [pmgr.0000.launching.0.child.watch.thread] work finished
2019-02-22 17:40:01,236: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : pmgr.0000.launching.0.child.watch: DEBUG   : ru_finalize_child (NOOP)
2019-02-22 17:40:01,236: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : pmgr.0000.launching.0.child.watch: DEBUG   : ru_finalize_common (NOOP)
2019-02-22 17:40:01,236: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : pmgr.0000.launching.0.child.watch: INFO    : put message: [pmgr.0000.launching.0.child.watch.thread] terminating
2019-02-22 17:40:01,438: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : pmgr.0000.launching.0.child.subscriber._pmgr_control_cb: DEBUG   : pilot(s).need(s) cancellation ['pilot.0000']
2019-02-22 17:40:01,438: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : pmgr.0000.launching.0.child.subscriber._pmgr_control_cb: DEBUG   : update cancel req: pilot.0000 1550875201.44
2019-02-22 17:40:01,452: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : MainThread     : WARNING : alive check: proc invalid - stop [False - True]
2019-02-22 17:40:01,505: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : MainThread     : INFO    : stop pmgr.0000.launching.0.child (3707 : 3707 : MainThread) [radical.utils.process.Default.is_valid]
2019-02-22 17:40:01,511: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : MainThread     : INFO    : parent stops child  3707 -> 3707 [pmgr.0000.launching.0.child]
2019-02-22 17:40:01,511: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : MainThread     : INFO    : send message: [pmgr.0000.launching.0.child] STOP
2019-02-22 17:40:01,511: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : MainThread     : WARNING : send failed ([Errno 9] Bad file descriptor) - terminate
2019-02-22 17:40:01,511: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : MainThread     : DEBUG   : TERM : pmgr.0000.launching.0.child unregister idler pmgr.0000.launching.0.child.idler._pilot_watcher_cb
2019-02-22 17:40:01,511: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : MainThread     : INFO    : parent stops child
2019-02-22 17:40:01,511: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : MainThread     : DEBUG   : ru_finalize_parent (NOOP)
2019-02-22 17:40:01,512: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : MainThread     : DEBUG   : ru_finalize_common (NOOP)
2019-02-22 17:40:01,540: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : pmgr.0000.launching.0.child.idler._pilot_watcher_cb: DEBUG   : ru_finalize_child (NOOP)
2019-02-22 17:40:01,540: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : pmgr.0000.launching.0.child.idler._pilot_watcher_cb: DEBUG   : ru_finalize_common (NOOP)
2019-02-22 17:40:01,540: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : pmgr.0000.launching.0.child.idler._pilot_watcher_cb: INFO    : put message: [pmgr.0000.launching.0.child.idler._pilot_watcher_cb.thread] terminating
2019-02-22 17:40:01,543: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : MainThread     : DEBUG   : TERM : pmgr.0000.launching.0.child unregistered idler pmgr.0000.launching.0.child.idler._pilot_watcher_cb
2019-02-22 17:40:01,544: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : MainThread     : DEBUG   : TERM : pmgr.0000.launching.0.child unregister subscriber pmgr.0000.launching.0.child.subscriber._pmgr_control_cb
2019-02-22 17:40:01,544: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : MainThread     : DEBUG   : unregistered subscriber pmgr.0000.launching.0.child.subscriber._pmgr_control_cb
2019-02-22 17:40:01,569: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : MainThread     : DEBUG   : pilot(s).need(s) cancellation ['pilot.0000']
2019-02-22 17:40:01,570: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : MainThread     : DEBUG   : update cancel req: pilot.0000 1550875201.57
2019-02-22 17:40:01,570: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : MainThread     : DEBUG   : killing pilots: ['pilot.0000']
2019-02-22 17:40:01,570: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : MainThread     : DEBUG   : killing pilots: [1550875201.570118]
2019-02-22 17:40:01,570: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : MainThread     : DEBUG   : killing pilots: last cancel: 1550875201.57
2019-02-22 17:40:01,570: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : MainThread     : DEBUG   : killing pilots: check 1550875201.57 < 1550875201.57 + 120
2019-02-22 17:40:01,570: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : MainThread     : DEBUG   : killing pilots: alive pilot.0000
2019-02-22 17:40:02,572: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : MainThread     : DEBUG   : killing pilots: check 1550875202.57 < 1550875201.57 + 120
2019-02-22 17:40:02,572: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : MainThread     : DEBUG   : killing pilots: alive pilot.0000
andre-merzky commented 5 years ago

Good news: the pilot and your workload actually executed correctly - it's all there as expected in the pilot and unit sandboxes!

The quick workaround for that error during termination (/bin/ls: cannot access *.prof: No such file or directory) would be to set RADICAL_PILOT_PROFILE=True. I'll look into why the error gets triggered without it. Please let me know if that setting resolves the problem.

Thanks, Andre.

YHRen commented 5 years ago

Hi @andre-merzky

Thanks! Previously, I set RADICAL_PROFILE=True as you suggested in #1737. So I should also set RADICAL_PILOT_PROFILE=True - is that correct? I will give it a try.

andre-merzky commented 5 years ago

You are caught in a code transition, unfortunately - the code should be backward compatible with either setting, but apparently is not. This is fixed in the upcoming release - but I would like you to stick with your current RCT stack so as not to introduce new uncertainties. Setting both variables will work around the transition problem.
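
Concretely, before launching the client script (a sketch; the script name is a placeholder):

    # set both the old and the new name of the profiling switch
    export RADICAL_PROFILE=True
    export RADICAL_PILOT_PROFILE=True
    python my_rp_application.py   # placeholder for your RP client script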

Thanks, Andre.

YHRen commented 5 years ago

Got it. Thanks! Stampede2 is under maintenance today; I will try tomorrow.

YHRen commented 5 years ago

Hi @andre-merzky

After the Stampede2 maintenance, the code does not run anymore. I think the maintenance broke the SAGA-generated SLURM scripts somehow.

2019-02-27 14:30:44,827: radical.saga.cpi    : pmgr.0000.launching.0           : Thread-2       : DEBUG   : run_sync: mkdir -p /work/06078/tg853774/stampede2/radical.pilot.sandbox/rp.session.js-169-25.jetstream-cloud.org.yren.017954.0001/pilot.0000/
2019-02-27 14:30:44,867: radical.saga.cpi    : pmgr.0000.launching.0           : Thread-2       : INFO    : SLURM script generated:
#!/bin/sh

#SBATCH --ntasks=68
#SBATCH --ntasks-per-node=68
#SBATCH --workdir /work/06078/tg853774/stampede2/radical.pilot.sandbox/rp.session.js-169-25.jetstream-cloud.org.yren.017954.0001/pilot.0000/
#SBATCH --output bootstrap_0.out
#SBATCH --error bootstrap_0.err
#SBATCH --partition normal
#SBATCH -J "pilot.0000"
#SBATCH --account TG-MCB090174
#SBATCH --time 01:00:00

## ENVIRONMENT
export "RADICAL_PILOT_PROFILE"="TRUE"

## EXEC
/bin/bash -l /work/06078/tg853774/stampede2/radical.pilot.sandbox/rp.session.js-169-25.jetstream-cloud.org.yren.017954.0001/bootstrap_0.sh  -d 'radical.utils-0.50.3.tar.gz:saga-python-0.50.5.tar.gz:radical.pilot-0.50.22.tar.gz' -p 'pilot.0000' -s 'rp.session.js-169-25.jetstream-cloud.org.yren.017954.0001' -m 'create' -r 'local' -b 'default' -g 'default' -v '/work/06078/tg853774/stampede2/radical.pilot.sandbox/ve.xsede.stampede2_ssh.0.50.22' -y '60' -e 'module load TACC' -e 'module load intel/17.0.4' -e 'module load python/2.7.13'

2019-02-27 14:30:46,019: radical.saga.cpi    : pmgr.0000.launching.0           : Thread-2       : DEBUG   : run_sync: sbatch 'tmp_6ti9KT.slurm'; rm -vf 'tmp_6ti9KT.slurm'
2019-02-27 14:30:46,099: radical.saga.cpi    : pmgr.0000.launching.0           : Thread-2       : DEBUG   : staged/submit SLURM script (tmp_6ti9KT.slurm) (0)
2019-02-27 14:30:46,099: radical.saga.cpi    : pmgr.0000.launching.0           : Thread-2       : ERROR   : NoSuccess: Couldn't get job id from submitted job! sbatch output:

-----------------------------------------------------------------
          Welcome to the Stampede2 Supercomputer                 
-----------------------------------------------------------------

--> Submission error: please define total node count with the "-N" option
removed ‘tmp_6ti9KT.slurm’
andre-merzky commented 5 years ago

Ah bugger, nothing is ever easy... This needs fixing on the SAGA layer (our resource access layer); see the referenced ticket. We'll try to do so quickly.
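
For reference, Stampede2 now insists on an explicit total node count, so the fix amounts to SAGA emitting one more directive in the generated script, roughly like this (a sketch of the missing directive, not the actual patch):

    #SBATCH --ntasks=68
    #SBATCH --ntasks-per-node=68
    #SBATCH --nodes=1   # newly required: ntasks / ntasks-per-node; same as '-N 1'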

YHRen commented 5 years ago

Thanks, @andre-merzky. I wish the HPC clusters (all over the world) could provide some consistent API... Best, Ray

andre-merzky commented 5 years ago

Well, our SAGA layer is that consistent API - but it means that we need to play catch-up with the underlying native APIs... There were attempts to make those consistent as well, and to a certain degree that actually worked - but they are (and will be) different enough that it matters... sigh

There is now an open PR for this issue. You can also give the saga-python branch fix/issue_706 a try and let us know if that resolves the problem.

YHRen commented 5 years ago

@andre-merzky

This is probably a small issue. I tried pip install -e git://github.com/saga-project/saga-python.git@fix/issue_706#egg=saga-python-fix-706 and got radical-pilot 0.50.22 has requirement netifaces==0.10.4, but you'll have netifaces 0.10.8 which is incompatible. I tried to downgrade netifaces, but it seems RP has a dependency on version 0.10.8. Maybe change the netifaces dependency to >=0.10.4 instead?
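
For illustration, that relaxation would change radical.pilot's setup.py roughly like this (a hypothetical excerpt, not the actual file contents):

    # Hypothetical excerpt of setup.py: relax the exact pin so that
    # netifaces 0.10.8 (pulled in by the saga-python branch) also satisfies it.
    install_requires = [
        'netifaces>=0.10.4',   # was: 'netifaces==0.10.4'
    ]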

The code runs now. I will wait and report back soon :D Thanks a ton!

YHRen commented 5 years ago

@andre-merzky

I have been waiting for over 30 mins, and no jobs have been launched. I think SAGA hangs somehow at this step:

Welcome to Stampede2, *please* read these important system notes:

--> Stampede2 user documentation is available at:
       https://portal.tacc.utexas.edu/user-guides/stampede2

--------------------- Project balances for user tg853774 ----------------------
| Name           Avail SUs     Expires |                                      |
| TG-MCB090174      115740  2019-09-30 |                                      |
------------------------ Disk quotas for user tg853774 ------------------------
| Disk         Usage (GB)     Limit    %Used   File Usage       Limit   %Used |
| /home1              0.1      10.0     1.40         6425      200000    3.21 |
| /work               0.7    1024.0     0.07        21145     3000000    0.70 |
| /scratch            0.0       0.0     0.00            3           0    0.00 |
-------------------------------------------------------------------------------

Tip 144   (See "module help tacc_tips" for features or how to disable)

   You can open a file and jump to a particular line with:
       $ vi +10 <file>

)
2019-03-04 11:13:11,451: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : running command shell: exec /bin/sh -i
2019-03-04 11:13:11,451: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : write: [  143] [   30] ( stty -echo ; exec /bin/sh -i\n)
2019-03-04 11:13:11,671: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : read : [  143] [   15] (login1(1001)$ $)
2019-03-04 11:13:11,671: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : flush: [  143] [     ] (flush pty read cache)
2019-03-04 11:13:11,709: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : read : [  143] [    1] ($)
2019-03-04 11:13:11,709: radical.saga.pty    : MainProcess                     : MainThread     : WARNING : flush: [  143] [    1] (discard data : '$')
2019-03-04 11:13:11,811: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : write: [  143] [  152] (set HISTFILE=$HOME/.saga_history; PS1='PROMPT-$?->'; PS2=''; PROMPT_COMMAND=''; export PS1 PS2 PROMPT_COMMAND 2>&1 >/dev/null; cd $HOME 2>&1 >/dev/null\n)
2019-03-04 11:13:11,848: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : read : [  143] [   10] (PROMPT-0->)
2019-03-04 11:13:11,848: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : got new shell prompt
2019-03-04 11:13:11,848: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : flush: [  143] [     ] (flush pty read cache)
2019-03-04 11:13:11,950: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : flush: [  143] [     ] (flush pty read cache)
2019-03-04 11:13:12,052: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : run_sync: echo "WORKDIR: $WORK"
2019-03-04 11:13:12,053: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : write: [  143] [   22] (echo "WORKDIR: $WORK"\n)
2019-03-04 11:13:12,089: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : read : [  143] [   51] (WORKDIR: /work/06078/tg853774/stampede2\nPROMPT-0->)
2019-03-04 11:13:12,090: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : PTYShell del  <saga.utils.pty_shell.PTYShell object at 0x7f0ea6c8cc10>
2019-03-04 11:13:12,191: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : PTYProcess del  <saga.utils.pty_process.PTYProcess object at 0x7f0ea44118d0>
(current time: Mon Mar  4 11:45:51 EST 2019)
iparask commented 5 years ago

Hello @YHRen, can you log in to Stampede2 and run showq -u? This will show us whether there is a pilot job in Stampede2's queue.

Thank you!

YHRen commented 5 years ago

@iparask

Thank you for following up on this.

login1(1006)$ showq -u

SUMMARY OF JOBS FOR USER: <tg853774>

ACTIVE JOBS--------------------
JOBID     JOBNAME    USERNAME      STATE   NODES REMAINING STARTTIME
================================================================================

WAITING JOBS------------------------
JOBID     JOBNAME    USERNAME      STATE   NODES WCLIMIT   QUEUETIME
================================================================================

Total Jobs: 0     Active Jobs: 0     Idle Jobs: 0     Blocked Jobs: 0   
YHRen commented 5 years ago

@iparask

I don't know if this is useful: I removed ~/.saga/ on the remote host (stampede2) and gave it another try. The process still hangs at

2019-03-04 12:45:50,020: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : PTYProcess del  <saga.utils.pty_process.PTYProcess object at 0x7fc2e82e4b10>
andre-merzky commented 5 years ago

Can you please attach the client side sandbox? Is a pilot sandbox created on Stampede2? Thanks, Andre.

YHRen commented 5 years ago

Hi @andre-merzky

Thank you for following up. Here is the sandbox on the client side. I don't think a sandbox was created on stampede2, as there is no pilot sub-directory and the file umgr.0000.staging.input.0.child.out is empty. The process hangs at PTYProcess del for more than 30 mins, and I have to kill it with Ctrl+C.

Please let me know if you need more information.

shantenujha commented 5 years ago

Hi @andre-merzky

Any input on what is happening on Stampede2? I'm meeting with Ray, and we were wondering if there is anything more that Ray can provide.

shantenujha commented 5 years ago

@andre-merzky ping. Status update, please.

andre-merzky commented 5 years ago

This is actually fixed; we'll release it ASAP (it was in review and testing).

andre-merzky commented 5 years ago

It will be released in v0.60, which is scheduled toward the end of this week.

andre-merzky commented 5 years ago

The release is delayed by a week or two - we are overloaded at the moment :/ Having said that, the current set of devel branches for RU, RS and RP contains that fix and should be functional (they are the release candidates).

andre-merzky commented 4 years ago

Closed as fixed.