YHRen closed this issue 4 years ago.
Hi Ray,
thanks for the report! Can you remind me of the cluster you are using for this?
The error you quote above is not the original error - this occurs during termination when the session tries to fetch profiles back to the local machine - and that fails because, as you rightly observed, there is no pilot sandbox on the remote machine. I would thus think that you will find an error message in your local client sandbox, in
$PWD/radical.pilot.sandbox/rp.session.js<some-text>/pmgr.0000.launching.0.child.log
Could you please check that file, look for an ERROR log entry, and paste that entry and the exception trace next to it, if there is one?
Thanks!
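For reference, scanning such a log for ERROR entries together with the traceback lines that follow them can be sketched with a small helper. This is a hypothetical illustration, not part of RP; it only assumes the log format shown above (timestamped lines, with traceback lines un-timestamped):

```python
import re

def find_error_entries(log_text):
    """Return each ERROR log line together with the traceback lines
    that immediately follow it (lines not starting with a timestamp)."""
    ts = re.compile(r"^\d{4}-\d{2}-\d{2} ")
    lines = log_text.splitlines()
    entries = []
    for i, line in enumerate(lines):
        if " ERROR " in line:
            block = [line]
            for follow in lines[i + 1:]:
                if ts.match(follow):     # next timestamped entry ends the block
                    break
                block.append(follow)
            entries.append("\n".join(block))
    return entries
```

Running it over the pasted log below would return the single "bulk launch failed" entry plus its exception trace.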
Hi @andre-merzky
Thank you so much for getting back to me so quickly.
> thanks for the report! Can you remind me of the cluster you are using for this?
I'm using XSEDE Stampede2.
> Could you please check that file, look for an ERROR log entry, and paste that entry and the exception trace next to it, if there is one?
Here is the trace in pmgr.0000.launching.0.child.log:
2019-02-22 13:02:39,714: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : MainThread : ERROR : bulk launch failed
Traceback (most recent call last):
  File "/home/yren/anaconda2/envs/rad/lib/python2.7/site-packages/radical/pilot/pmgr/launching/default.py", line 492, in work
    self._start_pilot_bulk(resource, schema, pilots)
  File "/home/yren/anaconda2/envs/rad/lib/python2.7/site-packages/radical/pilot/pmgr/launching/default.py", line 658, in _start_pilot_bulk
    js_tmp = rs.job.Service(js_url, session=self._session)
  File "/home/yren/anaconda2/envs/rad/lib/python2.7/site-packages/saga/job/service.py", line 115, in __init__
    url, session, ttype=_ttype)
  File "/home/yren/anaconda2/envs/rad/lib/python2.7/site-packages/saga/base.py", line 104, in __init__
    self._init_task = self._adaptor.init_instance (adaptor_state, *args, **kwargs)
  File "/home/yren/anaconda2/envs/rad/lib/python2.7/site-packages/saga/adaptors/cpi/decorators.py", line 57, in wrap_function
    return sync_function (self, *args, **kwargs)
  File "/home/yren/anaconda2/envs/rad/lib/python2.7/site-packages/saga/adaptors/shell/shell_job.py", line 516, in init_instance
    self.initialize ()
  File "/home/yren/anaconda2/envs/rad/lib/python2.7/site-packages/saga/adaptors/shell/shell_job.py", line 633, in initialize
    raise saga.NoSuccess ("failed to run bootstrap: (%s)(%s)" % (ret, out))
NoSuccess: failed to run bootstrap: (127)(/bin/sh: .saga/adaptors/shell_job//wrapper.sh: No such file or directory
) (/home/yren/anaconda2/envs/rad/lib/python2.7/site-packages/saga/adaptors/shell/shell_job.py +633 (initialize) : raise saga.NoSuccess ("failed to run bootstrap: (%s)(%s)" % (ret, out)))
2019-02-22 13:02:39,714: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : MainThread : DEBUG : advance bulk size: 1 [False, True]
Ah, you found it, great!
But I'll be damned - that is an error which pops up now and then, and I never manage to reproduce it and track it down :( Apologies. The good news is that there is an easy workaround: on Stampede2, please run rm -r ~/.saga/. Your application should get beyond this problem, and it will not reappear anytime soon (the error seems to be triggered by some Python module update).
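The workaround amounts to clearing the SAGA adaptor cache on the remote host so that SAGA regenerates its shell wrappers on the next run. As a minimal sketch (the helper name is made up; it just wraps the rm -r ~/.saga/ step and would need to run on the remote machine):

```python
import os
import shutil

def clear_saga_cache(home=None):
    """Remove the SAGA adaptor cache directory (~/.saga), if present.
    SAGA recreates its shell wrappers there on the next session start."""
    home = home or os.path.expanduser("~")
    cache = os.path.join(home, ".saga")
    if os.path.isdir(cache):
        shutil.rmtree(cache)
    return cache
```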
Hi @andre-merzky
Thanks a lot!
Yes, removing ~/.saga/ solves the problem :D
The MPI code runs and terminates (successfully, I assume).
But here is the ERROR in the local rp.session.js-169-25.jetstream-cloud.org.yren.017949.0004.log file.
(I feel the title does not capture the issue properly. Let me know if I should start a new issue or change the title.)
2019-02-22 15:47:55,512: rp.session.js-169-25.jetstream-cloud.org.yren.017949.0004: MainProcess : MainThread : ERROR : failed to fet profile for pilot.0000
Traceback (most recent call last):
  File "/home/yren/anaconda2/envs/rad/lib/python2.7/site-packages/radical/pilot/utils/session.py", line 172, in fetch_profiles
    profiles = sandbox.list('*.prof')
  File "/home/yren/anaconda2/envs/rad/lib/python2.7/site-packages/saga/namespace/directory.py", line 243, in list
    return self._adaptor.list (pattern, flags, ttype=ttype)
  File "/home/yren/anaconda2/envs/rad/lib/python2.7/site-packages/saga/adaptors/cpi/decorators.py", line 57, in wrap_function
    return sync_function (self, *args, **kwargs)
  File "/home/yren/anaconda2/envs/rad/lib/python2.7/site-packages/saga/adaptors/shell/shell_file.py", line 473, in list
    % (ret, out))
NoSuccess: failed to list(): (2)(/bin/ls: cannot access *.prof: No such file or directory
) (/home/yren/anaconda2/envs/rad/lib/python2.7/site-packages/saga/adaptors/shell/shell_file.py +473 (list) : % (ret, out)))
I tried the solution mentioned in #1737... Nope, the error remains.
Hi Ray,
The cannot access *.prof error is a red herring, I'm afraid: the RP examples attempt to download session profiles after completion, but that can fail if the pilot did not manage to run, and thus did not produce any profiles in the first place.
Can you please check if you see any exception logged in rp.session.js-169-25.jetstream-cloud.org.yren.017949.0004/pmgr.0000.launching.0.child.log? If not, can you please check if the pilot sandbox exists on the remote host, and send a tarball of that sandbox? Thanks.
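The underlying failure mode is that shelling out to ls with an unmatched glob pattern returns exit code 2, which the shell adaptor surfaces as NoSuccess. A local sketch of a more tolerant listing step (hypothetical, not RP's actual code path) would use Python's glob, which simply yields an empty list when nothing matches:

```python
import glob
import os

def list_profiles(sandbox_dir):
    """List '*.prof' files in a pilot sandbox. Unlike shelling out to
    'ls *.prof', glob.glob returns [] when nothing matches, so a
    profile-less sandbox is a no-op rather than an error."""
    return sorted(glob.glob(os.path.join(sandbox_dir, "*.prof")))
```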
Hi @andre-merzky
Thank you so much for the feedback.
> but that can fail if the pilot did not manage to run
I think the pilot has launched the job successfully, since squeue -u $whoami has listed the program as running for about 15 minutes.
> Can you please check if you see any exception logged in rp.session.js-169-25.jetstream-cloud.org.yren.017949.0004/pmgr.0000.launching.0.child.log?
There is no obvious error in rp.session.js-169-25.jetstream-cloud.org.yren.017949.0004/pmgr.0000.launching.0.child.log (it is session 0006 now; I tried setting RADICAL_PROFILE=TRUE). It only has one WARNING. I attached the last several lines at the end.
> If not, can you please check if the pilot sandbox exists on the remote host, and send a tarball of that sandbox? Thanks.
I put the tarball here.
Thanks a ton!
2019-02-22 17:39:20,753: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : pmgr.0000.launching.0.child.idler._pilot_watcher_cb: DEBUG : saga job state: pilot.0000 Running
2019-02-22 17:39:31,015: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : pmgr.0000.launching.0.child.idler._pilot_watcher_cb: DEBUG : bulk states: [['Running']]
2019-02-22 17:39:31,246: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : pmgr.0000.launching.0.child.idler._pilot_watcher_cb: DEBUG : saga job state: pilot.0000 Running
2019-02-22 17:39:41,529: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : pmgr.0000.launching.0.child.idler._pilot_watcher_cb: DEBUG : bulk states: [['Running']]
2019-02-22 17:39:41,765: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : pmgr.0000.launching.0.child.idler._pilot_watcher_cb: DEBUG : saga job state: pilot.0000 Running
2019-02-22 17:39:54,024: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : pmgr.0000.launching.0.child.idler._pilot_watcher_cb: DEBUG : bulk states: [['Running']]
2019-02-22 17:39:54,258: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : pmgr.0000.launching.0.child.idler._pilot_watcher_cb: DEBUG : saga job state: pilot.0000 Running
2019-02-22 17:40:01,223: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : pmgr.0000.launching.0.child.subscriber._cancel_monitor_cb: DEBUG : command ignored: cancel_pilots
2019-02-22 17:40:01,223: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : pmgr.0000.launching.0.child.subscriber._pmgr_control_cb: DEBUG : launcher got {'cmd': 'cancel_pilots', 'arg': {'uids': ['pilot.0000'], 'pmgr': 'pmgr.0000'}}
2019-02-22 17:40:01,223: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : pmgr.0000.launching.0.child.subscriber._pmgr_control_cb: INFO : received pilot_cancel command (['pilot.0000'])
2019-02-22 17:40:01,235: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : pmgr.0000.launching.0.child.watch: INFO : recv message: STOP
2019-02-22 17:40:01,235: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : pmgr.0000.launching.0.child.watch: INFO : message received: STOP
2019-02-22 17:40:01,235: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : pmgr.0000.launching.0.child.watch: INFO : STOP received: STOP
2019-02-22 17:40:01,235: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : pmgr.0000.launching.0.child.watch: INFO : watcher closes
2019-02-22 17:40:01,235: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : pmgr.0000.launching.0.child.watch: INFO : put message: [pmgr.0000.launching.0.child.watch.thread] work finished
2019-02-22 17:40:01,236: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : pmgr.0000.launching.0.child.watch: DEBUG : ru_finalize_child (NOOP)
2019-02-22 17:40:01,236: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : pmgr.0000.launching.0.child.watch: DEBUG : ru_finalize_common (NOOP)
2019-02-22 17:40:01,236: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : pmgr.0000.launching.0.child.watch: INFO : put message: [pmgr.0000.launching.0.child.watch.thread] terminating
2019-02-22 17:40:01,438: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : pmgr.0000.launching.0.child.subscriber._pmgr_control_cb: DEBUG : pilot(s).need(s) cancellation ['pilot.0000']
2019-02-22 17:40:01,438: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : pmgr.0000.launching.0.child.subscriber._pmgr_control_cb: DEBUG : update cancel req: pilot.0000 1550875201.44
2019-02-22 17:40:01,452: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : MainThread : WARNING : alive check: proc invalid - stop [False - True]
2019-02-22 17:40:01,505: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : MainThread : INFO : stop pmgr.0000.launching.0.child (3707 : 3707 : MainThread) [radical.utils.process.Default.is_valid]
2019-02-22 17:40:01,511: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : MainThread : INFO : parent stops child 3707 -> 3707 [pmgr.0000.launching.0.child]
2019-02-22 17:40:01,511: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : MainThread : INFO : send message: [pmgr.0000.launching.0.child] STOP
2019-02-22 17:40:01,511: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : MainThread : WARNING : send failed ([Errno 9] Bad file descriptor) - terminate
2019-02-22 17:40:01,511: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : MainThread : DEBUG : TERM : pmgr.0000.launching.0.child unregister idler pmgr.0000.launching.0.child.idler._pilot_watcher_cb
2019-02-22 17:40:01,511: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : MainThread : INFO : parent stops child
2019-02-22 17:40:01,511: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : MainThread : DEBUG : ru_finalize_parent (NOOP)
2019-02-22 17:40:01,512: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : MainThread : DEBUG : ru_finalize_common (NOOP)
2019-02-22 17:40:01,540: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : pmgr.0000.launching.0.child.idler._pilot_watcher_cb: DEBUG : ru_finalize_child (NOOP)
2019-02-22 17:40:01,540: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : pmgr.0000.launching.0.child.idler._pilot_watcher_cb: DEBUG : ru_finalize_common (NOOP)
2019-02-22 17:40:01,540: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : pmgr.0000.launching.0.child.idler._pilot_watcher_cb: INFO : put message: [pmgr.0000.launching.0.child.idler._pilot_watcher_cb.thread] terminating
2019-02-22 17:40:01,543: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : MainThread : DEBUG : TERM : pmgr.0000.launching.0.child unregistered idler pmgr.0000.launching.0.child.idler._pilot_watcher_cb
2019-02-22 17:40:01,544: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : MainThread : DEBUG : TERM : pmgr.0000.launching.0.child unregister subscriber pmgr.0000.launching.0.child.subscriber._pmgr_control_cb
2019-02-22 17:40:01,544: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : MainThread : DEBUG : unregistered subscriber pmgr.0000.launching.0.child.subscriber._pmgr_control_cb
2019-02-22 17:40:01,569: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : MainThread : DEBUG : pilot(s).need(s) cancellation ['pilot.0000']
2019-02-22 17:40:01,570: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : MainThread : DEBUG : update cancel req: pilot.0000 1550875201.57
2019-02-22 17:40:01,570: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : MainThread : DEBUG : killing pilots: ['pilot.0000']
2019-02-22 17:40:01,570: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : MainThread : DEBUG : killing pilots: [1550875201.570118]
2019-02-22 17:40:01,570: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : MainThread : DEBUG : killing pilots: last cancel: 1550875201.57
2019-02-22 17:40:01,570: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : MainThread : DEBUG : killing pilots: check 1550875201.57 < 1550875201.57 + 120
2019-02-22 17:40:01,570: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : MainThread : DEBUG : killing pilots: alive pilot.0000
2019-02-22 17:40:02,572: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : MainThread : DEBUG : killing pilots: check 1550875202.57 < 1550875201.57 + 120
2019-02-22 17:40:02,572: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : MainThread : DEBUG : killing pilots: alive pilot.0000
Good news: the pilot and your workload actually executed correctly - it's all there as expected in the pilot and unit sandboxes!
The quick workaround for that error during termination (/bin/ls: cannot access *.prof: No such file or directory) would be to set RADICAL_PILOT_PROFILE=True. I'll look into why the error gets triggered without it. Please let me know if that setting resolves the problem.
Thanks, Andre.
Hi @andre-merzky
Thanks! Previously, I set RADICAL_PROFILE=True as you suggested in #1737. So, I should also set RADICAL_PILOT_PROFILE=True. Is that correct? I will give it a try.
You are caught in a code transition, unfortunately - the code should be backward compatible with either setting, but apparently is not. This is fixed in the upcoming release - but I would like you to stick with your current RCT stack so as not to introduce new uncertainties. Setting both variables will remove that transition problem.
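Setting both flags before the RP session is created bridges the naming transition. A minimal sketch (the helper name is made up; the two environment variables are the ones discussed above):

```python
import os

def enable_profiling(env=None):
    """Set both the old (RADICAL_PROFILE) and new (RADICAL_PILOT_PROFILE)
    profiling flags, bridging the naming transition between RCT releases.
    Call this before creating the radical.pilot session."""
    if env is None:
        env = os.environ
    for var in ("RADICAL_PROFILE", "RADICAL_PILOT_PROFILE"):
        env[var] = "True"
    return env
```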
Thanks, Andre.
Got it, thanks! Stampede2 is under maintenance today. I will try tomorrow.
Hi @andre-merzky
After the Stampede2 maintenance, the code does not run anymore. I think the maintenance broke the SAGA-generated SLURM scripts somehow.
2019-02-27 14:30:44,827: radical.saga.cpi : pmgr.0000.launching.0 : Thread-2 : DEBUG : run_sync: mkdir -p /work/06078/tg853774/stampede2/radical.pilot.sandbox/rp.session.js-169-25.jetstream-cloud.org.yren.017954.0001/pilot.0000/
2019-02-27 14:30:44,867: radical.saga.cpi : pmgr.0000.launching.0 : Thread-2 : INFO : SLURM script generated:
#!/bin/sh

#SBATCH --ntasks=68
#SBATCH --ntasks-per-node=68
#SBATCH --workdir /work/06078/tg853774/stampede2/radical.pilot.sandbox/rp.session.js-169-25.jetstream-cloud.org.yren.017954.0001/pilot.0000/
#SBATCH --output bootstrap_0.out
#SBATCH --error bootstrap_0.err
#SBATCH --partition normal
#SBATCH -J "pilot.0000"
#SBATCH --account TG-MCB090174
#SBATCH --time 01:00:00

## ENVIRONMENT
export "RADICAL_PILOT_PROFILE"="TRUE"

## EXEC
/bin/bash -l /work/06078/tg853774/stampede2/radical.pilot.sandbox/rp.session.js-169-25.jetstream-cloud.org.yren.017954.0001/bootstrap_0.sh -d 'radical.utils-0.50.3.tar.gz:saga-python-0.50.5.tar.gz:radical.pilot-0.50.22.tar.gz' -p 'pilot.0000' -s 'rp.session.js-169-25.jetstream-cloud.org.yren.017954.0001' -m 'create' -r 'local' -b 'default' -g 'default' -v '/work/06078/tg853774/stampede2/radical.pilot.sandbox/ve.xsede.stampede2_ssh.0.50.22' -y '60' -e 'module load TACC' -e 'module load intel/17.0.4' -e 'module load python/2.7.13'

2019-02-27 14:30:46,019: radical.saga.cpi : pmgr.0000.launching.0 : Thread-2 : DEBUG : run_sync: sbatch 'tmp_6ti9KT.slurm'; rm -vf 'tmp_6ti9KT.slurm'
2019-02-27 14:30:46,099: radical.saga.cpi : pmgr.0000.launching.0 : Thread-2 : DEBUG : staged/submit SLURM script (tmp_6ti9KT.slurm) (0)
2019-02-27 14:30:46,099: radical.saga.cpi : pmgr.0000.launching.0 : Thread-2 : ERROR : NoSuccess: Couldn't get job id from submitted job! sbatch output:

-----------------------------------------------------------------
 Welcome to the Stampede2 Supercomputer
-----------------------------------------------------------------

--> Submission error: please define total node count with the "-N" option
removed ‘tmp_6ti9KT.slurm’
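The rejection is Stampede2 now insisting on an explicit node count. The missing directive can be derived from the task counts already present in the generated script. A sketch of what the SLURM adaptor would need to emit (the helper is hypothetical, not SAGA's actual code):

```python
import math

def slurm_node_directive(ntasks, ntasks_per_node):
    """Derive the '--nodes' directive Stampede2 requires from the
    task counts already present in the generated batch script."""
    nodes = math.ceil(ntasks / ntasks_per_node)
    return "#SBATCH --nodes=%d" % nodes
```

For the script above (68 tasks, 68 tasks per node) this yields a single-node request.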
Ah bugger, nothing is ever easy... This needs fixing on the SAGA layer (our resource access layer), see referenced ticket. We'll try to do so quickly.
Thanks, @andre-merzky. I wish HPC clusters (all over the world) could provide a consistent API... Best, Ray
Well, our SAGA layer is that consistent API - but it means that we need to play catch-up with the underlying native APIs... There were attempts to make those consistent as well, and to a certain degree that actually worked - but they are (and will be) different enough that it matters... sigh
There is now an open PR for this issue. You can also give the saga-python branch fix/issue_706 a try and let us know if that resolves the problem.
@andre-merzky
This is probably a small issue. I tried pip install -e git://github.com/saga-project/saga-python.git@fix/issue_706#egg=saga-python-fix-706, and got: radical-pilot 0.50.22 has requirement netifaces==0.10.4, but you'll have netifaces 0.10.8 which is incompatible. I tried to downgrade netifaces, but it seems RP has a dependency on version 0.10.8. Maybe change the dependency to netifaces>=0.10.4 instead?
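The suggestion is to relax the exact pin (==0.10.4) to a minimum-version constraint (>=0.10.4), which would accept 0.10.8. The difference can be illustrated with a naive component-wise comparison (hypothetical helper; real tooling uses PEP 440 semantics and also handles pre-release tags):

```python
def version_tuple(v):
    """Split a dotted version string into a tuple of ints for comparison."""
    return tuple(int(part) for part in v.split("."))

def satisfies_minimum(installed, minimum):
    """Check a '>=' style constraint; naive sketch, numeric components only."""
    return version_tuple(installed) >= version_tuple(minimum)
```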
The code runs now. I will wait and report back soon :D Thanks a ton!
@andre-merzky
I have been waiting for over 30 minutes, and no jobs have been launched. I think SAGA hangs somehow at this step:
Welcome to Stampede2, *please* read these important system notes:
--> Stampede2 user documentation is available at:
https://portal.tacc.utexas.edu/user-guides/stampede2
--------------------- Project balances for user tg853774 ----------------------
| Name Avail SUs Expires | |
| TG-MCB090174 115740 2019-09-30 | |
------------------------ Disk quotas for user tg853774 ------------------------
| Disk Usage (GB) Limit %Used File Usage Limit %Used |
| /home1 0.1 10.0 1.40 6425 200000 3.21 |
| /work 0.7 1024.0 0.07 21145 3000000 0.70 |
| /scratch 0.0 0.0 0.00 3 0 0.00 |
-------------------------------------------------------------------------------
Tip 144 (See "module help tacc_tips" for features or how to disable)
You can open a file and jump to a particular line with:
$ vi +10 <file>
2019-03-04 11:13:11,451: radical.saga.pty : MainProcess : MainThread : DEBUG : running command shell: exec /bin/sh -i
2019-03-04 11:13:11,451: radical.saga.pty : MainProcess : MainThread : DEBUG : write: [ 143] [ 30] ( stty -echo ; exec /bin/sh -i\n)
2019-03-04 11:13:11,671: radical.saga.pty : MainProcess : MainThread : DEBUG : read : [ 143] [ 15] (login1(1001)$ $)
2019-03-04 11:13:11,671: radical.saga.pty : MainProcess : MainThread : DEBUG : flush: [ 143] [ ] (flush pty read cache)
2019-03-04 11:13:11,709: radical.saga.pty : MainProcess : MainThread : DEBUG : read : [ 143] [ 1] ($)
2019-03-04 11:13:11,709: radical.saga.pty : MainProcess : MainThread : WARNING : flush: [ 143] [ 1] (discard data : '$')
2019-03-04 11:13:11,811: radical.saga.pty : MainProcess : MainThread : DEBUG : write: [ 143] [ 152] (set HISTFILE=$HOME/.saga_history; PS1='PROMPT-$?->'; PS2=''; PROMPT_COMMAND=''; export PS1 PS2 PROMPT_COMMAND 2>&1 >/dev/null; cd $HOME 2>&1 >/dev/null\n)
2019-03-04 11:13:11,848: radical.saga.pty : MainProcess : MainThread : DEBUG : read : [ 143] [ 10] (PROMPT-0->)
2019-03-04 11:13:11,848: radical.saga.pty : MainProcess : MainThread : DEBUG : got new shell prompt
2019-03-04 11:13:11,848: radical.saga.pty : MainProcess : MainThread : DEBUG : flush: [ 143] [ ] (flush pty read cache)
2019-03-04 11:13:11,950: radical.saga.pty : MainProcess : MainThread : DEBUG : flush: [ 143] [ ] (flush pty read cache)
2019-03-04 11:13:12,052: radical.saga.pty : MainProcess : MainThread : DEBUG : run_sync: echo "WORKDIR: $WORK"
2019-03-04 11:13:12,053: radical.saga.pty : MainProcess : MainThread : DEBUG : write: [ 143] [ 22] (echo "WORKDIR: $WORK"\n)
2019-03-04 11:13:12,089: radical.saga.pty : MainProcess : MainThread : DEBUG : read : [ 143] [ 51] (WORKDIR: /work/06078/tg853774/stampede2\nPROMPT-0->)
2019-03-04 11:13:12,090: radical.saga.pty : MainProcess : MainThread : DEBUG : PTYShell del <saga.utils.pty_shell.PTYShell object at 0x7f0ea6c8cc10>
2019-03-04 11:13:12,191: radical.saga.pty : MainProcess : MainThread : DEBUG : PTYProcess del <saga.utils.pty_process.PTYProcess object at 0x7f0ea44118d0>
(current time: Mon Mar 4 11:45:51 EST 2019)
Hello @YHRen, can you log in to Stampede2 and run showq -u? This will show us whether there is a pilot job in Stampede2's queue.
Thank you!
@iparask
Thank you for following up on this.
login1(1006)$ showq -u
SUMMARY OF JOBS FOR USER: <tg853774>
ACTIVE JOBS--------------------
JOBID JOBNAME USERNAME STATE NODES REMAINING STARTTIME
================================================================================
WAITING JOBS------------------------
JOBID JOBNAME USERNAME STATE NODES WCLIMIT QUEUETIME
================================================================================
Total Jobs: 0 Active Jobs: 0 Idle Jobs: 0 Blocked Jobs: 0
@iparask
I don't know if this is useful. I removed ~/.saga/ on the remote host (Stampede2) and gave it another try.
The process still hangs at:
2019-03-04 12:45:50,020: radical.saga.pty : MainProcess : MainThread : DEBUG : PTYProcess del <saga.utils.pty_process.PTYProcess object at 0x7fc2e82e4b10>
Can you please attach the client side sandbox? Is a pilot sandbox created on Stampede2? Thanks, Andre.
Hi @andre-merzky
Thank you for following up.
Here is the sandbox on the client side.
I don't think there is a sandbox created on Stampede2, as there is no pilot sub-directory and the file umgr.0000.staging.input.0.child.out is empty.
The process hangs at PTYProcess del for more than 30 minutes, and I have to kill it with ctrl+c.
Please let me know if you need more information.
Hi @andre-merzky
Any input on what is happening on Stampede? I'm meeting with Ray and we were wondering if there was anything more that Ray can provide.
@andre-merzky ping. request status please.
This is actually fixed; we'll release it ASAP (it was in review and testing).
It will be released in v0.60, which is scheduled toward the end of this week.
The release is delayed by a week or two - we are overloaded at the moment :/ Having said that: the current devel branches for RU, RS, and RP contain that fix and should be functional (they are the release candidates).
Closed as fixed.
Might relate to issue #1348. radical-stack:
The error complains about BadParameter and "No such file or directory". The exact error traceback is attached at the end.
I checked the directory in question on Stampede2. The parent directory /work/06078/<username>/stampede2/radical.pilot.sandbox/rp.session.js<some-text>/ exists, but the child directory pilot.0000/ does not. Please let me know if you need more information.
Thank you. Best wishes, Ray