Closed ashkurti closed 9 years ago
In the last file linked (output with debug), I see near the end:
2014:10:04 18:04:03 radical.pilot.MainProcess: [INFO ] RUN ComputeUnit '54302862e14fa2479af64377' state changed from 'Executing' to 'Done'.
[Callback]: ComputeUnit '54302862e14fa2479af64377' state: Done.
2014:10:04 18:04:03 radical.pilot.MainProcess: [INFO ] RUN ComputeUnit '54302862e14fa2479af64371' state changed from 'PendingExecution' to 'Scheduling'.
[Callback]: ComputeUnit '54302862e14fa2479af64371' state: Scheduling.
2014:10:04 18:04:03 radical.pilot.MainProcess: [INFO ] RUN ComputeUnit '54302862e14fa2479af64371' state changed from 'Scheduling' to 'Executing'.
[Callback]: ComputeUnit '54302862e14fa2479af64371' state: Executing.
2014:10:04 18:04:03 radical.pilot.MainProcess: [INFO ] RUN ComputeUnit '54302862e14fa2479af64371' state changed from 'Executing' to 'Done'.
[Callback]: ComputeUnit '54302862e14fa2479af64371' state: Done.
All CUs completed successfully!
Closing session, exiting now ...
2014:10:04 18:04:03 radical.pilot.MainProcess: [INFO ] Sent 'COMMAND_CANCEL_PILOT' command to all pilots.
2014:10:04 18:04:04 radical.pilot.MainProcess: [INFO ] ComputePilot '54302861e14fa2479af6436c' state changed from 'Active' to 'Canceled'.
[Callback]: ComputePilot '54302861e14fa2479af6436c' state: Canceled.
2014:10:04 18:04:04 radical.pilot.MainProcess: [INFO ] Sent 'COMMAND_CANCEL_PILOT' command to all pilots.
2014:10:04 18:04:04 radical.pilot.MainProcess: [DEBUG ] PilotManager.close(): PilotLauncherWorker-1 terminated.
2014:10:04 18:04:04 radical.pilot.MainProcess: [DEBUG ] Worker thread (ID: Thread-1[140377090459392]) for PilotManager 54302860e14fa2479af6436b stopped.
2014:10:04 18:04:04 radical.pilot.MainProcess: [INFO ] Closed PilotManager 54302860e14fa2479af6436b.
2014:10:04 18:04:05 radical.pilot.MainProcess: [DEBUG ] UnitManager.close(): InputFileTransferWorker-1 terminated.
2014:10:04 18:04:05 radical.pilot.MainProcess: [DEBUG ] UnitManager.close(): InputFileTransferWorker-2 terminated.
2014:10:04 18:04:05 radical.pilot.MainProcess: [DEBUG ] UnitManager.close(): OutputFileTransferWorker-1 terminated.
2014:10:04 18:04:05 radical.pilot.MainProcess: [DEBUG ] UnitManager.close(): OutputFileTransferWorker-2 terminated.
2014:10:04 18:04:05 radical.pilot.MainProcess: [DEBUG ] Worker thread (ID: Thread-3[140376700610304]) for UnitManager 54302861e14fa2479af6436d stopped.
2014:10:04 18:04:05 radical.pilot.MainProcess: [INFO ] Closed UnitManager 54302861e14fa2479af6436d.
2014:10:04 18:04:05 radical.pilot.MainProcess: [INFO ] Deleted session 5430285ee14fa2479af6436a from database.
2014:10:04 18:04:05 radical.pilot.MainProcess: [INFO ] Closed Session 5430285ee14fa2479af6436a.
and then a number of error messages which (falsely) get shown during shutdown (see https://github.com/radical-cybertools/radical.pilot/issues/310). So to me it seems that the code actually works as intended. The total time for execution seems to be about 1.5 minutes, which sounds about right.
Since you said "It does not terminate, I interrupted it after an hour.", I assume that you get no output after the last line, which would be:
<type 'exceptions.AttributeError'>: 'NoneType' object has no attribute 'error'
Is that right? If not, would you be able to pinpoint the exact line in the output where it seems to hang?
Also, I am surprised by the statement that simple_bot.py is modified from pdesc.resource = "local.localhost" to pdesc.resource = "localhost" -- this seems to indicate a version problem or an installation problem. How was radical.pilot installed? What is the output of radicalpilot-version?
Thanks!
Your assumption is right. I get no output after the last line, which is: [Callback]: ComputePilot '5430225ae14fa23fd85b64cf' state: Canceled.
But it terminates by itself if the DEBUG is activated.
[ExTASY-toolsOct2] ardita@jekyll 138% radicalpilot-version
2014:10:04 18:33:29 radical.pilot.MainProcess: [INFO ] radical.pilot version: 0.20 v0.20
In addition, the output with the debug option activated contains a complaint that the mpi module was not found, although the mpi module is loaded in every shell I open. Its version is:
[ExTASY-toolsOct2] ardita@jekyll 139% mpiexec --version
Intel(R) MPI Library for Linux* OS, 64-bit applications, Version 4.1.3 Build 20140226
Copyright (C) 2003-2014 Intel Corporation. All rights reserved.
I looked at radical-cybertools/radical.pilot#310 but I have the number of cores set at 1: pdesc.cores = 1 ...
I checked version tag v0.20, and at that tag we had the simple_bot.py resource tag as localhost, not as local.localhost. So I think this is a mismatch with the documentation version you used -- where did you look at the tutorial pages, if I may ask?
But it terminates by itself if the DEBUG is activated.
Uh, that should not happen. I'll try to reproduce this -- would you please let me know how exactly you installed radical.pilot?
Thanks!
File simple_bot.py taken from https://raw.githubusercontent.com/radical-cybertools/radical.pilot/devel/examples/tutorial/simple_bot.py and instructions at http://radicalpilot.readthedocs.org/en/latest/tutorial/simple_bot.html were followed.
I installed radical.pilot as described at step 2 of section 1 (Installation) at https://github.com/radical-cybertools/ExTASY/blob/devel/README.md
File simple_bot.py taken from https://raw.githubusercontent.com/radical-cybertools/radical.pilot/devel/examples/tutorial/simple_bot.py and instructions at http://radicalpilot.readthedocs.org/en/latest/tutorial/simple_bot.html were followed.
Ah, that explains it. The link you see on the tutorial page you mention, on readthedocs, points you to a different version of simple_bot.py -- the devel branch contains changes to the resource tags, so it does not match that tutorial version. In general, please consider the devel branch to be unstable, for all practical purposes... You should, however, always be able to refer to the master branch (which will in general be in sync with the release).
Right, I indicated these links from the beginning of the description of the issue.
In addition the issue is raised based on the devel branch since we are performing the testing of the ExTASY tools at that level ... preparing for the next release candidate ...
Would it be less confusing if we tagged the issues as ExTASY - testing at this point?
Alas, I cannot reproduce the hanging problem -- the same version and same example work fine for me. Even switched my login shell to tcsh ;)
When that code hangs and you press CTRL-C, is there anything like a python stacktrace printed on your terminal?
Right, I indicated these links from the beginning of the description of the issue.
Right again, sorry for missing that.
Would it be less confusing if we tagged the issues as ExTASY - testing at this point?
I don't mind either way -- just happy that it's not a radicalpilot level problem :D The hanging code though is an RP problem -- but I think it's ok to leave the ticket in this tracker...
Ok, I tried now after unsetting the RADICAL_PILOT_VERBOSE variable twice. It ends now by itself and the last lines are:
[Callback]: ComputeUnit '54306954e14fa25ed15d8c32' state: Scheduling.
[Callback]: ComputeUnit '54306954e14fa25ed15d8c2e' state: Done.
[Callback]: ComputeUnit '54306954e14fa25ed15d8c32' state: Done.
All CUs completed successfully!
Closing session, exiting now ...
[Callback]: ComputePilot '54306952e14fa25ed15d8c28' state: Canceled.
[ExTASY-toolsOct2] ardita@jekyll 148%
And again, setting the RADICAL_PILOT_VERBOSE environment variable to DEBUG, I get the following last lines ... that differ from what I obtained before...
[Callback]: ComputeUnit '54306a41e14fa2663b9e367b' state: Scheduling.
2014:10:04 22:45:11 radical.pilot.MainProcess: [INFO ] RUN ComputeUnit '54306a41e14fa2663b9e367b' state changed from 'Scheduling' to 'Done'.
[Callback]: ComputeUnit '54306a41e14fa2663b9e367b' state: Done.
All CUs completed successfully!
Closing session, exiting now ...
2014:10:04 22:45:11 radical.pilot.MainProcess: [INFO ] Sent 'COMMAND_CANCEL_PILOT' command to all pilots.
2014:10:04 22:45:13 radical.pilot.MainProcess: [INFO ] ComputePilot '54306a40e14fa2663b9e3675' state changed from 'Active' to 'Canceled'.
[Callback]: ComputePilot '54306a40e14fa2663b9e3675' state: Canceled.
2014:10:04 22:45:13 radical.pilot.MainProcess: [INFO ] Sent 'COMMAND_CANCEL_PILOT' command to all pilots.
2014:10:04 22:45:13 radical.pilot.MainProcess: [DEBUG ] PilotManager.close(): PilotLauncherWorker-1 terminated.
2014:10:04 22:45:13 radical.pilot.MainProcess: [DEBUG ] Worker thread (ID: Thread-1[140480773818112]) for PilotManager 54306a3fe14fa2663b9e3674 stopped.
2014:10:04 22:45:13 radical.pilot.MainProcess: [INFO ] Closed PilotManager 54306a3fe14fa2663b9e3674.
2014:10:04 22:45:13 radical.pilot.MainProcess: [DEBUG ] UnitManager.close(): InputFileTransferWorker-1 terminated.
2014:10:04 22:45:13 radical.pilot.MainProcess: [DEBUG ] UnitManager.close(): InputFileTransferWorker-2 terminated.
2014:10:04 22:45:13 radical.pilot.MainProcess: [DEBUG ] UnitManager.close(): OutputFileTransferWorker-1 terminated.
2014:10:04 22:45:13 radical.pilot.MainProcess: [DEBUG ] UnitManager.close(): OutputFileTransferWorker-2 terminated.
2014:10:04 22:45:13 radical.pilot.MainProcess: [DEBUG ] Worker thread (ID: Thread-3[140480383805184]) for UnitManager 54306a41e14fa2663b9e3676 stopped.
2014:10:04 22:45:13 radical.pilot.MainProcess: [INFO ] Closed UnitManager 54306a41e14fa2663b9e3676.
2014:10:04 22:45:14 radical.pilot.MainProcess: [INFO ] Deleted session 54306a3de14fa2663b9e3673 from database.
2014:10:04 22:45:14 radical.pilot.MainProcess: [INFO ] Closed Session 54306a3de14fa2663b9e3673.
[ExTASY-toolsOct2] ardita@jekyll 150%
Those different exit messages all look ok, really. The important line here is
Closing session, exiting now ...
After that, the debug mode will show a variety of messages which are really only for debugging, and have no bearing on the operation of the code itself (that is all done and finished).
That those debug messages vary is because the shutdown of the multithreaded pilot framework is non-deterministic: depending on what thread gets closed first, the amount and types of messages differ. That is expected.
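To illustrate why the order of those shutdown messages can change from run to run, here is a minimal sketch in plain Python. It is not radical.pilot code: the helper names `worker` and `run_and_close` are hypothetical, and the worker labels are borrowed from the log above purely as strings. Each thread notices the shared stop flag at a slightly different time, so the "terminated" lines may come out in a different order on each run, while the set of lines stays the same.

```python
import random
import threading
import time


def worker(name, stop_event, log):
    # Poll the shared stop flag at a slightly random pace, as a stand-in
    # for real work; this is what makes the exit order vary between runs.
    while not stop_event.is_set():
        time.sleep(random.uniform(0.001, 0.01))
    # list.append is atomic in CPython, so no extra lock is used here.
    log.append("%s terminated." % name)


def run_and_close(names):
    """Start one thread per name, signal shutdown, return the exit log."""
    stop_event = threading.Event()
    log = []
    threads = [threading.Thread(target=worker, args=(n, stop_event, log))
               for n in names]
    for t in threads:
        t.start()
    time.sleep(0.02)   # let the workers run for a moment
    stop_event.set()   # ask all of them to stop at once
    for t in threads:
        t.join()
    return log


if __name__ == "__main__":
    # Labels are illustrative only; they mirror names seen in the log.
    labels = ["PilotLauncherWorker-1", "InputFileTransferWorker-1",
              "OutputFileTransferWorker-1"]
    for line in run_and_close(labels):
        print(line)
```

Running this a few times should show the same three lines in varying order, which is the same effect as in the debug output.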
The only problem I saw so far is that the code seems to sometimes hang for you during that shutdown. If you see that again, please send the stack traces printed by python after you press CTRL-C -- those should show which parts of the code stopped the regular shutdown.
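If CTRL-C only shows the main thread's trace, a small helper can print the stacks of all threads. The following is a sketch using only the Python standard library; `format_all_stacks` and `install_sigint_dump` are hypothetical names, nothing here is part of radical.pilot, and it is written for a current Python 3 rather than the 2.7 used in this thread. It would have to be added to the application script (for example simple_bot.py) by hand.

```python
import signal
import sys
import threading
import traceback


def format_all_stacks():
    """Return a string with the current stack of every running thread."""
    names = {t.ident: t.name for t in threading.enumerate()}
    chunks = []
    # sys._current_frames() maps thread ids to their topmost frame.
    for ident, frame in sys._current_frames().items():
        chunks.append("Thread %s (%s):\n" % (names.get(ident, "?"), ident))
        chunks.append("".join(traceback.format_stack(frame)))
    return "".join(chunks)


def install_sigint_dump():
    """On CTRL-C, print all thread stacks, then interrupt as usual.

    Must be called from the main thread, since Python only allows
    signal handlers to be installed there.
    """
    def handler(signum, frame):
        sys.stderr.write(format_all_stacks())
        raise KeyboardInterrupt

    signal.signal(signal.SIGINT, handler)
```

Calling `install_sigint_dump()` near the top of the script, before the session is created, means that pressing CTRL-C during a hang prints one stack per thread, which shows where each worker is blocked.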
Hi Ardita, am I assuming correctly that things are working for you now after you switched to the non-devel version of radical-pilot and simple_bot.py?
I have not noticed any anomalous behaviour during this second round of ExTASY tools testing while performing the local test, although in my case I need to use pdesc.resource = "localhost" instead of pdesc.resource = "local.localhost".
However, it is not working for me when I try to execute it on Stampede. Changes to simple_bot.py:
pdesc.resource = "stampede.tacc.utexas.edu"
pdesc.project = "TG-MCB090174"
c.user_id = "ardi"
Error obtained:
2014:11:04 14:53:30 radical.pilot.MainProcess: [DEBUG ] read : [ 15] [ 55](Shared connection to stampede.tacc.utexas.edu closed.n)
2014:11:04 14:53:30 radical.pilot.MainProcess: [ERROR ] Pilot launching failed: Insufficient system resources: Insufficient system resources: process I/O failed: Shared connection to stampede.tacc.utexas.edu closed. ((Shared connection to stampede.tacc.utexas.edu closed. )) (/users/ardita/.local/lib/python2.7/site-packages/saga_python-0.14-py2.7.egg/saga/utils/pty_exceptions.py +56 (translate_exception) : e = se.NoSuccess ("Insufficient system resources: %s" % cmsg))
Traceback (most recent call last):
  File "/users/ardita/.local/lib/python2.7/site-packages/radical.pilot-0.19-py2.7.egg/radical/pilot/controller/pilot_launcher_worker.py", line 343, in run
    agent_script.copy("%s/radical-pilot-agent.py" % str(sandbox))
  File "/users/ardita/.local/lib/python2.7/site-packages/saga_python-0.14-py2.7.egg/saga/namespace/entry.py", line 273, in copy
    ret = self._adaptor.copy_self (tgt_url, flags, ttype=ttype)
  File "/users/ardita/.local/lib/python2.7/site-packages/saga_python-0.14-py2.7.egg/saga/adaptors/cpi/decorators.py", line 51, in wrap_function
    return sync_function (self, *_args, **_kwargs)
  File "/users/ardita/.local/lib/python2.7/site-packages/saga_python-0.14-py2.7.egg/saga/adaptors/shell/shell_file.py", line 1146, in copy_self
    copy_shell = self._get_copy_shell (tgt)
  File "/users/ardita/.local/lib/python2.7/site-packages/saga_python-0.14-py2.7.egg/saga/adaptors/shell/shell_file.py", line 919, in _get_copy_shell
    self.copy_shell = sups.PTYShell (tgt, self.session, self._logger)
  File "/users/ardita/.local/lib/python2.7/site-packages/saga_python-0.14-py2.7.egg/saga/utils/pty_shell.py", line 212, in __init__
    self.pty_shell = self.factory.run_shell (self.pty_info)
  File "/users/ardita/.local/lib/python2.7/site-packages/saga_python-0.14-py2.7.egg/saga/utils/pty_shell_factory.py", line 417, in run_shell
    self._initialize_pty (sh_slave, info, is_shell=True)
  File "/users/ardita/.local/lib/python2.7/site-packages/saga_python-0.14-py2.7.egg/saga/utils/pty_shell_factory.py", line 379, in _initialize_pty
    raise ptye.translate_exception (e)
NoSuccess: Insufficient system resources: Insufficient system resources: process I/O failed: Shared connection to stampede.tacc.utexas.edu closed. ((Shared connection to stampede.tacc.utexas.edu closed. )) (/users/ardita/.local/lib/python2.7/site-packages/saga_python-0.14-py2.7.egg/saga/utils/pty_exceptions.py +56 (translate_exception) : e = se.NoSuccess ("Insufficient system resources: %s" % cmsg))
I have removed everything related to radical and saga from the .local/ folders of my local Linux workstation, from which I am carrying out the testing. The bag of tasks is still failing with the following error:
2014:11:04 18:14:52 radical.pilot.MainProcess: [ERROR ] Pilot launching failed: read from process failed '[Errno 5] Input/output error' : (Connecting to stampede.tacc.utexas.edu... Couldn't read packet: Connection reset by peer ) (/users/ardita/ExTASY-tools/lib/python2.7/site-packages/saga/utils/pty_process.py +643 (read) : % (e, self.tail)))
Traceback (most recent call last):
  File "/users/ardita/ExTASY-tools/lib/python2.7/site-packages/radical/pilot/controller/pilot_launcher_worker.py", line 466, in run
    pilotjob.run()
  File "/users/ardita/ExTASY-tools/lib/python2.7/site-packages/saga/job/job.py", line 397, in run
    return self._adaptor.run (ttype=ttype)
  File "/users/ardita/ExTASY-tools/lib/python2.7/site-packages/saga/adaptors/cpi/decorators.py", line 51, in wrap_function
    return sync_function (self, *_args, **_kwargs)
  File "/users/ardita/ExTASY-tools/lib/python2.7/site-packages/saga/adaptors/slurm/slurm_job.py", line 1190, in run
    self._id = self.js._job_run (self.jd)
  File "/users/ardita/ExTASY-tools/lib/python2.7/site-packages/saga/adaptors/slurm/slurm_job.py", line 580, in _job_run
    self.shell.stage_to_remote (src=fname, tgt=tgt)
  File "/users/ardita/ExTASY-tools/lib/python2.7/site-packages/saga/utils/pty_shell.py", line 900, in stage_to_remote
    raise ptye.translate_exception (e)
NoSuccess: read from process failed '[Errno 5] Input/output error' : (Connecting to stampede.tacc.utexas.edu... Couldn't read packet: Connection reset by peer ) (/users/ardita/ExTASY-tools/lib/python2.7/site-packages/saga/utils/pty_process.py +643 (read) : % (e, self.tail)))
2014:11:04 18:14:52 radical.pilot.MainProcess: [INFO ] ComputePilot '5459178bf8cdba48d668cdcf' state changed from 'Launching' to 'Failed'.
[Callback]: ComputePilot '5459178bf8cdba48d668cdcf' state: Failed.
2014:11:04 18:14:52 radical.pilot.MainProcess: [ERROR ] pilot manager controller thread caught system exit -- forcing application shutdown
Traceback (most recent call last):
  File "/users/ardita/ExTASY-tools/lib/python2.7/site-packages/radical/pilot/controller/pilot_manager_controller.py", line 293, in run
    self.call_callbacks(pilot_id, new_state)
  File "/users/ardita/ExTASY-tools/lib/python2.7/site-packages/radical/pilot/controller/pilot_manager_controller.py", line 211, in call_callbacks
    cb(self._shared_data[pilot_id]['facade_object'](), new_state)
  File "simple_bot.py", line 24, in pilot_state_cb
    sys.exit (1)
SystemExit: 1
Execution was interrupted
Closing session, exiting now ...
Can you, on command line, try to login to stampede? Is that working w/o asking for a password?
If that is the case, please run the application once more after setting
RADICAL_PILOT_VERBOSE=DEBUG
SAGA_VERBOSE=DEBUG
and send the resulting (large) output per email.
Thanks!
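For reference, the two checks requested above (passwordless login, then a run with both debug variables set) can be scripted. This is a rough sketch under stated assumptions: the helper names `ssh_batch_command`, `debug_env` and `run_with_debug` are hypothetical, the log file name is made up, and only the two environment variable names come from this thread. `BatchMode=yes` and `ConnectTimeout` are standard OpenSSH client options; with `BatchMode=yes`, ssh fails instead of prompting for a password.

```python
import os
import subprocess
import sys


def ssh_batch_command(host, user=None):
    """Build an ssh command that fails rather than asking for a password."""
    target = "%s@%s" % (user, host) if user else host
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10",
            target, "true"]


def debug_env(base=None):
    """Copy the environment and switch on both debug log levels."""
    env = dict(os.environ if base is None else base)
    env["RADICAL_PILOT_VERBOSE"] = "DEBUG"
    env["SAGA_VERBOSE"] = "DEBUG"
    return env


def run_with_debug(script, logfile):
    """Run the script with debug logging, capturing all output in logfile."""
    with open(logfile, "wb") as out:
        return subprocess.call([sys.executable, script], env=debug_env(),
                               stdout=out, stderr=subprocess.STDOUT)


if __name__ == "__main__":
    # Exit code 0 means the login worked without any password prompt.
    rc = subprocess.call(ssh_batch_command("stampede.tacc.utexas.edu", "ardi"))
    if rc == 0:
        run_with_debug("simple_bot.py", "simple_bot.debug.log")
```

The resulting log file is the (large) output that can then be sent by email or pasted into a gist.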
Andre: In case this helps: This was discussed in today's ExTASY release meeting. Ole detected that SAGA 0.14 was being used.
Thanks for the feedback!
My last comment related to a problem verified after the call, after specifically removing everything related to radical and saga from the .local folder of the local workstation. In this new case the latest saga version has been used ... I will shortly post the large output!
I have pasted the output including the debug at https://gist.github.com/ashkurti/4a3924825b57a7f6a5d4 for the bag of tasks and https://gist.github.com/ashkurti/acb47f5d3b4f7428e4b4 for the coco-amber workflow, that presents the same problem.
And yes, I have passwordless access to Stampede:
[ExTASY-tools] ardita@tirith 125% ssh ardi@login3.stampede.tacc.utexas.edu
Last login: Tue Nov 4 13:22:56 2014 from poirot.pharm.nottingham.ac.uk
------------------------------------------------------------------------------
Welcome to the Stampede Supercomputer
Texas Advanced Computing Center, The University of Texas at Austin
------------------------------------------------------------------------------
** Unauthorized use/access is prohibited. **
If you log on to this computer system, you acknowledge your awareness
of and concurrence with the UT Austin Acceptable Use Policy. The
University will prosecute violators to the full extent of the law.
TACC Usage Policies:
http://www.tacc.utexas.edu/user-services/usage-policies/
______________________________________________________________________________
Questions and Problem Reports:
--> XD Projects: help@xsede.org (email)
--> TACC Projects: portal.tacc.utexas.edu (web)
Documentation: http://www.tacc.utexas.edu/user-services/user-guides/
User News: http://www.tacc.utexas.edu/user-services/user-news/
______________________________________________________________________________
Welcome to Stampede, *please* read these important system notes:
--> Stampede is currently running the SLURM resource manager to
schedule all compute resources. Example SLURM job scripts are
available on the system at /share/doc/slurm
To run an interactive shell, issue:
srun -p development -t 0:30:00 -n 32 --pty /bin/bash -l
To submit a batch job, issue: sbatch job.mpi
To show all queued jobs, issue: showq
To kill a queued job, issue: scancel <jobId>
See "man slurm" or the Stampede user guide for more detailed information.
--> To see all the software that is available across all compilers and
mpi stacks, issue: "module spider"
--> To see which software packages are available with your currently loaded
compiler and mpi stack, issue: "module avail"
--> Stampede has three parallel file systems: $HOME (permanent,
quota'd, backed-up) $WORK (permanent, quota'd, not backed-up) and
$SCRATCH (high-speed purged storage). The "cdw" and "cds" aliases
are provided as a convenience to change to your $WORK and $SCRATCH
directories, respectively.
______________________________________________________________________________
----------------------- Project balances for user ardi ------------------------
| Name Avail SUs Expires | Name Avail SUs Expires |
| TG-MCB090174 79220 2015-09-30 | TG-TRA140016 -81080 2015-05-06 |
-------------------------- Disk quotas for user ardi --------------------------
| Disk Usage (GB) Limit %Used File Usage Limit %Used |
| /home1 0.4 5.0 8.06 16582 150000 11.05 |
| /work 0.0 1024.0 0.00 2683 3000000 0.09 |
-------------------------------------------------------------------------------
Tip 38 (See "module help tacc_tips" for features or how to disable)
No need to retype previous commands; use Ctrl+R to search for them.
login3.stampede(1)$