radical-cybertools / ExTASY

Bag of tasks fails locally #71

Closed: ashkurti closed this issue 9 years ago

ashkurti commented 9 years ago

File simple_bot.py taken from https://raw.githubusercontent.com/radical-cybertools/radical.pilot/devel/examples/tutorial/simple_bot.py and instructions at http://radicalpilot.readthedocs.org/en/latest/tutorial/simple_bot.html were followed.
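
For context, the pilot-description part of that simple_bot.py (the code around line 78 referenced below) looks roughly like this; a minimal sketch following the 0.20-era RADICAL-Pilot examples, not a verbatim copy of the file:

    import radical.pilot

    # Sketch (assumed shape, not the exact file): create a MongoDB-backed
    # session, then describe and submit a pilot on the local machine.
    session = radical.pilot.Session()          # older releases may need database_url=...
    pmgr    = radical.pilot.PilotManager(session=session)

    pdesc          = radical.pilot.ComputePilotDescription()
    pdesc.resource = "local.localhost"         # the line the modification below concerns
    pdesc.runtime  = 5                         # minutes
    pdesc.cores    = 1

    pilot = pmgr.submit_pilots(pdesc)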

  1. If simple_bot.py is not modified, the output is as shown at https://gist.github.com/ashkurti/921846ca66b94214af6a
  2. If line 78 of simple_bot.py is modified from pdesc.resource = "local.localhost" to pdesc.resource = "localhost", then the following occurs:
andre-merzky commented 9 years ago

In the last file linked (output with debug), I see near the end:

2014:10:04 18:04:03 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '54302862e14fa2479af64377' state changed from 'Executing' to 'Done'.
[Callback]: ComputeUnit  '54302862e14fa2479af64377' state: Done.
2014:10:04 18:04:03 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '54302862e14fa2479af64371' state changed from 'PendingExecution' to 'Scheduling'.
[Callback]: ComputeUnit  '54302862e14fa2479af64371' state: Scheduling.
2014:10:04 18:04:03 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '54302862e14fa2479af64371' state changed from 'Scheduling' to 'Executing'.
[Callback]: ComputeUnit  '54302862e14fa2479af64371' state: Executing.
2014:10:04 18:04:03 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '54302862e14fa2479af64371' state changed from 'Executing' to 'Done'.
[Callback]: ComputeUnit  '54302862e14fa2479af64371' state: Done.
All CUs completed successfully!
Closing session, exiting now ...
2014:10:04 18:04:03 radical.pilot.MainProcess: [INFO    ] Sent 'COMMAND_CANCEL_PILOT' command to all pilots.
2014:10:04 18:04:04 radical.pilot.MainProcess: [INFO    ] ComputePilot '54302861e14fa2479af6436c' state changed from 'Active' to 'Canceled'.
[Callback]: ComputePilot '54302861e14fa2479af6436c' state: Canceled.
2014:10:04 18:04:04 radical.pilot.MainProcess: [INFO    ] Sent 'COMMAND_CANCEL_PILOT' command to all pilots.
2014:10:04 18:04:04 radical.pilot.MainProcess: [DEBUG   ] PilotManager.close(): PilotLauncherWorker-1 terminated.
2014:10:04 18:04:04 radical.pilot.MainProcess: [DEBUG   ] Worker thread (ID: Thread-1[140377090459392]) for PilotManager 54302860e14fa2479af6436b stopped.
2014:10:04 18:04:04 radical.pilot.MainProcess: [INFO    ] Closed PilotManager 54302860e14fa2479af6436b.
2014:10:04 18:04:05 radical.pilot.MainProcess: [DEBUG   ] UnitManager.close(): InputFileTransferWorker-1 terminated.
2014:10:04 18:04:05 radical.pilot.MainProcess: [DEBUG   ] UnitManager.close(): InputFileTransferWorker-2 terminated.
2014:10:04 18:04:05 radical.pilot.MainProcess: [DEBUG   ] UnitManager.close(): OutputFileTransferWorker-1 terminated.
2014:10:04 18:04:05 radical.pilot.MainProcess: [DEBUG   ] UnitManager.close(): OutputFileTransferWorker-2 terminated.
2014:10:04 18:04:05 radical.pilot.MainProcess: [DEBUG   ] Worker thread (ID: Thread-3[140376700610304]) for UnitManager 54302861e14fa2479af6436d stopped.
2014:10:04 18:04:05 radical.pilot.MainProcess: [INFO    ] Closed UnitManager 54302861e14fa2479af6436d.
2014:10:04 18:04:05 radical.pilot.MainProcess: [INFO    ] Deleted session 5430285ee14fa2479af6436a from database.
2014:10:04 18:04:05 radical.pilot.MainProcess: [INFO    ] Closed Session 5430285ee14fa2479af6436a.

and then a number of error messages which are (erroneously) shown during shutdown (see https://github.com/radical-cybertools/radical.pilot/issues/310). So to me it seems that the code actually works as intended. The total execution time seems to be about 1.5 minutes, which sounds about right.

Since you said "It does not terminate, I interrupted it after an hour.", I assume that you get no output after the last line, which would be

<type 'exceptions.AttributeError'>: 'NoneType' object has no attribute 'error'

? If that is not the case, would you be able to pinpoint the exact line in the output where it seems to hang?

Also, I am surprised by the statement that simple_bot.py had to be modified from pdesc.resource = "local.localhost" to pdesc.resource = "localhost" -- this seems to indicate a version or installation problem. How was radical.pilot installed? What is the result of radicalpilot-version?
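
(If the CLI tool is not on the path, the installed version can also be read from the setuptools metadata; a quick alternative using only the standard pkg_resources API:)

    import pkg_resources

    # Report which radical.pilot distribution this interpreter actually sees.
    print pkg_resources.get_distribution('radical.pilot').version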

Thanks!

ashkurti commented 9 years ago

Your assumption is right. I get no output after the last line, which is:

[Callback]: ComputePilot '5430225ae14fa23fd85b64cf' state: Canceled.

But it terminates by itself if the DEBUG is activated.

[ExTASY-toolsOct2] ardita@jekyll 138% radicalpilot-version
2014:10:04 18:33:29 radical.pilot.MainProcess: [INFO    ] radical.pilot version: 0.20 v0.20

ashkurti commented 9 years ago

In addition, as reported in the output with the debug option activated, there is a complaint about the mpi module not being found, although the mpi module is loaded in every open shell. Its version is:

[ExTASY-toolsOct2] ardita@jekyll 139% mpiexec --version
Intel(R) MPI Library for Linux* OS, 64-bit applications, Version 4.1.3 Build 20140226
Copyright (C) 2003-2014 Intel Corporation. All rights reserved.

ashkurti commented 9 years ago

I looked at radical-cybertools/radical.pilot#310 but I have the number of cores set at 1: pdesc.cores = 1 ...

andre-merzky commented 9 years ago

I checked version tag v0.20, and at that tag the simple_bot.py resource tag is localhost, not local.localhost. So I think this is a mismatch between the documentation version and the code version you used -- where did you find the tutorial pages, if I may ask?

andre-merzky commented 9 years ago

But it terminates by itself if the DEBUG is activated.

Uh, that should not happen. I'll try to reproduce this -- would you please let me know how exactly you installed radical.pilot?

Thanks!

ashkurti commented 9 years ago

File simple_bot.py taken from https://raw.githubusercontent.com/radical-cybertools/radical.pilot/devel/examples/tutorial/simple_bot.py and instructions at http://radicalpilot.readthedocs.org/en/latest/tutorial/simple_bot.html were followed.

ashkurti commented 9 years ago

I installed radical.pilot as described at step 2 of section 1 (Installation) at https://github.com/radical-cybertools/ExTASY/blob/devel/README.md

andre-merzky commented 9 years ago

File simple_bot.py taken from https://raw.githubusercontent.com/radical-cybertools/radical.pilot/devel/examples/tutorial/simple_bot.py and instructions at http://radicalpilot.readthedocs.org/en/latest/tutorial/simple_bot.html were followed.

Ah, that explains it. The link you see on the tutorial page you mention, on readthedocs, points to a different version of simple_bot.py -- the devel branch contains changes to the resource tags, so it does not match that tutorial version. In general, please consider the devel branch to be unstable for all practical purposes... You should, however, always be able to refer to the master branch (which will in general be in sync with the release).
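
To make the mismatch concrete: the same pilot description needs a different resource label depending on which radical.pilot you have installed (a sketch; the two labels are the ones reported in this thread):

    import radical.pilot as rp

    pdesc = rp.ComputePilotDescription()

    # Release 0.20 (master branch, matching the readthedocs tutorial):
    pdesc.resource = "localhost"

    # devel branch at the time of this issue (namespaced resource tags):
    # pdesc.resource = "local.localhost"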

ashkurti commented 9 years ago

Right, I indicated these links from the beginning of the description of the issue.

In addition, the issue is raised against the devel branch, since we are testing the ExTASY tools at that level ... preparing for the next release candidate ...

Would it be less confusing if we tagged the issues as ExTASY - testing at this point?

andre-merzky commented 9 years ago

Alas, I cannot reproduce the hanging problem -- the same version and the same example work fine for me. I even switched my login shell to tcsh ;)

When that code hangs and you press CTRL-C, is there anything like a python stacktrace printed on your terminal?

andre-merzky commented 9 years ago

Right, I indicated these links from the beginning of the description of the issue.

Right again, sorry for missing that :blush:

Would it be less confusing if we tagged the issues as ExTASY - testing at this point?

I don't mind either way -- just happy that it's not a radical.pilot level problem :D The hanging code though is an RP problem -- but I think it's ok to leave the ticket in this tracker...

ashkurti commented 9 years ago

Ok, I have now tried twice after unsetting the RADICAL_PILOT_VERBOSE variable. It now ends by itself, and the last lines are:

[Callback]: ComputeUnit '54306954e14fa25ed15d8c32' state: Scheduling.
[Callback]: ComputeUnit '54306954e14fa25ed15d8c2e' state: Done.
[Callback]: ComputeUnit '54306954e14fa25ed15d8c32' state: Done.
All CUs completed successfully!
Closing session, exiting now ...
[Callback]: ComputePilot '54306952e14fa25ed15d8c28' state: Canceled.
[ExTASY-toolsOct2] ardita@jekyll 148%

ashkurti commented 9 years ago

And again, setting the RADICAL_PILOT_VERBOSE environment variable to DEBUG, I get the following last lines ... which differ from what I obtained before ...

[Callback]: ComputeUnit  '54306a41e14fa2663b9e367b' state: Scheduling.
2014:10:04 22:45:11 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '54306a41e14fa2663b9e367b' state changed from 'Scheduling' to 'Done'.
[Callback]: ComputeUnit  '54306a41e14fa2663b9e367b' state: Done.
All CUs completed successfully!
Closing session, exiting now ...
2014:10:04 22:45:11 radical.pilot.MainProcess: [INFO    ] Sent 'COMMAND_CANCEL_PILOT' command to all pilots.
2014:10:04 22:45:13 radical.pilot.MainProcess: [INFO    ] ComputePilot '54306a40e14fa2663b9e3675' state changed from 'Active' to 'Canceled'.
[Callback]: ComputePilot '54306a40e14fa2663b9e3675' state: Canceled.
2014:10:04 22:45:13 radical.pilot.MainProcess: [INFO    ] Sent 'COMMAND_CANCEL_PILOT' command to all pilots.
2014:10:04 22:45:13 radical.pilot.MainProcess: [DEBUG   ] PilotManager.close(): PilotLauncherWorker-1 terminated.
2014:10:04 22:45:13 radical.pilot.MainProcess: [DEBUG   ] Worker thread (ID: Thread-1[140480773818112]) for PilotManager 54306a3fe14fa2663b9e3674 stopped.
2014:10:04 22:45:13 radical.pilot.MainProcess: [INFO    ] Closed PilotManager 54306a3fe14fa2663b9e3674.
2014:10:04 22:45:13 radical.pilot.MainProcess: [DEBUG   ] UnitManager.close(): InputFileTransferWorker-1 terminated.
2014:10:04 22:45:13 radical.pilot.MainProcess: [DEBUG   ] UnitManager.close(): InputFileTransferWorker-2 terminated.
2014:10:04 22:45:13 radical.pilot.MainProcess: [DEBUG   ] UnitManager.close(): OutputFileTransferWorker-1 terminated.
2014:10:04 22:45:13 radical.pilot.MainProcess: [DEBUG   ] UnitManager.close(): OutputFileTransferWorker-2 terminated.
2014:10:04 22:45:13 radical.pilot.MainProcess: [DEBUG   ] Worker thread (ID: Thread-3[140480383805184]) for UnitManager 54306a41e14fa2663b9e3676 stopped.
2014:10:04 22:45:13 radical.pilot.MainProcess: [INFO    ] Closed UnitManager 54306a41e14fa2663b9e3676.
2014:10:04 22:45:14 radical.pilot.MainProcess: [INFO    ] Deleted session 54306a3de14fa2663b9e3673 from database.
2014:10:04 22:45:14 radical.pilot.MainProcess: [INFO    ] Closed Session 54306a3de14fa2663b9e3673.
[ExTASY-toolsOct2] ardita@jekyll 150%

andre-merzky commented 9 years ago

Those different exit messages all look ok, really. The important line here is

Closing session, exiting now ...

After that, the debug mode will show a variety of messages which are really only for debugging, and have no bearing on the operation of the code itself (that is all done and finished).

That those debug messages vary is because the shutdown of the multithreaded pilot framework is non-deterministic: depending on what thread gets closed first, the amount and types of messages differ. That is expected.

The only problem I saw so far is that the code seems to sometimes hang for you during that shutdown. If you discover that again, please send the stack traces printed by python after you press CTRL-C -- those should show which parts of the code stopped the regular shutdown.
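
If the CTRL-C traceback gets swallowed, a generic (not RP-specific) trick is to dump the stack of every live thread from a signal handler; a minimal sketch, assuming it is pasted near the top of simple_bot.py:

    import signal, sys, traceback

    def dump_all_stacks(signum, frame):
        # Print one traceback per live thread so the hang can be located.
        for thread_id, stack in sys._current_frames().items():
            print "\n--- thread %s ---" % thread_id
            traceback.print_stack(stack)

    # CTRL-\ (SIGQUIT) then triggers the dump without killing the run.
    signal.signal(signal.SIGQUIT, dump_all_stacks)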

oleweidner commented 9 years ago

Hi Ardita, am I assuming correctly that things are working for you now after you switched to the non-devel version of radical-pilot and simple-bot.py?

ashkurti commented 9 years ago

I have not noticed any anomalous behaviour during this second round of ExTASY tools testing while performing local tests, although in my case I need to use pdesc.resource = "localhost" instead of pdesc.resource = "local.localhost".

However, it is not working for me when trying to execute it on Stampede. Changes to simple_bot.py:

pdesc.resource = "stampede.tacc.utexas.edu" pdesc.project = "TG-MCB090174" c.user_id = "ardi"

Error obtained:

2014:11:04 14:53:30 radical.pilot.MainProcess: [DEBUG   ] read : [   15] [   55](Shared connection to stampede.tacc.utexas.edu closed.\n)
2014:11:04 14:53:30 radical.pilot.MainProcess: [ERROR   ] Pilot launching failed: Insufficient system resources: Insufficient system resources: process I/O failed: Shared connection to stampede.tacc.utexas.edu closed. ((Shared connection to stampede.tacc.utexas.edu closed.)) (/users/ardita/.local/lib/python2.7/site-packages/saga_python-0.14-py2.7.egg/saga/utils/pty_exceptions.py +56 (translate_exception) : e = se.NoSuccess ("Insufficient system resources: %s" % cmsg))
Traceback (most recent call last):
  File "/users/ardita/.local/lib/python2.7/site-packages/radical.pilot-0.19-py2.7.egg/radical/pilot/controller/pilot_launcher_worker.py", line 343, in run
    agent_script.copy("%s/radical-pilot-agent.py" % str(sandbox))
  File "/users/ardita/.local/lib/python2.7/site-packages/saga_python-0.14-py2.7.egg/saga/namespace/entry.py", line 273, in copy
    ret = self._adaptor.copy_self (tgt_url, flags, ttype=ttype)
  File "/users/ardita/.local/lib/python2.7/site-packages/saga_python-0.14-py2.7.egg/saga/adaptors/cpi/decorators.py", line 51, in wrap_function
    return sync_function (self, *args, **kwargs)
  File "/users/ardita/.local/lib/python2.7/site-packages/saga_python-0.14-py2.7.egg/saga/adaptors/shell/shell_file.py", line 1146, in copy_self
    copy_shell = self._get_copy_shell (tgt)
  File "/users/ardita/.local/lib/python2.7/site-packages/saga_python-0.14-py2.7.egg/saga/adaptors/shell/shell_file.py", line 919, in _get_copy_shell
    self.copy_shell = sups.PTYShell (tgt, self.session, self._logger)
  File "/users/ardita/.local/lib/python2.7/site-packages/saga_python-0.14-py2.7.egg/saga/utils/pty_shell.py", line 212, in __init__
    self.pty_shell = self.factory.run_shell (self.pty_info)
  File "/users/ardita/.local/lib/python2.7/site-packages/saga_python-0.14-py2.7.egg/saga/utils/pty_shell_factory.py", line 417, in run_shell
    self._initialize_pty (sh_slave, info, is_shell=True)
  File "/users/ardita/.local/lib/python2.7/site-packages/saga_python-0.14-py2.7.egg/saga/utils/pty_shell_factory.py", line 379, in _initialize_pty
    raise ptye.translate_exception (e)
NoSuccess: Insufficient system resources: Insufficient system resources: process I/O failed: Shared connection to stampede.tacc.utexas.edu closed. ((Shared connection to stampede.tacc.utexas.edu closed.)) (/users/ardita/.local/lib/python2.7/site-packages/saga_python-0.14-py2.7.egg/saga/utils/pty_exceptions.py +56 (translate_exception) : e = se.NoSuccess ("Insufficient system resources: %s" % cmsg))

ashkurti commented 9 years ago

I have removed everything related to radical and saga from the .local/ folders of the local Linux workstation from which I am carrying out the testing. The bag of tasks is still failing with the following error:

2014:11:04 18:14:52 radical.pilot.MainProcess: [ERROR   ] Pilot launching failed: read from process failed '[Errno 5] Input/output error' : (Connecting to stampede.tacc.utexas.edu... Couldn't read packet: Connection reset by peer) (/users/ardita/ExTASY-tools/lib/python2.7/site-packages/saga/utils/pty_process.py +643 (read) : % (e, self.tail)))
Traceback (most recent call last):
  File "/users/ardita/ExTASY-tools/lib/python2.7/site-packages/radical/pilot/controller/pilot_launcher_worker.py", line 466, in run
    pilotjob.run()
  File "/users/ardita/ExTASY-tools/lib/python2.7/site-packages/saga/job/job.py", line 397, in run
    return self._adaptor.run (ttype=ttype)
  File "/users/ardita/ExTASY-tools/lib/python2.7/site-packages/saga/adaptors/cpi/decorators.py", line 51, in wrap_function
    return sync_function (self, *args, **kwargs)
  File "/users/ardita/ExTASY-tools/lib/python2.7/site-packages/saga/adaptors/slurm/slurm_job.py", line 1190, in run
    self._id = self.js._job_run (self.jd)
  File "/users/ardita/ExTASY-tools/lib/python2.7/site-packages/saga/adaptors/slurm/slurm_job.py", line 580, in _job_run
    self.shell.stage_to_remote (src=fname, tgt=tgt)
  File "/users/ardita/ExTASY-tools/lib/python2.7/site-packages/saga/utils/pty_shell.py", line 900, in stage_to_remote
    raise ptye.translate_exception (e)
NoSuccess: read from process failed '[Errno 5] Input/output error' : (Connecting to stampede.tacc.utexas.edu... Couldn't read packet: Connection reset by peer) (/users/ardita/ExTASY-tools/lib/python2.7/site-packages/saga/utils/pty_process.py +643 (read) : % (e, self.tail)))

2014:11:04 18:14:52 radical.pilot.MainProcess: [INFO    ] ComputePilot '5459178bf8cdba48d668cdcf' state changed from 'Launching' to 'Failed'.
[Callback]: ComputePilot '5459178bf8cdba48d668cdcf' state: Failed.
2014:11:04 18:14:52 radical.pilot.MainProcess: [ERROR   ] pilot manager controller thread caught system exit -- forcing application shutdown
Traceback (most recent call last):
  File "/users/ardita/ExTASY-tools/lib/python2.7/site-packages/radical/pilot/controller/pilot_manager_controller.py", line 293, in run
    self.call_callbacks(pilot_id, new_state)
  File "/users/ardita/ExTASY-tools/lib/python2.7/site-packages/radical/pilot/controller/pilot_manager_controller.py", line 211, in call_callbacks
    cb(self._shared_data[pilot_id]['facade_object'](), new_state)
  File "simple_bot.py", line 24, in pilot_state_cb
    sys.exit (1)
SystemExit: 1
Execution was interrupted
Closing session, exiting now ...

andre-merzky commented 9 years ago

Can you, on the command line, try to log in to stampede? Does that work without asking for a password?
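
(A quick non-interactive way to verify that from the same workstation; generic ssh behaviour, not RP-specific -- BatchMode makes ssh fail instead of prompting:)

    import subprocess

    # Exit code 0 means key-based login works; anything else means ssh
    # would have had to prompt (password, passphrase, host key, ...).
    rc = subprocess.call(['ssh', '-o', 'BatchMode=yes',
                          'ardi@stampede.tacc.utexas.edu', 'true'])
    print "passwordless login OK" if rc == 0 else "login needs interaction"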

If that is the case, please run the application once more after setting

RADICAL_PILOT_VERBOSE=DEBUG
SAGA_VERBOSE=DEBUG

and send the resulting (large) output by email.
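
(If exporting them in the shell is awkward, the variables can also be set at the very top of the script, before the radical.pilot and saga imports; a sketch, assuming the loggers read them at initialization time:)

    import os

    # Assumption: both libraries pick these up when their loggers are
    # initialized, so they must be set before the imports below.
    os.environ['RADICAL_PILOT_VERBOSE'] = 'DEBUG'
    os.environ['SAGA_VERBOSE']          = 'DEBUG'

    import radical.pilot as rp
    import saga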

Thanks!

shantenujha commented 9 years ago

Andre: In case this helps: This was discussed in today's ExTASY release meeting. Ole detected that SAGA 0.14 was being used.

andre-merzky commented 9 years ago

Thanks for the feedback!

ashkurti commented 9 years ago

My last comment relates to a problem that occurred after the call, after specifically removing everything related to radical and saga from the .local folder of the local workstation. In this new case the latest saga version has been used ... I will post the large output shortly!

ashkurti commented 9 years ago

I have pasted the output including the debug at https://gist.github.com/ashkurti/4a3924825b57a7f6a5d4 for the bag of tasks and https://gist.github.com/ashkurti/acb47f5d3b4f7428e4b4 for the coco-amber workflow, that presents the same problem.

And yes, I have passwordless access to stampede:

[ExTASY-tools] ardita@tirith 125% ssh ardi@login3.stampede.tacc.utexas.edu
Last login: Tue Nov  4 13:22:56 2014 from poirot.pharm.nottingham.ac.uk
------------------------------------------------------------------------------
                   Welcome to the Stampede Supercomputer
      Texas Advanced Computing Center, The University of Texas at Austin
------------------------------------------------------------------------------

              ** Unauthorized use/access is prohibited. **

If you log on to this computer system, you acknowledge your awareness
of and concurrence with the UT Austin Acceptable Use Policy. The
University will prosecute violators to the full extent of the law.

TACC Usage Policies:
http://www.tacc.utexas.edu/user-services/usage-policies/
______________________________________________________________________________

Questions and Problem Reports:

--> XD Projects:     help@xsede.org (email)
--> TACC Projects:   portal.tacc.utexas.edu (web)

Documentation:  http://www.tacc.utexas.edu/user-services/user-guides/
User News:      http://www.tacc.utexas.edu/user-services/user-news/
______________________________________________________________________________

Welcome to Stampede, *please* read these important system notes:

--> Stampede is currently running the SLURM resource manager to
    schedule all compute resources. Example SLURM job scripts are
    available on the system at /share/doc/slurm

    To run an interactive shell, issue:
          srun -p development -t 0:30:00 -n 32 --pty /bin/bash -l

    To submit a batch job, issue:       sbatch job.mpi
    To show all queued jobs, issue:     showq
    To kill a queued job, issue:        scancel <jobId>

    See "man slurm" or the Stampede user guide for more detailed information.

--> To see all the software that is available across all compilers and
    mpi stacks, issue: "module spider"

--> To see which software packages are available with your currently loaded
    compiler and mpi stack, issue: "module avail"

--> Stampede has three parallel file systems: $HOME (permanent,
    quota'd, backed-up) $WORK (permanent, quota'd, not backed-up) and
    $SCRATCH (high-speed purged storage). The "cdw" and "cds" aliases
    are provided as a convenience to change to your $WORK and $SCRATCH
    directories, respectively.
______________________________________________________________________________

----------------------- Project balances for user ardi ------------------------
| Name           Avail SUs     Expires | Name           Avail SUs     Expires |
| TG-MCB090174       79220  2015-09-30 | TG-TRA140016      -81080  2015-05-06 |
-------------------------- Disk quotas for user ardi --------------------------
| Disk         Usage (GB)     Limit    %Used   File Usage       Limit   %Used |
| /home1              0.4       5.0     8.06        16582      150000   11.05 |
| /work               0.0    1024.0     0.00         2683     3000000    0.09 |
-------------------------------------------------------------------------------

Tip 38   (See "module help tacc_tips" for features or how to disable)

   No need to retype previous commands; use Ctrl+R to search for them.

login3.stampede(1)$