radical-cybertools / ExTASY


Bag of tasks seems to fail remotely #72

Closed ashkurti closed 10 years ago

ashkurti commented 10 years ago

In the comments of the simple_bot.py file I found the link http://radicalpilot.readthedocs.org/en/latest/machconf.html#preconfigured-resources, which should explain how to create a configuration file, but it gives no instructions on how to use that configuration file with the bag of tasks example.

The instructions in the comments of simple_bot.py are not fully clear about what to modify in order to run the bag of tasks on Stampede (i.e. launching it from my local Linux workstation).

Following the instructions in the comments of simple_bot.py (I am not sure whether this is sufficient), I modified or added the following lines in simple_bot.py:

        line 52:
            c.user_id = "ardi"
        lines 80-84:
            pdesc.resource = "stampede.tacc.utexas.edu"
            pdesc.project = "TG-MCB090174"
            pdesc.runtime = 10 # minutes
            pdesc.cores = 1
            pdesc.cleanup = True

I should add that I have set up a passwordless connection to Stampede from my Linux workstation.

After adding the lines shown above, I executed the bag of tasks; the output is at https://gist.github.com/ashkurti/76a761d05ac71a6848dc.

andre-merzky commented 10 years ago

That looks in fact correct, and again finishes with

2014:10:04 18:30:54 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '54302e90e14fa24f7d66d79a' state changed from 'Executing' to 'Done'.
[Callback]: ComputeUnit  '54302e90e14fa24f7d66d79a' state: Done.
All CUs completed successfully!
Closing session, exiting now ...

etc. What is the problem you opened the ticket for? I assume it's hanging again after that? In that case, though, it would be the same problem as in the previous ticket #71, I think?

Thanks, Andre.

ashkurti commented 10 years ago

Yes, you are right, it seems to be the same as the previous ticket :)

However, as I was wondering above, it would be nice to have instructions on how to use a configuration file (for Stampede, for example) without having to modify the simple_bot.py file.

andre-merzky commented 10 years ago

By setting the pilot description's resource tag, you are using the configuration file for stampede! That resource tag is internally used to look up the correct configuration settings for the respective backend host. For normal use on XSEDE, you should not need to add any configuration files yourself.

Are there any other hosts which you need supported? You'll find our current list of available configurations here.
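
The lookup described above can be pictured with a small, self-contained sketch. This is only a toy model of the idea "a resource tag selects a preconfigured set of settings"; it is not RADICAL-Pilot's actual code, and the dictionary name, keys, and values are hypothetical placeholders.

```python
# Toy illustration only: NOT RADICAL-Pilot's real code or configuration data.
# The registry name and every key/value in it are hypothetical placeholders.
HYPOTHETICAL_RESOURCE_CONFIGS = {
    "stampede.tacc.utexas.edu": {
        "job_manager": "slurm",     # placeholder value
        "default_queue": "normal",  # placeholder value
    },
    "localhost": {
        "job_manager": "fork",      # placeholder value
        "default_queue": None,
    },
}


def lookup_resource_config(resource_tag, configs=HYPOTHETICAL_RESOURCE_CONFIGS):
    """Return the preconfigured settings registered under a resource tag."""
    try:
        return configs[resource_tag]
    except KeyError:
        known = ", ".join(sorted(configs))
        raise ValueError(
            "no configuration for %r; known tags: %s" % (resource_tag, known)
        ) from None


if __name__ == "__main__":
    # The same string that is assigned to pdesc.resource acts as the key.
    print(lookup_resource_config("stampede.tacc.utexas.edu"))
```

In this picture the user never touches a configuration file: choosing the tag is what selects the settings.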

andre-merzky commented 10 years ago

PS.: I do not know if ExTASY has its own set of configurations though -- so I am not sure if my reply above is completely applicable to your code...

ashkurti commented 10 years ago

Thanks for your comments.

By setting the pilot description's resource tag, you are using the configuration file for stampede!

Right, so in this case the configuration file for stampede is not a separate file but it is included in the simple_bot.py file ... That is fine, I was just trying to understand how this works and whether there is an alternative way of using simple_bot.py with a separate/additional configuration file.

PS.: I do not know if ExTASY has its own set of configurations though -- so I am not sure if my reply above is completely applicable to your code...

That's fine. We are testing different parts of ExTASY separately, and we use separate configuration files for the AMBER/CoCo and GROMACS/LSDMap workflows.

ashkurti commented 10 years ago

Trying again, I get a different output with DEBUG set. Here are the last lines:

[Callback]: ComputeUnit '54306d1ae14fa26dc7666cc8' state: Done.
All CUs completed successfully!
Closing session, exiting now ...
2014:10:04 22:57:45 radical.pilot.MainProcess: [INFO ] Sent 'COMMAND_CANCEL_PILOT' command to all pilots.
2014:10:04 22:57:45 radical.pilot.MainProcess: [INFO ] ComputePilot '54306d13e14fa26dc7666cbe' state changed from 'Active' to 'Canceled'.
[Callback]: ComputePilot '54306d13e14fa26dc7666cbe' state: Canceled.
2014:10:04 22:57:46 radical.pilot.MainProcess: [INFO ] Sent 'COMMAND_CANCEL_PILOT' command to all pilots.
2014:10:04 22:57:46 radical.pilot.MainProcess: [DEBUG ] PilotManager.close(): PilotLauncherWorker-1 terminated.
2014:10:04 22:57:46 radical.pilot.MainProcess: [DEBUG ] Worker thread (ID: Thread-1[139820297524992]) for PilotManager 54306d13e14fa26dc7666cbd stopped.
2014:10:04 22:57:46 radical.pilot.MainProcess: [INFO ] Closed PilotManager 54306d13e14fa26dc7666cbd.
2014:10:04 22:57:46 radical.pilot.MainProcess: [DEBUG ] UnitManager.close(): InputFileTransferWorker-1 terminated.
2014:10:04 22:57:46 radical.pilot.MainProcess: [DEBUG ] UnitManager.close(): InputFileTransferWorker-2 terminated.
2014:10:04 22:57:46 radical.pilot.MainProcess: [DEBUG ] UnitManager.close(): OutputFileTransferWorker-1 terminated.
2014:10:04 22:57:46 radical.pilot.MainProcess: [DEBUG ] UnitManager.close(): OutputFileTransferWorker-2 terminated.
2014:10:04 22:57:46 radical.pilot.MainProcess: [DEBUG ] Worker thread (ID: Thread-3[139819908855552]) for UnitManager 54306d19e14fa26dc7666cbf stopped.
2014:10:04 22:57:46 radical.pilot.MainProcess: [INFO ] Closed UnitManager 54306d19e14fa26dc7666cbf.
2014:10:04 22:57:47 radical.pilot.MainProcess: [INFO ] Deleted session 54306d11e14fa26dc7666cbc from database.
2014:10:04 22:57:47 radical.pilot.MainProcess: [INFO ] Closed Session 54306d11e14fa26dc7666cbc.

I do not understand the different behaviour but here we are ...

Anyway, should this last output be considered normal (I mean the compute pilot state changing from 'Active' to 'Canceled')?

andre-merzky commented 10 years ago

The same as in #71 applies: all messages after "Closing session, exiting now ..." are debug logs for the shutdown, and they look OK to me.
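
For anyone who wants to check a saved log rather than read it by eye, here is a minimal sketch that pulls the state transitions out of log text in the format quoted in this thread. The function name and the regular expression are my own illustration, written against the lines shown here; they are not part of RADICAL-Pilot.

```python
import re

# Matches lines such as:
#   ... ComputePilot '<id>' state changed from 'Active' to 'Canceled'.
# The pattern is derived only from the log lines quoted in this thread.
STATE_CHANGE = re.compile(
    r"(ComputeUnit|ComputePilot) '([0-9a-f]+)' "
    r"state changed from '(\w+)' to '(\w+)'"
)


def state_changes(log_text):
    """Return a list of (kind, uid, old_state, new_state) tuples."""
    return [m.groups() for m in STATE_CHANGE.finditer(log_text)]


if __name__ == "__main__":
    sample = (
        "2014:10:04 22:57:45 radical.pilot.MainProcess: [INFO ] "
        "ComputePilot '54306d13e14fa26dc7666cbe' state changed "
        "from 'Active' to 'Canceled'."
    )
    for kind, uid, old, new in state_changes(sample):
        print(kind, uid, old, "->", new)
```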

The pilot state CANCELED after ACTIVE is expected, and is triggered in simple_bot.py by this part (shortened):

        print "Waiting for CUs to complete ..."
        umgr.wait_units()
        print "All CUs completed successfully!"
[...]
        print "Closing session, exiting now ..."
        session.close()

Closing the session will send a cancel() command to all pilots, all threads and all managers.
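
The cascade described above can be sketched with a toy model. The classes below are hypothetical stand-ins written for illustration only; they do not reproduce RADICAL-Pilot's classes or internals, and they only show why a pilot that was 'Active' ends up 'Canceled' once the session is closed.

```python
# Toy model only: ToyPilot and ToySession are hypothetical names and do not
# correspond to RADICAL-Pilot's real classes.
FINAL_STATES = ("Done", "Failed", "Canceled")


class ToyPilot:
    def __init__(self, uid):
        self.uid = uid
        self.state = "Active"

    def cancel(self):
        # A pilot that already reached a final state is left untouched.
        if self.state not in FINAL_STATES:
            self.state = "Canceled"


class ToySession:
    def __init__(self):
        self.pilots = []
        self.closed = False

    def add_pilot(self, uid):
        pilot = ToyPilot(uid)
        self.pilots.append(pilot)
        return pilot

    def close(self):
        # Closing the session cancels every pilot it still owns.
        for pilot in self.pilots:
            pilot.cancel()
        self.closed = True


if __name__ == "__main__":
    session = ToySession()
    pilot = session.add_pilot("pilot-1")
    session.close()
    print(pilot.uid, pilot.state)
```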

oleweidner commented 10 years ago

If you run the script without RADICAL_PILOT_VERBOSE set, you won't see any of the confusing debug output and you'll just see the output of the simple-bot script itself.

Generally, I would only set RADICAL_PILOT_VERBOSE if something repeatedly goes wrong and you need more information.
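
As a rough sketch of the general pattern (an environment variable that switches log verbosity), the snippet below uses only Python's standard logging module. It is an assumption about how such a switch can work, not RADICAL-Pilot's actual implementation; the helper name, the accepted values, and the WARNING default are all hypothetical.

```python
import logging
import os

# Hypothetical mapping of accepted values; not taken from RADICAL-Pilot.
LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
}


def level_from_env(var_name="RADICAL_PILOT_VERBOSE",
                   default=logging.WARNING, environ=None):
    """Translate an environment variable into a logging level.

    An unset, empty, or unrecognised value falls back to the (assumed) default.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(var_name, "").strip().upper()
    return LEVELS.get(value, default)


if __name__ == "__main__":
    logging.basicConfig(level=level_from_env())
    logging.getLogger("demo").debug("only shown when the variable is DEBUG")
    print("script output is always shown")
```

With the variable unset, only the script's own print output appears, which matches the behaviour described above.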

oleweidner commented 10 years ago

Hi Ardita, I assume this example generally works for you now?

ashkurti commented 9 years ago

I have not noticed any anomalous behaviour during this second round of testing the ExTASY tools!