radical-cybertools / radical.owms

Tiered Resource OverlaY

Tutorial 01 fails on stampede #71

Open mturilli opened 10 years ago

mturilli commented 10 years ago

Opened a ticket with a detailed description of the issue on sagapilot: https://github.com/saga-project/saga-pilot/issues/98

mturilli commented 10 years ago

Here I get a new error after running it from my laptop towards stampede (the previous reports were from sierra towards stampede):

added username mturilli @ stampede.tacc.utexas.edu
2014:03:06 18:14:00 MainThread   troy.logger           : [CRITICAL] strategy execution failed: [Errno 24] Too many open files
2014:03:06 18:14:00 MainThread   troy.logger           : [WARNING ] shutting down workload: wl.0001
2014:03:06 18:14:00 MainThread   troy.logger           : [WARNING ] shutting down overlay: ol.0001
Exception in thread PMWThread-5319013378f3cbfbff781c98 (most likely raised during interpreter shutdown):
Exception in thread UMWThread-5319012e78f3cbfbff781c8b (most likely raised during interpreter shutdown):
Exception in thread UMWThread-5319013478f3cbfbff781c9a (most likely raised during interpreter shutdown):
Exception in thread UMWThread-5319012878f3cbfbff781c7c (most likely raised during interpreter shutdown):
Exception in thread PMWThread-5319012778f3cbfbff781c7a (most likely raised during interpreter shutdown):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
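The `[Errno 24] Too many open files` above is EMFILE: the process has hit its per-process file-descriptor limit (each ssh connection holds at least one descriptor). A quick way to check how much headroom the client host allows is the standard `resource` module; the helper names below are illustrative, not part of TROY:

```python
# Inspect the per-process open-file limit behind "[Errno 24] Too many
# open files" (EMFILE).  Helper names here are hypothetical, not TROY API.
import resource

def nofile_limits():
    """Return the (soft, hard) RLIMIT_NOFILE limits for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)

def headroom(open_fds):
    """How many more descriptors can be opened before EMFILE strikes."""
    soft, _hard = nofile_limits()
    return soft - open_fds

soft, hard = nofile_limits()
print("soft limit: %d, hard limit: %d" % (soft, hard))
```

On many laptops the soft limit defaults to 256 or 1024, so 64 tasks each holding a couple of ssh channels can exhaust it quickly.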

Note the 'too many open files'. It looks like a problem with the strategy plugin. Here is my config_application.json:

# config_application.json
{
    # variables we want to vary for each experiment run
    "steps"            : 256,
    "bag_size"         : 64,

    # build up a unique session id from those variables.  This
    # ID will be used by troy to identify this run
    "session_id"       : "w-0_bag-64_run-0_gromacs_%(steps)s",

    # We add some additional, app-specific information to the
    # troy resource configuration, so that we can use placeholders
    # like '%(mdrun)s' in our workload descriptions.
    # This section *must* be named `resources`.
    "resources" : {
        #"*.futuregrid.org" : {
        #    "username"     : "mturilli",
        #    "mdrun"        : "/N/u/marksant/bin/mdrun"
        #},
        "stampede.*" : {
            "username"     : "mturilli",
            "home"         : "/home1/02855/mturilli",
            "mdrun"        : "/home1/01740/marksant/bin/mdrun"
        },
        # localhost has mdrun in path
        "localhost" : {
            "mdrun"        : "mdrun"
        }
    }
}
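The `%(mdrun)s`-style placeholders in these files are plain Python %-formatting with named keys, presumably expanded against the matching `resources` entry plus per-run variables. A minimal sketch of that expansion, using the values from the `"stampede.*"` entry above (this is not TROY's own code):

```python
# Sketch of %(key)s placeholder expansion, as used in the config above.
# The dict mimics the "stampede.*" resource entry; expansion is plain
# Python %-formatting against it.
resource_cfg = {
    "username": "mturilli",
    "home":     "/home1/02855/mturilli",
    "mdrun":    "/home1/01740/marksant/bin/mdrun",
}

executable = "%(mdrun)s" % resource_cfg
workdir    = "%(home)s/AIMES-SC2014-experiments/" % resource_cfg

print(executable)  # /home1/01740/marksant/bin/mdrun
print(workdir)     # /home1/02855/mturilli/AIMES-SC2014-experiments/
```

A missing key raises `KeyError`, which is why the resource entry for each target host has to define every placeholder the workload uses.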

From the logs, I see that it does create 64 tasks as instructed:

{'stdout': None, 
'tag': None, 
u'working_directory': '%(home)s/AIMES-SC2014-experiments/w-0_bag-64_run-0_task-%(cardinal)s/', 
'timed_type': 'troy.Task', 'id': 't.0001', 
'executable': '%(mdrun)s', 
'state': 'Described', 
'cardinal': 63, 
'arguments': [], 
'units': {}, 
'timed_events_known': ['timed_create [t.0001] []'], 
'walltime': 0.0, 'timed_events': [{'time': datetime.datetime(2014, 3, 6, 23, 13, 40, 176559),
'event': 'timed_create', 'name': 't.0001', 'tags': []}], 
'inputs': [u'input/topol.tpr > topol.tpr'], 
'description': <saga.attributes.Attributes object at 0x1065262d0>, 
'timed_durations_known': [], 
'outputs': [u'output/w-0_bag-64_run-0_gromacs_256_state.cpt.63   < state.cpt', u'output/w-0_bag-64_run-0_gromacs_256_confout.gro.63 < confout.gro', u'output/w-0_bag-64_run-0_gromacs_256_ener.edr.63    < ener.edr', u'output/w-0_bag-64_run-0_gromacs_256_traj.trr.63    < traj.trr', u'output/w-0_bag-64_run-0_gromacs_256_md.log.63      < md.log'], 
'timed_durations': [], 'cardinality': 1, 
'timed_id': 't.0001', 'workload': <saga.attributes.Attributes object at 0x106526310>, 'timed_components': {}, 
'stdin': None, 
'session': <troy.session.Session object at 0x10576c710>, 
'cores': 1}
mturilli commented 10 years ago

For completeness, here is the relevant bit of config_troy.json:

    # frequently changing variables
    "hosts"         : "slurm+ssh://stampede.tacc.utexas.edu",
    "pilot_size"    : "1",
    "concurrency"   : "100",
    "pilot_backend" : "sagapilot",
    "troy_strategy" : "basic_late_binding",

And the workload.json:

{
  "tasks" : [
    {
      "cardinality"       : "%(bag_size)s",
      "executable"        : "%(mdrun)s",
      "working_directory" : "%(home)s/AIMES-SC2014-experiments/w-0_bag-64_run-0_task-%(cardinal)s/",
      "inputs"            : ["input/topol.tpr > topol.tpr"],
      "outputs"           : ["output/%(session_id)s_state.cpt.%(cardinal)s   < state.cpt",
                             "output/%(session_id)s_confout.gro.%(cardinal)s < confout.gro",
                             "output/%(session_id)s_ener.edr.%(cardinal)s    < ener.edr",
                             "output/%(session_id)s_traj.trr.%(cardinal)s    < traj.trr",
                             "output/%(session_id)s_md.log.%(cardinal)s      < md.log"]
    }
  ]
}
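The single task description above fans out into 64 concrete tasks via `"cardinality": "%(bag_size)s"`, with `%(cardinal)s` substituted per instance (visible as `'cardinal': 63` in the task dump earlier). A sketch of that fan-out, assuming a simple loop over the cardinality (not TROY's actual implementation):

```python
# Hypothetical sketch of cardinality fan-out: one description with
# cardinality N becomes N tasks, each with its own %(cardinal)s value.
def expand(description, cardinality, cfg):
    tasks = []
    for cardinal in range(cardinality):
        env = dict(cfg, cardinal=cardinal)   # per-instance substitution env
        tasks.append({
            "executable":        description["executable"] % env,
            "working_directory": description["working_directory"] % env,
        })
    return tasks

desc = {
    "executable":        "%(mdrun)s",
    "working_directory": "w-0_bag-64_run-0_task-%(cardinal)s/",
}
tasks = expand(desc, 64, {"mdrun": "mdrun"})
print(len(tasks))                        # 64
print(tasks[63]["working_directory"])    # w-0_bag-64_run-0_task-63/
```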
mturilli commented 10 years ago

Here is the installation process:

virtualenv ~/Virtualenvs/AIMES-SC2014
. ~/Virtualenvs/AIMES-SC2014/bin/activate
pip install --upgrade -e git://github.com/saga-project/troy.git@devel#egg=troy
pip install --upgrade -e git://github.com/saga-project/saga-pilot.git@master#egg=saga-pilot
pip install --upgrade -e git://github.com/saga-project/radical.utils.git@devel#egg=radical.utils

And, finally, the command:

export TROY_VERBOSE=DEBUG
export SAGAPILOT_VERBOSE=DEBUG
date; time ./tutorial_01.py workload_gromacs.json config_application.json config_troy.json > runs/w-0_bag-64_run-0.run 2>&1
andre-merzky commented 10 years ago

Quick update:

I think we are piling up ssh connections at the saga-pilot level, due to script staging, and the pilot-spawning thread barfs. The sagapilot code does seem to call close() all right; I'll check why that is not closing the channels...
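If each staging operation opens an ssh channel and the close path is not reliably reached, descriptors leak one per task until EMFILE. The usual defensive pattern is a try/finally around each channel's lifetime; this is an illustrative mock, not saga-pilot's code:

```python
# Illustrative pattern only -- not saga-pilot code.  A mock Channel
# counts how many instances are currently open; try/finally guarantees
# release even if the transfer raises.
class Channel:
    open_count = 0
    def __init__(self):
        Channel.open_count += 1
        self.closed = False
    def close(self):
        if not self.closed:
            self.closed = True
            Channel.open_count -= 1

def stage_file(task_id):
    ch = Channel()
    try:
        pass  # file transfer would happen here
    finally:
        ch.close()  # released even on error -- no descriptor leak

for i in range(64):
    stage_file(i)
print(Channel.open_count)  # 0 -- nothing leaked after 64 stagings
```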

mturilli commented 10 years ago

The problem was uncovered - i.e. triggered, not caused - by my failing to understand that:

"pilot_size": "1"

indicates the size of a pilot in cores. This pushed TROY to try to create 64 pilots (1 for each task, 1 core each), which does not go well with the 20-job queue limit on stampede.
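The arithmetic behind that failure mode can be sketched as follows (a back-of-the-envelope check, not TROY's scheduling logic):

```python
# With pilot_size = 1 core and 64 single-core tasks, 64 pilots are
# needed -- well over stampede's 20-job queue limit.  Sizing the pilot
# to hold the whole bag brings that down to a single job.
import math

def pilots_needed(n_tasks, cores_per_task, pilot_size):
    """Minimum number of pilots to hold all task cores."""
    total_cores = n_tasks * cores_per_task
    return math.ceil(total_cores / pilot_size)

print(pilots_needed(n_tasks=64, cores_per_task=1, pilot_size=1))   # 64
print(pilots_needed(n_tasks=64, cores_per_task=1, pilot_size=64))  # 1
```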

NOTE: The debugging process surfaced other missing configuration options and issues (e.g. we need to use the concurrent module for plugin_planner_derive). Without comprehensive documentation of all the configuration options, the only way I have to use TROY is either to go back to the source or, if I need to run something in a reasonable time frame, to mercilessly bug Andre. Poor him, poor me! :)