radical-cybertools / radical.owms

Tiered Resource OverlaY

Critical: Second run, tutorial part 2 fails #65

Open mturilli opened 10 years ago

mturilli commented 10 years ago

The second run of tutorial part 2 fails. After making the following change:

#strategy = session.cfg.get ('troy_strategy', troy.AUTOMATIC)
strategy = 'basic_early_binding'

the following command fails:

python tutorial_02.py config_application.json config_troy.json

Here are the relevant portions of the logs:

2014:02:27 14:52:06 MainThread   troy.logger           : [WARNING ] Could not reconnect to pilot p.0001 (None)
Merging dicts ()
2014:02:27 14:52:06 MainThread   troy.logger           : [INFO    ] staging_in input/topol.tpr < pbs+ssh://sierra.futuregrid.org / /N/u/mturilli//troy_tutorial/troy_tutorial_02_8/ / topol.tpr
2014:02:27 14:52:10 MainThread   troy.logger           : [WARNING ] Could not reconnect to pilot p.0002 (None)
Merging dicts ()
2014:02:27 14:52:10 MainThread   troy.logger           : [INFO    ] staging_in input/topol.tpr < pbs+ssh://sierra.futuregrid.org / /N/u/mturilli//troy_tutorial/troy_tutorial_02_9/ / topol.tpr
2014:02:27 14:52:12 MainThread   troy.logger           : [WARNING ] Could not reconnect to pilot p.0003 (None)
Merging dicts ()
2014:02:27 14:52:12 MainThread   troy.logger           : [INFO    ] staging_in input/topol.tpr < pbs+ssh://sierra.futuregrid.org / /N/u/mturilli//troy_tutorial/troy_tutorial_02_2/ / topol.tpr
2014:02:27 14:52:14 MainThread   troy.logger           : [WARNING ] Could not reconnect to pilot p.0001 (None)
Merging dicts ()
2014:02:27 14:52:14 MainThread   troy.logger           : [INFO    ] staging_in input/topol.tpr < pbs+ssh://sierra.futuregrid.org / /N/u/mturilli//troy_tutorial/troy_tutorial_02_3/ / topol.tpr
2014:02:27 14:52:16 MainThread   troy.logger           : [WARNING ] Could not reconnect to pilot p.0002 (None)
Merging dicts ()
2014:02:27 14:52:16 MainThread   troy.logger           : [INFO    ] staging_in input/topol.tpr < pbs+ssh://sierra.futuregrid.org / /N/u/mturilli//troy_tutorial/troy_tutorial_02_4/ / topol.tpr
2014:02:27 14:52:18 MainThread   troy.logger           : [WARNING ] Could not reconnect to pilot p.0003 (None)
Merging dicts ()
2014:02:27 14:52:18 MainThread   troy.logger           : [INFO    ] staging_in input/topol.tpr < pbs+ssh://sierra.futuregrid.org / /N/u/mturilli//troy_tutorial/troy_tutorial_02_5/ / topol.tpr
2014:02:27 14:52:20 MainThread   troy.logger           : [WARNING ] Could not reconnect to pilot p.0001 (None)
Merging dicts ()
2014:02:27 14:52:20 MainThread   troy.logger           : [INFO    ] staging_in input/topol.tpr < pbs+ssh://sierra.futuregrid.org / /N/u/mturilli//troy_tutorial/troy_tutorial_02_0/ / topol.tpr
2014:02:27 14:52:21 MainThread   troy.logger           : [WARNING ] Could not reconnect to pilot p.0002 (None)
Merging dicts ()
2014:02:27 14:52:21 MainThread   troy.logger           : [INFO    ] staging_in input/topol.tpr < pbs+ssh://sierra.futuregrid.org / /N/u/mturilli//troy_tutorial/troy_tutorial_02_1/ / topol.tpr
2014:02:27 14:52:23 MainThread   troy.logger           : [WARNING ] Could not reconnect to pilot p.0003 (None)
Merging dicts ()
2014:02:27 14:52:23 MainThread   troy.logger           : [INFO    ] staging_in input/topol.tpr < pbs+ssh://sierra.futuregrid.org / /N/u/mturilli//troy_tutorial/troy_tutorial_02_6/ / topol.tpr
2014:02:27 14:52:25 MainThread   troy.logger           : [WARNING ] Could not reconnect to pilot p.0001 (None)
Merging dicts ()
2014:02:27 14:52:25 MainThread   troy.logger           : [INFO    ] staging_in input/topol.tpr < pbs+ssh://sierra.futuregrid.org / /N/u/mturilli//troy_tutorial/troy_tutorial_02_7/ / topol.tpr
Merging dicts ()
Merging dicts ()
Merging dicts ()
2014:02:27 14:52:27 MainThread   troy.logger           : [INFO    ] overlay  provision: provision   pilot  p.0001 : pbs+ssh://sierra.futuregrid.org 
added username mturilli @ sierra.futuregrid.org
2014:02:27 14:52:29 MainThread   troy.logger           : [INFO    ] overlay  provision: provisioned pilot  <class 'troy.overlay.pilot.Pilot'> {'native_description': <sagapilot.compute_pilot_description.ComputePilotDescription object at 0x10c1441d0>, 'uid': '530f977cd8b1954664621c9a', 'unit_managers': [], 'resource_detail': {'cores_per_node': None, 'nodes': None}, 'native_id': ['530f977bd8b1954664621c98', '530f977cd8b1954664621c99', '530f977cd8b1954664621c9a'], 'session': <troy.session.Session object at 0x10b433710>, 'timed_type': 'troy.Pilot', 'id': 'p.0001', 'size': 4, 'unit_ids': [], 'log': [u'Pilot Job submission failed:\n [Errno 10] No child processes'], 'overlay': <troy.overlay.overlay.Overlay object at 0x10b603ad0>, 'state': 'Failed', 'stop_time': None, 'timed_events_known': ["timed_create ['p.0001']", "submission ['sagapilot']", "state_detail ['sagapilot', u'Pilot Job submission failed:\\n [Errno 10] No child processes']"], 'nodes': None, 'walltime': 600, 'native_resource': 'sierra.futuregrid.org', 'description': <troy.overlay.pilot_description.PilotDescription object at 0x10c144510>, 'timed_durations_known': [], 'cores_per_node': None, 'start_time': None, 'timed_events': [{'tags': ['p.0001'], 'event': 'timed_create', 'time': datetime.datetime(2014, 2, 27, 19, 52, 6, 129231)}, {'tags': ['sagapilot'], 'event': 'submission', 'time': datetime.datetime(2014, 2, 27, 19, 52, 28, 565000)}, {'tags': ['sagapilot', u'Pilot Job submission failed:\n [Errno 10] No child processes'], 'event': 'state_detail', 'time': None}], 'submission_time': datetime.datetime(2014, 2, 27, 19, 52, 28, 565000), 'timed_components': {}, 'timed_id': 'p.0001', 'resource': u'pbs+ssh://sierra.futuregrid.org', 'pilot_manager': <sagapilot.pilot_manager.PilotManager object at 0x10c1961d0>, 'queue': None, 'sandbox': '/N/u/mturilli//troy_agents/', 'cores': 4, 'runtime': 600, 'timed_durations': []} : {'state': u'Failed', 'resource': u'sierra.futuregrid.org', 'uid': '530f977cd8b1954664621c9a', 
'submission_time': datetime.datetime(2014, 2, 27, 19, 52, 28, 565000), 'stop_time': None, 'start_time': None, 'resource_detail': {'cores_per_node': None, 'nodes': None}, 'sandbox': u'sftp://sierra.futuregrid.org/N/u/mturilli/troy_agents/pilot-530f977cd8b1954664621c9a/', 'log': [u'Pilot Job submission failed:\n [Errno 10] No child processes']} (pbs+ssh://sierra.futuregrid.org)
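
For anyone triaging a log like the one above, the following is a minimal sketch of how the reconnect warnings could be tallied per pilot. It is not part of troy or sagapilot; the `summarize` function and both regular expressions are hypothetical helpers that assume the line layout shown in this log (timestamp, thread, logger name, bracketed level, message).

```python
import re
from collections import Counter

# Assumed layout, taken from the log lines in this report:
# 2014:02:27 14:52:06 MainThread   troy.logger   : [WARNING ] message
LOG_LINE = re.compile(
    r"^(?P<date>\d{4}:\d{2}:\d{2}) (?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<thread>\S+)\s+(?P<logger>\S+)\s+: "
    r"\[(?P<level>[A-Z]+)\s*\] (?P<msg>.*)$"
)

# Pilot ids appear in the log as p.0001, p.0002, ...
PILOT_ID = re.compile(r"\bp\.\d{4}\b")


def summarize(lines):
    """Count log lines per level and reconnect warnings per pilot id.

    Lines that do not match the assumed layout (for example
    'Merging dicts ()') are skipped.
    """
    levels = Counter()
    reconnect_failures = Counter()
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match is None:
            continue
        level = match.group("level")
        msg = match.group("msg")
        levels[level] += 1
        if level == "WARNING" and "Could not reconnect to pilot" in msg:
            pilot = PILOT_ID.search(msg)
            if pilot is not None:
                reconnect_failures[pilot.group()] += 1
    return levels, reconnect_failures
```

Passing the lines of a saved log file (for example `summarize(open(path))`, where `path` is wherever the output was captured) returns two counters: one keyed by log level and one keyed by pilot id.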

andre-merzky commented 10 years ago

I think this is related to the problem Ole saw in sagapilot -- I'll report back once that is addressed.