Open mturilli opened 10 years ago
The second run of part2 fails. After making the following change in
#strategy = session.cfg.get ('troy_strategy', troy.AUTOMATIC)
strategy = 'basic_early_binding'
the following command fails:
python tutorial_02.py config_application.json config_troy.json
Here are the relevant portions of the logs:
2014:02:27 14:52:06 MainThread troy.logger : [WARNING ] Could not reconnect to pilot p.0001 (None)
Merging dicts ()
2014:02:27 14:52:06 MainThread troy.logger : [INFO ] staging_in input/topol.tpr < pbs+ssh://sierra.futuregrid.org / /N/u/mturilli//troy_tutorial/troy_tutorial_02_8/ / topol.tpr
2014:02:27 14:52:10 MainThread troy.logger : [WARNING ] Could not reconnect to pilot p.0002 (None)
Merging dicts ()
2014:02:27 14:52:10 MainThread troy.logger : [INFO ] staging_in input/topol.tpr < pbs+ssh://sierra.futuregrid.org / /N/u/mturilli//troy_tutorial/troy_tutorial_02_9/ / topol.tpr
2014:02:27 14:52:12 MainThread troy.logger : [WARNING ] Could not reconnect to pilot p.0003 (None)
Merging dicts ()
2014:02:27 14:52:12 MainThread troy.logger : [INFO ] staging_in input/topol.tpr < pbs+ssh://sierra.futuregrid.org / /N/u/mturilli//troy_tutorial/troy_tutorial_02_2/ / topol.tpr
2014:02:27 14:52:14 MainThread troy.logger : [WARNING ] Could not reconnect to pilot p.0001 (None)
Merging dicts ()
2014:02:27 14:52:14 MainThread troy.logger : [INFO ] staging_in input/topol.tpr < pbs+ssh://sierra.futuregrid.org / /N/u/mturilli//troy_tutorial/troy_tutorial_02_3/ / topol.tpr
2014:02:27 14:52:16 MainThread troy.logger : [WARNING ] Could not reconnect to pilot p.0002 (None)
Merging dicts ()
2014:02:27 14:52:16 MainThread troy.logger : [INFO ] staging_in input/topol.tpr < pbs+ssh://sierra.futuregrid.org / /N/u/mturilli//troy_tutorial/troy_tutorial_02_4/ / topol.tpr
2014:02:27 14:52:18 MainThread troy.logger : [WARNING ] Could not reconnect to pilot p.0003 (None)
Merging dicts ()
2014:02:27 14:52:18 MainThread troy.logger : [INFO ] staging_in input/topol.tpr < pbs+ssh://sierra.futuregrid.org / /N/u/mturilli//troy_tutorial/troy_tutorial_02_5/ / topol.tpr
2014:02:27 14:52:20 MainThread troy.logger : [WARNING ] Could not reconnect to pilot p.0001 (None)
Merging dicts ()
2014:02:27 14:52:20 MainThread troy.logger : [INFO ] staging_in input/topol.tpr < pbs+ssh://sierra.futuregrid.org / /N/u/mturilli//troy_tutorial/troy_tutorial_02_0/ / topol.tpr
2014:02:27 14:52:21 MainThread troy.logger : [WARNING ] Could not reconnect to pilot p.0002 (None)
Merging dicts ()
2014:02:27 14:52:21 MainThread troy.logger : [INFO ] staging_in input/topol.tpr < pbs+ssh://sierra.futuregrid.org / /N/u/mturilli//troy_tutorial/troy_tutorial_02_1/ / topol.tpr
2014:02:27 14:52:23 MainThread troy.logger : [WARNING ] Could not reconnect to pilot p.0003 (None)
Merging dicts ()
2014:02:27 14:52:23 MainThread troy.logger : [INFO ] staging_in input/topol.tpr < pbs+ssh://sierra.futuregrid.org / /N/u/mturilli//troy_tutorial/troy_tutorial_02_6/ / topol.tpr
2014:02:27 14:52:25 MainThread troy.logger : [WARNING ] Could not reconnect to pilot p.0001 (None)
Merging dicts ()
2014:02:27 14:52:25 MainThread troy.logger : [INFO ] staging_in input/topol.tpr < pbs+ssh://sierra.futuregrid.org / /N/u/mturilli//troy_tutorial/troy_tutorial_02_7/ / topol.tpr
Merging dicts ()
Merging dicts ()
Merging dicts ()
2014:02:27 14:52:27 MainThread troy.logger : [INFO ] overlay provision: provision pilot p.0001 : pbs+ssh://sierra.futuregrid.org
added username mturilli @ sierra.futuregrid.org
2014:02:27 14:52:29 MainThread troy.logger : [INFO ] overlay provision: provisioned pilot <class 'troy.overlay.pilot.Pilot'> {'native_description': <sagapilot.compute_pilot_description.ComputePilotDescription object at 0x10c1441d0>, 'uid': '530f977cd8b1954664621c9a', 'unit_managers': [], 'resource_detail': {'cores_per_node': None, 'nodes': None}, 'native_id': ['530f977bd8b1954664621c98', '530f977cd8b1954664621c99', '530f977cd8b1954664621c9a'], 'session': <troy.session.Session object at 0x10b433710>, 'timed_type': 'troy.Pilot', 'id': 'p.0001', 'size': 4, 'unit_ids': [], 'log': [u'Pilot Job submission failed:\n [Errno 10] No child processes'], 'overlay': <troy.overlay.overlay.Overlay object at 0x10b603ad0>, 'state': 'Failed', 'stop_time': None, 'timed_events_known': ["timed_create ['p.0001']", "submission ['sagapilot']", "state_detail ['sagapilot', u'Pilot Job submission failed:\\n [Errno 10] No child processes']"], 'nodes': None, 'walltime': 600, 'native_resource': 'sierra.futuregrid.org', 'description': <troy.overlay.pilot_description.PilotDescription object at 0x10c144510>, 'timed_durations_known': [], 'cores_per_node': None, 'start_time': None, 'timed_events': [{'tags': ['p.0001'], 'event': 'timed_create', 'time': datetime.datetime(2014, 2, 27, 19, 52, 6, 129231)}, {'tags': ['sagapilot'], 'event': 'submission', 'time': datetime.datetime(2014, 2, 27, 19, 52, 28, 565000)}, {'tags': ['sagapilot', u'Pilot Job submission failed:\n [Errno 10] No child processes'], 'event': 'state_detail', 'time': None}], 'submission_time': datetime.datetime(2014, 2, 27, 19, 52, 28, 565000), 'timed_components': {}, 'timed_id': 'p.0001', 'resource': u'pbs+ssh://sierra.futuregrid.org', 'pilot_manager': <sagapilot.pilot_manager.PilotManager object at 0x10c1961d0>, 'queue': None, 'sandbox': '/N/u/mturilli//troy_agents/', 'cores': 4, 'runtime': 600, 'timed_durations': []} : {'state': u'Failed', 'resource': u'sierra.futuregrid.org', 'uid': '530f977cd8b1954664621c9a', 'submission_time': datetime.datetime(2014, 2, 27, 19, 52, 28, 565000), 'stop_time': None, 'start_time': None, 'resource_detail': {'cores_per_node': None, 'nodes': None}, 'sandbox': u'sftp://sierra.futuregrid.org/N/u/mturilli/troy_agents/pilot-530f977cd8b1954664621c9a/', 'log': [u'Pilot Job submission failed:\n [Errno 10] No child processes']} (pbs+ssh://sierra.futuregrid.org)
I think this is related to the problem Ole saw in sagapilot -- I'll report back once that is addressed.