radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

logging in to bluewaters #983

Closed euhruska closed 8 years ago

euhruska commented 8 years ago

I seem to have problems using radical.pilot with bluewaters. It fails with an error that looks like it can't connect, but I can connect to bluewaters as usual with gsissh hruska@bw.ncsa.illinois.edu, although that prints "/Users/eh/Library/Globus/etc/ssh not found." on the first line before connecting. My $GLOBUS_LOCATION is /Users/eh/Library/Globus. This is the error message when running radical.pilot:

================================================================================
 EnsembleMD (0.3.14)
================================================================================

Starting Allocation2016-03-11 13:44:41,456: radical.enmd.SingleClusterEnvironment: MainProcess                     : MainThread     : INFO    : Requesting resources on ncsa.bw
2016-03-11 13:44:48,340: radical.enmd.SingleClusterEnvironment: MainProcess                     : MainThread     : INFO    : Resource ncsa.bw state has changed to PendingLaunch
2016-03-11 13:44:49,122: radical.pilot       : MainProcess                     : PilotLauncherWorker-1: ERROR   : Pilot launching failed! (schema ssh unknown for resource ncsa.bw)
Traceback (most recent call last):
  File "/Users/eh/myenv7/lib/python2.7/site-packages/radical/pilot/controller/pilot_launcher_worker.py", line 381, in run
    resource_cfg = self._session.get_resource_config(resource_key, schema)
  File "/Users/eh/myenv7/lib/python2.7/site-packages/radical/pilot/session.py", line 604, in get_resource_config
    % (schema, resource_key))
RuntimeError: schema ssh unknown for resource ncsa.bw
2016-03-11 13:44:49,132: radical.enmd.SingleClusterEnvironment: MainProcess                     : MainThread     : INFO    : Launched 64-core pilot on ncsa.bw.
                                                           ok2016-03-11 13:44:49,132: radical.enmd.Engine : MainProcess                     : MainThread     : INFO    : Selected execution plug-in 'simulation_analysis_loop.static.default' for pattern 'SimulationAnalysisLoop' and context type 'Static'.

        Verifying pattern                                                     ok
        Starting pattern execution                                            ok2016-03-11 13:44:49,133: radical.enmd.simulation_analysis_loop.static.default: MainProcess                     : MainThread     : INFO    : Executing simulation-analysis loop with 1 iterations on 64 allocated core(s) on 'ncsa.bw'

--------------------------------------------------------------------------------
Executing simulation-analysis loop with 1 iterations on 64 allocated core(s) on 'ncsa.bw'

2016-03-11 13:44:49,133: radical.enmd.simulation_analysis_loop.static.default: MainProcess                     : MainThread     : INFO    : Waiting for pilot on ncsa.bw to go Active
Job waiting on queue...2016-03-11 13:44:50,144: radical.enmd.SingleClusterEnvironment: MainProcess                     : Thread-1       : INFO    : Resource ncsa.bw state has changed to Failed
2016-03-11 13:44:50,144: radical.enmd.SingleClusterEnvironment: MainProcess                     : Thread-1       : ERROR   : Resource error: Pilot launching failed! (schema ssh unknown for resource ncsa.bw)
2016-03-11 13:44:50,144: radical.enmd.SingleClusterEnvironment: MainProcess                     : Thread-1       : ERROR   : Pattern execution FAILED.
2016-03-11 13:44:50,144: radical.pilot       : MainProcess                     : Thread-1       : ERROR   : sys.exit from callback
Traceback (most recent call last):
  File "/Users/eh/myenv7/lib/python2.7/site-packages/radical/pilot/controller/pilot_manager_controller.py", line 258, in call_callbacks
    cb(self._shared_data[pilot_id]['facade_object'](), new_state)
  File "/Users/eh/myenv7/lib/python2.7/site-packages/radical/ensemblemd/single_cluster_environment.py", line 127, in pilot_state_cb
    sys.exit(2)
SystemExit: 2
2016-03-11 13:44:50,639: radical.enmd.SingleClusterEnvironment: MainProcess                     : MainThread     : ERROR   : Fatal error during execution: .
Fatal error during execution: .Starting Deallocation2016-03-11 13:44:50,639: radical.enmd.SingleClusterEnvironment: MainProcess                     : MainThread     : INFO    : Deallocating Cluster
2016-03-11 13:44:50,639: radical.enmd.SingleClusterEnvironment: MainProcess                     : MainThread     : ERROR   : Fatal error during execution: .
Fatal error: .  File "/Users/eh/myenv7/lib/python2.7/site-packages/radical/ensemblemd/single_cluster_environment.py", line 244, in run
    plugin.execute_pattern(pattern, self)
  File "/Users/eh/myenv7/lib/python2.7/site-packages/radical/ensemblemd/exec_plugins/simulation_analysis_loop/static.py", line 156, in execute_pattern
    resource._pmgr.wait_pilots(resource._pilot.uid,'Active')
  File "/Users/eh/myenv7/lib/python2.7/site-packages/radical/pilot/pilot_manager.py", line 532, in wait_pilots
    time.sleep(0.5)

This worked before, and I didn't change the Globus toolkit; I only updated the radical installation. What is the problem?

andre-merzky commented 8 years ago
RuntimeError: schema ssh unknown for resource ncsa.bw

It seems to use ssh instead of gsissh. I am not sure what's up. What version of RP are you using? (radicalpilot-version)

andre-merzky commented 8 years ago

@vivek-bala : vivek, could that be a misconfiguration on the EnMD layer?

euhruska commented 8 years ago

radicalpilot-version 0.40.1

andre-merzky commented 8 years ago

Yeah, ssh is not supported in that version, so the message complains about the right thing at least :) We'll have to wait for Vivek to reply I'm afraid, not sure where ssh instead of gsissh is picked up...

andre-merzky commented 8 years ago

I think you can add this to the arguments of the SingleClusterEnvironment:

access_schema = 'gsissh'

I am not sure what exactly you run, and if that SingleClusterEnvironment lives in your part of the code at all...

euhruska commented 8 years ago

Good, might have fixed this. At least, it doesn't throw this error anymore, currently waiting in queue.

vivek-bala commented 8 years ago

If unspecified, schema defaults to None.

Yeah, ssh is not supported in that version, so the message complains about the right thing at least :)

I don't understand. ssh is not supported ??

vivek-bala commented 8 years ago

Hmmm, good to know. I have tried on BW, and it seems to work with the schema unspecified. Wondering why it doesn't for you.

andre-merzky commented 8 years ago

I don't understand. ssh is not supported ??

Have a look at https://github.com/radical-cybertools/radical.pilot/blob/devel/src/radical/pilot/configs/resource_ncsa.json#L27 -- you'll see that this RP level config only defines gsissh endpoints for bw, not plain ssh.

vivek-bala commented 8 years ago

Ah yes yes. You meant for BW, got confused :).

vivek-bala commented 8 years ago

If schema is set to None from EnMD, RP should default to gsissh in the case of BW. Right ? @euhruska Eugen, so in your case it works only if you set access_schema to gsissh ?

euhruska commented 8 years ago

Yes, before the access_schema was unspecified and it didn't work. I only added that one line.

andre-merzky commented 8 years ago

If schema is set to None from EnMD, RP should default to gsissh in the case of BW. Right ?

Yes, and that seems to work on the RP level:

So my assumption would be that EnMD falls back to ssh? If that is not the case, I should start looking where we go wrong. Could you print the pilot description before submission?
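The behaviour discussed here (an unspecified schema falls back to the resource's default, while an explicit schema the resource does not define is rejected) can be sketched roughly as below. This is a standalone illustration, not RP's actual code: `RESOURCE_CONFIGS`, `resolve_schema`, and the rule that the first listed schema is the default are all assumptions made for the example, and the real contents of resource_ncsa.json are not reproduced.

```python
# Hypothetical stand-in for a resource configuration; the real entries in
# resource_ncsa.json are not reproduced here.
RESOURCE_CONFIGS = {
    "ncsa.bw": {
        # Assumption for this sketch: the first listed schema is the default.
        "schemas": ["gsissh"],
    },
}


def resolve_schema(resource_key, schema=None):
    """Return the access schema to use for a resource (illustrative only)."""
    cfg = RESOURCE_CONFIGS[resource_key]

    # No schema requested: fall back to the resource's default.
    if schema is None:
        return cfg["schemas"][0]

    # An explicit schema must be one the resource defines.
    if schema not in cfg["schemas"]:
        raise RuntimeError("schema %s unknown for resource %s"
                           % (schema, resource_key))
    return schema
```

Under these assumptions, `resolve_schema("ncsa.bw")` yields `gsissh`, whereas `resolve_schema("ncsa.bw", "ssh")` raises the same kind of RuntimeError seen in the log. That would be consistent with a layer above RP passing `ssh` explicitly rather than leaving the schema unset.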

vivek-bala commented 8 years ago

Hmmm, I'm not sure why it would happen though.

https://github.com/radical-cybertools/radical.ensemblemd/blob/master/src/radical/ensemblemd/single_cluster_environment.py#L228 is where it is assigned to the pilot. Lines 44 and 65 are of interest as well.

vivek-bala commented 8 years ago

Ah wait. @euhruska I think you might be using the released version of EnMD. Could you confirm the version please ? If so, please use the master branch.

vivek-bala commented 8 years ago
================================================================================
 EnsembleMD (0.3.14)
================================================================================

Ok, that's the released version. Please try with the master branch.

euhruska commented 8 years ago

sure

euhruska commented 8 years ago

How do I install the master branch with pip?

vivek-bala commented 8 years ago

pip install --upgrade git+https://github.com/radical-cybertools/radical.ensemblemd.git@master#egg=radical.ensemblemd