radical-collaboration / hpc-workflows

NSF16514 EarthCube Project - Award Number:1639694

Setting up gsissh on ubuntu #66

Closed Weiming-Hu closed 5 years ago

Weiming-Hu commented 5 years ago

Hi there,

I've tried to fix this myself, but I couldn't find much useful information online. I was following the notes that Vivek and I put together last year, but they no longer work. I previously set it up on Mac OS, and setting it up on Ubuntu looks quite different.

I tried the following:

I now have the myproxy-logon utility and gsissh-keygen, but I still can't find gsissh.

Could you give me some suggestions? Thank you very much.

Weiming

vivek-bala commented 5 years ago

Hey Weiming, what version of ubuntu are you using? Can you paste the error you are getting?

Weiming-Hu commented 5 years ago

I'm using 18.04 Bionic Ubuntu. I'm still trying to set up gsissh, therefore I don't have any error messages from EnTK. I'm just having trouble installing gsissh.

Weiming-Hu commented 5 years ago

I can still run the script. I get the following errors. This is expected because I didn't successfully set up gsissh and I don't know how.

wuh20@sapphire:~/github/hpc-workflows/scripts/application_AnEn/year_2$ python test.py 
EnTK session: re.session.sapphire.geog.psu.edu.wuh20.017812.0005
Creating AppManager                                                           ok
Validating and assigning resource manager                                     ok
Setting up RabbitMQ system                                                    ok
2018-10-08 11:53:38,835: radical.entk.resource_manager.0000: MainProcess                     : MainThread     : ERROR   : Resource request submission failed
2018-10-08 11:53:38,836: radical.entk.appmanager.0000: MainProcess                     : MainThread     : ERROR   : Error in AppManager: cmd gsissh not found
Traceback (most recent call last):
  File "/home/graduate/wuh20/virtual-envs/env-entk/local/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 310, in run
    self._resource_manager._submit_resource_request()
  File "/home/graduate/wuh20/virtual-envs/env-entk/local/lib/python2.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 155, in _submit_resource_request
    self._pilot = self._pmgr.submit_pilots(pdesc)
  File "/home/graduate/wuh20/virtual-envs/env-entk/local/lib/python2.7/site-packages/radical/pilot/pilot_manager.py", line 503, in submit_pilots
    pilot = ComputePilot(pmgr=self, descr=pd)
  File "/home/graduate/wuh20/virtual-envs/env-entk/local/lib/python2.7/site-packages/radical/pilot/compute_pilot.py", line 103, in __init__
    self._resource_sandbox = self._session._get_resource_sandbox(pilot)
  File "/home/graduate/wuh20/virtual-envs/env-entk/local/lib/python2.7/site-packages/radical/pilot/session.py", line 1020, in _get_resource_sandbox
    shell = rsup.PTYShell(js_url, self)
  File "/home/graduate/wuh20/virtual-envs/env-entk/local/lib/python2.7/site-packages/saga/utils/pty_shell.py", line 247, in __init__
    interactive=self.interactive)
  File "/home/graduate/wuh20/virtual-envs/env-entk/local/lib/python2.7/site-packages/saga/utils/pty_shell_factory.py", line 173, in initialize
    posix, interactive)
  File "/home/graduate/wuh20/virtual-envs/env-entk/local/lib/python2.7/site-packages/saga/utils/pty_shell_factory.py", line 519, in _create_master_entry
    info['ssh_exe']    = self._which ("gsissh")
  File "/home/graduate/wuh20/virtual-envs/env-entk/local/lib/python2.7/site-packages/saga/utils/pty_shell_factory.py", line 461, in _which
    raise RuntimeError('cmd %s not found' % cmd)
RuntimeError: cmd gsissh not found
Traceback (most recent call last):
  File "test.py", line 42, in <module>
    appman.run()
  File "/home/graduate/wuh20/virtual-envs/env-entk/local/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 310, in run
    self._resource_manager._submit_resource_request()
  File "/home/graduate/wuh20/virtual-envs/env-entk/local/lib/python2.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 155, in _submit_resource_request
    self._pilot = self._pmgr.submit_pilots(pdesc)
  File "/home/graduate/wuh20/virtual-envs/env-entk/local/lib/python2.7/site-packages/radical/pilot/pilot_manager.py", line 503, in submit_pilots
    pilot = ComputePilot(pmgr=self, descr=pd)
  File "/home/graduate/wuh20/virtual-envs/env-entk/local/lib/python2.7/site-packages/radical/pilot/compute_pilot.py", line 103, in __init__
    self._resource_sandbox = self._session._get_resource_sandbox(pilot)
  File "/home/graduate/wuh20/virtual-envs/env-entk/local/lib/python2.7/site-packages/radical/pilot/session.py", line 1020, in _get_resource_sandbox
    shell = rsup.PTYShell(js_url, self)
  File "/home/graduate/wuh20/virtual-envs/env-entk/local/lib/python2.7/site-packages/saga/utils/pty_shell.py", line 247, in __init__
    interactive=self.interactive)
  File "/home/graduate/wuh20/virtual-envs/env-entk/local/lib/python2.7/site-packages/saga/utils/pty_shell_factory.py", line 173, in initialize
    posix, interactive)
  File "/home/graduate/wuh20/virtual-envs/env-entk/local/lib/python2.7/site-packages/saga/utils/pty_shell_factory.py", line 519, in _create_master_entry
    info['ssh_exe']    = self._which ("gsissh")
  File "/home/graduate/wuh20/virtual-envs/env-entk/local/lib/python2.7/site-packages/saga/utils/pty_shell_factory.py", line 461, in _which
    raise RuntimeError('cmd %s not found' % cmd)
RuntimeError: cmd gsissh not found
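
The failure above comes down to `gsissh` not being found on the PATH. A minimal sketch of a pre-flight check that could be run before `test.py`, assuming Python 3.3 or later for `shutil.which` (the environment in this thread is Python 2.7, where that function is not available). `find_missing` is a hypothetical helper name, not part of EnTK or SAGA:

```python
import shutil


def find_missing(commands):
    """Return the commands from `commands` that are not found on PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]


if __name__ == "__main__":
    # gsissh and myproxy-logon are the tools named in this thread.
    missing = find_missing(["gsissh", "myproxy-logon"])
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required commands found")
```

Running the check first gives a one-line answer instead of a long traceback from inside the pilot submission.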

Thank you very much.

vivek-bala commented 5 years ago

Okay. The instructions to set up gsissh that I shared with you last year were for xenial (Ubuntu 16) and trusty (Ubuntu 14). To adapt them for bionic, replace the string 'xenial' with 'bionic' in all the commands in the instructions.
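
A minimal sketch of that substitution, assuming the commands are available as a list of text lines. `retarget` is a hypothetical helper, and the sample lines are placeholders, since the actual setup commands are not reproduced in this thread:

```python
def retarget(lines, old="xenial", new="bionic"):
    """Replace every occurrence of the release codename `old` with `new`."""
    return [line.replace(old, new) for line in lines]


if __name__ == "__main__":
    # Hypothetical placeholder lines, not the real setup commands.
    sample = [
        "deb http://example.invalid/repo xenial contrib",
        "wget http://example.invalid/repo/xenial/some-package.deb",
    ]
    for line in retarget(sample):
        print(line)
```

A plain string replacement mirrors the advice literally; it does not check that the retargeted repository or package actually exists for the new release.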

I tried to do this for ubuntu 18 a couple of weeks back and found out that some of the packages required for gsissh were not available for ubuntu 18. You can give it a try as well, but I would urge you to look at the output messages from each of the commands. If some package is unavailable, you will get an error saying so.

Let me know how it goes.

vivek-bala commented 5 years ago

FWIW, the instructions are still valid for ubuntu 16. I reverted to ubuntu 16 from ubuntu 18 (gsissh was one of several reasons) and was able to use the same set of instructions as in that document.

Weiming-Hu commented 5 years ago

Thanks. I'm trying this script, replacing the version and distribution names with those for bionic (18.04).

Weiming-Hu commented 5 years ago

It works. It looks like I need to install several packages in a specific order; they probably have dependencies that I wasn't aware of. Earlier, when I ran apt-get install gsi-openssh-clients, it failed. After I installed all the dependencies in the script and ran the command again, it worked perfectly. Now I have gsissh.
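
A minimal sketch of installing packages one at a time and stopping at the first failure, so the failing package is easy to spot. `install_in_order` is a hypothetical helper; only the package name `gsi-openssh-clients` comes from this thread, and the actual list of dependencies and their order is not given here. Running it for real needs root privileges:

```python
import subprocess


def install_in_order(packages, runner=subprocess.call):
    """Install packages one at a time; return the first package that fails.

    `runner` receives the argv list and must return the exit status.
    Returns None when every install command exits with status 0.
    """
    for package in packages:
        status = runner(["apt-get", "install", "-y", package])
        if status != 0:
            return package
    return None
```

A call such as `install_in_order(["some-dependency", "gsi-openssh-clients"])` (where `some-dependency` is a placeholder) would then report which package, if any, could not be installed. The `runner` parameter exists so the logic can be exercised without touching the system.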

vivek-bala commented 5 years ago

Awesome! Glad to hear that. Can you share the final commands/script you executed to get gsissh set up on Bionic? I would like to store that alongside the current scripts. Thanks!

I can add all these to EnTK documentation itself at some point.

Weiming-Hu commented 5 years ago

Sure. I've added it to my notes.

Weiming-Hu commented 5 years ago

Closing this now. gsissh to Stampede2 works, but it is not working for Comet and SuperMIC. Since this is not the main concern as long as we have one working resource, I'm going to close this issue and carry on with testing my script on Stampede2. Thank you for your help.

vivek-bala commented 5 years ago

Thanks. I believe Comet uses ssh alone. SuperMIC uses gsissh but requires specifying port 2222. Just want to make sure you included the port number as well.
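
A minimal sketch of those per-resource access rules as a lookup table. The hostnames are the ones used elsewhere in this thread; `ACCESS` and `login_command` are hypothetical names for illustration, not part of EnTK or RADICAL-Pilot:

```python
# Access details taken from this thread: SuperMIC is reached with gsissh on
# port 2222, Comet with plain ssh.
ACCESS = {
    "supermic.cct-lsu.xsede.org": {"exe": "gsissh", "port": 2222},
    "comet.sdsc.xsede.org": {"exe": "ssh", "port": None},
}


def login_command(host, user=None):
    """Build the argv list for an interactive login to `host`."""
    entry = ACCESS[host]
    argv = [entry["exe"]]
    if entry["port"] is not None:
        argv += ["-p", str(entry["port"])]
    if user is not None:
        argv += ["-l", user]
    argv.append(host)
    return argv


if __name__ == "__main__":
    print(" ".join(login_command("supermic.cct-lsu.xsede.org")))
```

Keeping the client and port next to the hostname makes it harder to forget the port for one resource or to use the wrong client for another.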

Weiming-Hu commented 5 years ago

This is what I've got when I tried SuperMIC and Comet.

wuh20@sapphire:~$ gsissh -p 2222 supermic.cct-lsu.xsede.org
Disconnecting 204.90.40.21 port 2222: Hash's MIC didn't verify
wuh20@sapphire:~$ ssh -l weiming comet.sdsc.xsede.org
Password: 
Last login: Mon Oct  8 10:55:11 2018 from sapphire.geog.psu.edu
Rocks 6.2 (SideWinder)
Profile built 16:45 08-Feb-2016

Kickstarted 17:27 08-Feb-2016

                      WELCOME TO 
      __________________  __  _______________
        -----/ ____/ __ \/  |/  / ____/_  __/
          --/ /   / / / / /|_/ / __/   / /
           / /___/ /_/ / /  / / /___  / /
           \____/\____/_/  /_/_____/ /_/

*******************************************************************************

[1] Example Scripts: /share/apps/examples

[2] Filesystems:

     (a) Lustre scratch filesystem : /oasis/scratch/comet/$USER/temp_project
         (Preferred: Scalable large block I/O)

     (b) Compute/GPU node local SSD storage: /scratch/$USER/$SLURM_JOBID
         (Meta-data intensive jobs, high IOPs)

     (c) Lustre projects filesystem: /oasis/projects/nsf

     (d) /home/$USER : Only for source files, libraries, binaries.
         *Do not* use for I/O intensive jobs.

[3] Comet User Guide: http://www.sdsc.edu/support/user_guides/comet.html

******************************************************************************
[weiming@comet-ln3 ~]$ 

For SuperMIC, it failed immediately with the error message above. I Googled it a bit, and it looks like it is a bug in OpenSSH?

For Comet, I'm still asked for credentials. Looks like gsissh is not working.

I tried Stampede, but gsissh didn't respond for a long time. Maybe this has something to do with it being overloaded or outdated?

Then I tried Stampede2, and it works perfectly. But I got the following error messages from EnTK.

(env-entk) wuh20@sapphire:~/github/hpc-workflows/scripts/application_AnEn/year_2$ python test.py 
EnTK session: re.session.sapphire.geog.psu.edu.wuh20.017812.0014
Creating AppManager                                                           ok
Validating and assigning resource manager                                     ok
Setting up RabbitMQ system                                                    ok
2018-10-08 14:58:09,758: radical.entk.resource_manager.0000: MainProcess                     : MainThread     : ERROR   : Resource request submission failed
2018-10-08 14:58:09,759: radical.entk.appmanager.0000: MainProcess                     : MainThread     : ERROR   : Error in AppManager: Resource 'xsede.stampede_ssh' is not known.
Traceback (most recent call last):
  File "/home/graduate/wuh20/virtual-envs/env-entk/local/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 310, in run
    self._resource_manager._submit_resource_request()
  File "/home/graduate/wuh20/virtual-envs/env-entk/local/lib/python2.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 155, in _submit_resource_request
    self._pilot = self._pmgr.submit_pilots(pdesc)
  File "/home/graduate/wuh20/virtual-envs/env-entk/local/lib/python2.7/site-packages/radical/pilot/pilot_manager.py", line 503, in submit_pilots
    pilot = ComputePilot(pmgr=self, descr=pd)
  File "/home/graduate/wuh20/virtual-envs/env-entk/local/lib/python2.7/site-packages/radical/pilot/compute_pilot.py", line 102, in __init__
    = self._session._get_jsurl           (pilot)
  File "/home/graduate/wuh20/virtual-envs/env-entk/local/lib/python2.7/site-packages/radical/pilot/session.py", line 1118, in _get_jsurl
    rcfg    = self.get_resource_config(resrc, schema)
  File "/home/graduate/wuh20/virtual-envs/env-entk/local/lib/python2.7/site-packages/radical/pilot/session.py", line 900, in get_resource_config
    raise RuntimeError("Resource '%s' is not known." % resource)
RuntimeError: Resource 'xsede.stampede_ssh' is not known.
Traceback (most recent call last):
  File "test.py", line 40, in <module>
    appman.run()
  File "/home/graduate/wuh20/virtual-envs/env-entk/local/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 310, in run
    self._resource_manager._submit_resource_request()
  File "/home/graduate/wuh20/virtual-envs/env-entk/local/lib/python2.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 155, in _submit_resource_request
    self._pilot = self._pmgr.submit_pilots(pdesc)
  File "/home/graduate/wuh20/virtual-envs/env-entk/local/lib/python2.7/site-packages/radical/pilot/pilot_manager.py", line 503, in submit_pilots
    pilot = ComputePilot(pmgr=self, descr=pd)
  File "/home/graduate/wuh20/virtual-envs/env-entk/local/lib/python2.7/site-packages/radical/pilot/compute_pilot.py", line 102, in __init__
    = self._session._get_jsurl           (pilot)
  File "/home/graduate/wuh20/virtual-envs/env-entk/local/lib/python2.7/site-packages/radical/pilot/session.py", line 1118, in _get_jsurl
    rcfg    = self.get_resource_config(resrc, schema)
  File "/home/graduate/wuh20/virtual-envs/env-entk/local/lib/python2.7/site-packages/radical/pilot/session.py", line 900, in get_resource_config
    raise RuntimeError("Resource '%s' is not known." % resource)
RuntimeError: Resource 'xsede.stampede_ssh' is not known.
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/usr/lib/python2.7/multiprocessing/util.py", line 328, in _exit_function
    p.join()
  File "/home/graduate/wuh20/virtual-envs/env-entk/local/lib/python2.7/site-packages/radical/utils/process.py", line 821, in join
    super(Process, self).join(timeout=timeout)
  File "/usr/lib/python2.7/multiprocessing/process.py", line 148, in join
    res = self._popen.wait(timeout)
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 165, in wait
    time.sleep(delay)
KeyboardInterrupt
Error in sys.exitfunc:
Traceback (most recent call last):
  File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/usr/lib/python2.7/multiprocessing/util.py", line 328, in _exit_function
    p.join()
  File "/home/graduate/wuh20/virtual-envs/env-entk/local/lib/python2.7/site-packages/radical/utils/process.py", line 821, in join
    super(Process, self).join(timeout=timeout)
  File "/usr/lib/python2.7/multiprocessing/process.py", line 148, in join
    res = self._popen.wait(timeout)
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 165, in wait
    time.sleep(delay)
KeyboardInterrupt
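
The run above fails only after the AppManager and RabbitMQ are set up, because the resource label is rejected late. A minimal sketch of checking the label up front, assuming the caller supplies the labels known to the installed version; how that list is obtained is left open, and `check_resource_label` is a hypothetical helper, not an EnTK or RADICAL-Pilot function:

```python
import difflib


def check_resource_label(label, known_labels):
    """Return `label` if it is known; otherwise raise ValueError.

    The error message lists close matches, if any, as suggestions.
    """
    if label in known_labels:
        return label
    hints = difflib.get_close_matches(label, list(known_labels))
    message = "Resource %r is not known." % label
    if hints:
        message += " Did you mean: %s?" % ", ".join(hints)
    raise ValueError(message)
```

Calling this before `appman.run()` turns a mistyped label into an immediate, short error.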

For Wrangler, gsissh also works perfectly. After I changed the resource to 'xsede.wrangler' and ran the test script, I got the following error messages from EnTK. I submitted a ticket to XSEDE asking why my account is not associated with the project number.

(env-entk) wuh20@sapphire:~/github/hpc-workflows/scripts/application_AnEn/year_2$ python test.py 
EnTK session: re.session.sapphire.geog.psu.edu.wuh20.017812.0013
Creating AppManager                                                           ok
Validating and assigning resource manager                                     ok
Setting up RabbitMQ system                                                    ok
2018-10-08 14:57:11,556: radical.saga.cpi    : pmgr.0000.launching.0           : Thread-2       : ERROR   : NoSuccess: Couldn't get job id from submitted job! sbatch output:

---------------------------------------------------------------
          Welcome to the Wrangler Supercomputer                 
---------------------------------------------------------------

No reservation for this job
--> Verifying valid submit host (login1)...OK
--> Verifying valid jobname...OK
--> Enforcing max jobs per user...OK
--> Verifying availability of your home dir (/home/04672/tg839717)...OK
--> Verifying availability of your work dir (/work/04672/tg839717/wrangler)...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (normal)...OK
--> Verifying job request is within current queue limits...OK
--> Checking available allocation (TG-MCB090174)...ERROR: User tg839717 is not associated with project TG-MCB090174 (in accounting_check_prod.pl).

Please report this problem: 
U. of TX users contact (https://portal.tacc.utexas.edu/consulting)
XSEDE    users contact (https://portal.xsede.org/group/xup/help-desk).
FAILED
removed ‘tmp_Yi0yO5.slurm’

2018-10-08 14:57:11,563: radical.entk.resource_manager.0000: MainProcess                     : pmgr.0000.subscriber._state_sub_cb: ERROR   : Pilot has failed
All components created
Update: Pipeline pipeline.0000 in state SCHEDULING
Update: Stage stage.0000 in state SCHEDULING
/home/graduate/wuh20/virtual-envs/env-entk/local/lib/python2.7/site-packages/pymongo/topology.py:149: UserWarning: MongoClient opened before fork. Create MongoClient only after forking. See PyMongo's documentation for details: http://api.mongodb.org/python/current/faq.html#is-pymongo-fork-safe
  "MongoClient opened before fork. Create MongoClient only "
Update: Task task.0000 in state SCHEDULING
Update: Task task.0000 in state SCHEDULED
Update: Stage stage.0000 in state SCHEDULED
Update: Task task.0000 in state SUBMITTING
^C2018-10-08 14:57:27,347: radical.entk.task_manager.0000: task-manager                    : MainThread     : ERROR   : Execution interrupted by user (you probably hit Ctrl+C), trying to cancel tmgr process gracefully...
2018-10-08 14:57:27,347: radical.entk.appmanager.0000: MainProcess                     : MainThread     : ERROR   : Execution interrupted by user (you probably hit Ctrl+C), trying to cancel enqueuer thread gracefully...
Process task-manager:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/home/graduate/wuh20/virtual-envs/env-entk/local/lib/python2.7/site-packages/radical/entk/execman/rp/task_manager.py", line 265, in _tmgr
    raise KeyboardInterrupt
KeyboardInterrupt

Sorry to include so much information at once. Let me know if you prefer to talk in person. Thank you.

vivek-bala commented 5 years ago

> For SuperMIC, it failed immediately with the error message. I Google it a bit and it looks like it is a bug in the OpenSSH?

That seems correct. I remember encountering this on Ubuntu 18, and the solution was to determine the correct version of OpenSSH. I never found out which version, though.
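
A minimal sketch for reading the local client's version, which is a first step when comparing against a known-good setup. It relies on `ssh -V` printing its banner to stderr; that `gsissh` accepts the same `-V` flag is an assumption here, and both function names are hypothetical:

```python
import re
import subprocess


def parse_openssh_version(banner):
    """Extract a version such as '7.6p1' from an `ssh -V` banner, else None."""
    match = re.search(r"OpenSSH_(\d+\.\d+(?:p\d+)?)", banner)
    return match.group(1) if match else None


def client_version(exe="ssh"):
    """Run `<exe> -V` and parse the banner, which OpenSSH writes to stderr."""
    proc = subprocess.run(
        [exe, "-V"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return parse_openssh_version(proc.stderr + proc.stdout)
```

Calling `client_version("gsissh")` on a working Ubuntu 16 machine and on the Ubuntu 18 machine would show whether the two clients differ in version. It does not by itself identify which version avoids the MIC error.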

For Comet, you should try ssh, not gsissh. Stampede is decommissioned, so we can't use Stampede any more.

I think it's good that you can use Stampede2 right now, so do continue with that.

Weiming-Hu commented 5 years ago

OK. Do you have any experience with Wrangler? I ran the test script on Wrangler, and things seem to work except that my account is not associated with our project.

If I should continue on Stampede2, what should I do about the error message? I also tried changing the resource to xsede.stampede2, but EnTK said it couldn't find the resource. Should I open another ticket instead?

Thank you

Weiming-Hu commented 5 years ago

Hah, I just heard back from XSEDE on the Wrangler issue. They said that the information that I was added to the project had not reached the TACC system, and they just made a manual change. I'll try again in half an hour and keep you posted.

Weiming-Hu commented 5 years ago

After XSEDE added my account to the project, the test script runs successfully on Wrangler. I'm going to use this platform for testing purposes. @vivek-bala Thank you.