radical-collaboration / hpc-workflows

NSF16514 EarthCube Project - Award Number: 1639694

EnTK on Princeton cluster #82

Closed uvaaland closed 5 years ago

uvaaland commented 5 years ago

@jeroentromp expressed interest in setting up the pilot + EnTK on one of the Princeton clusters to have a local testing ground for us to try things out before taking them to Summit.

Princeton will be getting a couple of Summit nodes in the near future, which might make the setup easier.

vivek-bala commented 5 years ago

Hey @uvaaland, can you add any links to the user guide or documentation for the Princeton cluster? That will be helpful in planning.

uvaaland commented 5 years ago

The relevant cluster right now is the TigerGPU cluster: https://researchcomputing.princeton.edu/systems-and-services/available-systems/tiger

The Summit nodes will make up their own cluster. That is, it will be a mini-Summit and so the setup should be similar to what had to be done for Summit itself.

vivek-bala commented 5 years ago

Both Andre and I now have accounts on the Tiger cluster. Thanks, Uno and Jeroen!

andre-merzky commented 5 years ago

So I know my netid for Princeton (but I don't think I have seen or set a password anywhere).

Next step, according to the docs I could find, would be to enable shell access. The respective page says:

New Unix accounts (Undergraduate accounts excepted) will be set up with a default Unix shell of /bin/nologin that prohibits login access. If you wish to use your Unix account, simply change this option by using the Enable Unix Account page. Note: The Enable Unix Account page is only accessible from the Princeton Campus Network, not off-campus.

Any idea how to handle this, and/or whom to contact for some hand holding?

uvaaland commented 5 years ago

Hi, Andre!

It was my understanding from talking to Vivek that he was able to log in and do what needs to be done. Maybe @vivek-bala can clarify if we are missing a step?

uvaaland commented 5 years ago

Hi, Andre!

Yes, let's look at this together tomorrow. I will be at the office by 0930 EST and am free to connect with you anytime outside 1100-1130, when I have a meeting. Let me know if you can work with that, and I am confident that we should be able to work through this without too much effort :)

Uno


From: Andre Merzky [notifications@github.com], Thursday, March 21, 2019 5:45 PM:

Hey @uvaaland, I think I would benefit from your support. I got my netid - but all the ways to access the machine require some initial password. I never set such an initial password anywhere, and don't see any notification with an initial password.

The form at https://puaccess.princeton.edu/ (second page after entering the netid) points to 609-258-HELP as the support contact for lost passwords - but that does not really look like a phone number. Can you interpret that number? Any advice on how I can obtain that password?

Thanks!


andre-merzky commented 5 years ago

I find it fascinating how much the process of getting access to some machine resembles an old-school role playing game: heroic quests, discovering new pages, talking to different NPCs, false leads, some light lies (yes, this is my campus address...), lots of note-taking and mapping...

Long story short, I got access now - thanks to @vivek-bala for some crucial cheat codes! :-D

uvaaland commented 5 years ago

Haha, how exciting! I am happy to hear that it worked out. And don't hesitate to reach out if there is anything else you need :)

Uno



andre-merzky commented 5 years ago

A basic setup is now functional on tiger_cpu. I still need to test MPI tasks and GPU support, but from what I can see this should work (tests are in the queue).

Using RCT on tiger_[cg]pu requires a manual configuration step - the pilot is not able to bootstrap itself on the compute nodes due to DNS setup issues. We may be able to work around this eventually; for now the user needs to run two commands to prepare the stack.

I will need to do some more testing, but in general I would assume that we can release this toward the end of the week - the code changes in RCT are minuscule (some fixes to rarely used code paths); it's basically just adding the config entries.

Please let me know if you have a specific application I should test.

andre-merzky commented 5 years ago

Uh, MPI still fails. Do you happen to have a recommendation on what modules to load for MPI setup?

uvaaland commented 5 years ago

The one that I am currently using for my runs is:

module load openmpi/intel-17.0/2.1.0/64

andre-merzky commented 5 years ago

openmpi/intel-17.0/2.1.0/64

great, let me try that one...

andre-merzky commented 5 years ago

@uvaaland advised loading the intel compiler module first, which resolved the mpirun problem. We are up and running now. Some more tests are needed with the actual workload, and I want to look into some bootstrapping delays - but it looks like we are on track to release this soon.
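
For the record, the load order this boils down to (module names as they appear in this thread; treat it as a sketch rather than a copy-paste recipe):

module purge
module load intel                          # load the Intel compiler module first
module load openmpi/intel-17.0/2.1.0/64    # then the matching OpenMPI build
which mpirun                               # sanity check: mpirun should now resolve to the module's binary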

andre-merzky commented 5 years ago

PR for RP: https://github.com/radical-cybertools/radical.pilot/pull/1852

Note that as of today I see network problems on tiger_cpu (unrelated to RP).

andre-merzky commented 5 years ago

For the record, on the network issues mentioned above:

(ve)  tigercpu  amerzky  ~/radical/radical.pilot  [feature/tiger *] $ ping github.com
PING github.com (192.30.253.112) 56(84) bytes of data.
^C
--- github.com ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 3999ms

(ve)  tigercpu  amerzky  ~/radical/radical.pilot  1  [feature/tiger *] $ host github.com
github.com has address 192.30.253.112
github.com has address 192.30.253.113
github.com mail is handled by 5 ALT2.ASPMX.L.GOOGLE.com.
github.com mail is handled by 10 ALT4.ASPMX.L.GOOGLE.com.
github.com mail is handled by 1 ASPMX.L.GOOGLE.com.
github.com mail is handled by 5 ALT1.ASPMX.L.GOOGLE.com.
github.com mail is handled by 10 ALT3.ASPMX.L.GOOGLE.com.

(ve)  tigercpu  amerzky  ~/radical/radical.pilot  [feature/tiger *] $ ping github.com
PING github.com (192.30.253.113) 56(84) bytes of data.
64 bytes from lb-192-30-253-113-iad.github.com (192.30.253.113): icmp_seq=1 ttl=51 time=7.51 ms
64 bytes from lb-192-30-253-113-iad.github.com (192.30.253.113): icmp_seq=2 ttl=51 time=7.42 ms
^C
--- github.com ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 7.425/7.472/7.519/0.047 ms

It seems like the DNS resolver fails, but its cache can be repopulated by manually resolving the name via a host lookup. That cache entry times out after some seconds. I did not yet attempt to work around this in the RP pilot bootstrapper, but the bootstrapper seems to stumble over this very problem.
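
A possible workaround sketch (untested in the bootstrapper itself) would be to warm the resolver cache with an explicit lookup right before a connection is attempted:

host github.com > /dev/null 2>&1    # repopulate the resolver cache via an explicit lookup
ping -c 2 github.com                # connect while the cache entry is still fresh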

mturilli commented 5 years ago
andre-merzky commented 5 years ago

From @uvaaland :

Hi, Andre! I have provided you access to the folder
/scratch/gpfs/uvaaland/research/radical_collaboration/specfem3d_globe on Tiger.
You can test that it is working by running the mesher from the above folder:

  sbatch go_mesher_slurm_globe.pbs_GPU

If that works, you can inspect the "go_mesher_slurm_globe.pbs_GPU" script
to see which modules are being loaded and the executable being called.
If the pilot can run this, that would be a great stride towards what I need
to be able to do eventually. Let me know! :slightly_smiling_face:

uvaaland commented 5 years ago

To open a ticket with PU system administrators regarding cluster usage, send an email to cses@princeton.edu. They are usually quick to respond.

andre-merzky commented 5 years ago

Hi @uvaaland,

I am encountering trouble with OpenMPI (mpirun segfaults), but seem to have success with intel-mpi. The following modules seem to result in a sane setup:

module purge
module load intel
module load intel-mpi
module load intel-python/2.7

Your application is linked against OpenMPI. I don't seem to have enough disk quota to clone the specfem repo and recompile with intel-mpi to check if that's viable - do you know if that works? If not, would you mind giving it a try? Meanwhile I will continue to look into OpenMPI.

The feature/tiger branch now should work on tiger. It is possible to submit to both the cpu and the gpu queue, and our tests succeed.

Tiger has some connectivity constraints of a kind we have not seen in a while, so I had to refresh/fix some tunneling code, but that is in place now. To give it a try, you would need to run the following (after loading the modules above):

$ git clone git@github.com:radical-cybertools/radical.pilot.git
$ cd radical.pilot
$ git checkout feature/tiger
$ ./bin/radical-pilot-ve ve
$ source ve/bin/activate
(ve) $ pip install .
(ve) $ ./bin/radical-pilot-create-static-ve ~/radical.pilot.sandbox/ve.princeton.tiger/
(ve) $ ./examples/09_mpi_units.py princeton.tiger_cpu
(ve) $ ./examples/09_mpi_units.py princeton.tiger_gpu

If you find the opportunity to try that, would you let me know if that results in successful runs of the examples?

Thanks!

uvaaland commented 5 years ago

Hi, Andre!

I have tried the above commands and can confirm that both examples are working for me. The only additional step I needed to take in order for it to work was:

export RADICAL_PILOT_DBURL="mongodb://user:user123@ds043012.mlab.com:43012/princeton"

If you are having issues with the quota, use the checkquota command, which will list the storage you have available. My suspicion is that you are working in /home/amerzky/, which has only 10-20GB. But you should have storage in /scratch/gpfs/amerzky/ and/or /tigress/amerzky/, which have 500GB each. I strongly recommend you work in one of the latter two folders.
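
For completeness, roughly what that amounts to (the "radical" subdirectory is just an example name):

checkquota                           # list quota and usage per filesystem
mkdir -p /scratch/gpfs/amerzky/radical
cd /scratch/gpfs/amerzky/radical     # work from the large GPFS scratch space instead of $HOME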

andre-merzky commented 5 years ago

@uvaaland : Ah, since $HOME was shared, I did not even consider looking for other FS... Duh! Will do.

andre-merzky commented 5 years ago

We hit a SNAFU on tiger, in that the available OpenMPI modules seem to be unreliable. Specifically their mpirun seems to segfault now and then, which makes task placement in RP unstable.

We have two immediate options: (a) recompile the application against intel-mpi (intel-mpi is based on MPICH, and the launcher seems stable AFAICT), or (b) implement an srun-based launch method for RP. The second is not too difficult, but somewhat cumbersome, as srun needs us to trick it into placing tasks correctly (by modifying env settings per invocation).

I think there is something to be gained from both options, since we want to use the stack (both application and RCT) on other machines too, and supporting intel-mpi and srun widens our options...

@uvaaland : can I ask you to check if specfem can easily be compiled with Intel-MPI? I would then look into srun again.
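
To illustrate option (b): a rough sketch of the per-invocation trickery meant above. The hostnames and task names are made up, but the SLURM_HOSTFILE / arbitrary-distribution mechanism is standard Slurm:

# within an existing allocation, place one task instance on explicit nodes via a per-task hostfile
echo "tiger-i19c1n1" >  /tmp/task.0000.hosts
echo "tiger-i19c1n2" >> /tmp/task.0000.hosts
SLURM_HOSTFILE=/tmp/task.0000.hosts \
    srun --ntasks=2 --distribution=arbitrary ./task_executable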

andre-merzky commented 5 years ago

The srun launch method is now implemented in radical-cybertools/radical.pilot/pull/1854

uvaaland commented 5 years ago

I am currently recompiling using intel-mpi/intel/2018.3/64 and rerunning the specfem mesher and solver to check that it works. Will report back once the results are in.

uvaaland commented 5 years ago

I have successfully recompiled and run the specfem mesher and solver on tigergpu using the intel/18.0 and intel-mpi/intel/2018.3/64 modules.
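
For the record, a rough sketch of that rebuild (module names as above; the configure variables follow the stock SPECFEM3D_GLOBE build system and may need adjusting for your setup):

module purge
module load intel/18.0 intel-mpi/intel/2018.3/64
cd specfem3d_globe
./configure FC=ifort CC=icc MPIFC=mpiifort    # point the build at the Intel compilers and MPI wrapper
make clean
make all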

andre-merzky commented 5 years ago

Ah, cool - then let me switch the RP config to intel-mpi, and the mpirun problem should be solved, too!

andre-merzky commented 5 years ago

This seems to work as expected, and is now configured in the feature/tiger branch of RP. Would you mind giving this a try? Thanks!

uvaaland commented 5 years ago

Are you referring to the following steps?

$ git clone git@github.com:radical-cybertools/radical.pilot.git
$ cd radical.pilot
$ git checkout feature/tiger
$ ./bin/radical-pilot-ve ve
$ source ve/bin/activate
(ve) $ pip install .
(ve) $ ./bin/radical-pilot-create-static-ve ~/radical.pilot.sandbox/ve.princeton.tiger/
(ve) $ ./examples/09_mpi_units.py princeton.tiger_cpu
(ve) $ ./examples/09_mpi_units.py princeton.tiger_gpu
vivek-bala commented 5 years ago

That sounds correct to me.

andre-merzky commented 5 years ago

Yes!

uvaaland commented 5 years ago

I have run the above steps again on Tiger and both examples run successfully.

vivek-bala commented 5 years ago

Awesome!

andre-merzky commented 5 years ago

Thanks!

uvaaland commented 5 years ago

I am still seeing the same issue when trying to run EnTK from my desktop machine. I did a fresh install of the feature/tiger branch, but resource allocation still fails with the same complaint:

NoSuccess: Error finding SLURM tool squeue on remote server slurm://localhost/!

Then I tried running only the Pilot on Tigercpu from my desktop machine. I did so by following the steps:

$ git clone git@github.com:radical-cybertools/radical.pilot.git
$ cd radical.pilot
$ git checkout feature/tiger
$ ./bin/radical-pilot-ve ve
$ source ve/bin/activate
(ve) $ pip install .
(ve) $ ./bin/radical-pilot-create-static-ve ~/radical.pilot.sandbox/ve.princeton.tiger/
(ve) $ ./examples/09_mpi_units.py princeton.tiger_cpu

I modified ./examples/09_mpi_units.py to use ssh as the schema in the resource allocation. Then I get an error:

caught Exception: prompted for unknown password (Password: ) (/home/uvaaland/scratch/tmp/radical.pilot/ve/lib/python2.7/site-packages/radical/saga/utils/pty_shell_factory.py +321 (_initialize_pty) : % match))

This happens even though I have passwordless ssh set up from my desktop machine to Tiger, which I tested separately on the command line.

uvaaland commented 5 years ago

Trying to run the example script again, I get a similar but different message:

Permission denied (keyboard-interactive).
)) (/home/uvaaland/scratch/tmp/radical.pilot/ve/lib/python2.7/site-packages/radical/saga/utils/pty_exceptions.py +44 (translate_exception)  :  elif 'pass'                       in lmsg: e = se.AuthenticationFailed(cmsg))
Traceback (most recent call last):
  File "/home/uvaaland/scratch/tmp/radical.pilot/ve/lib/python2.7/site-packages/radical/saga/utils/pty_shell_factory.py", line 313, in _initialize_pty
    n, match = pty_shell.find (prompt_patterns, delay)
  File "/home/uvaaland/scratch/tmp/radical.pilot/ve/lib/python2.7/site-packages/radical/saga/utils/pty_process.py", line 796, in find
    raise ptye.translate_exception (e, "(%s)" % data)
AuthenticationFailed: read from process failed '[Errno 5] Input/output error' : (f the following options:

 1. Duo Push to XXX-XXX-3048
 2. Phone call to XXX-XXX-3048
 3. SMS passcodes to XXX-XXX-3048 (next code starts with: 1)

It is running into an issue with DUO, but this is interesting because it is not an issue if I ssh onto Tiger from the command line. That is, the following command:

ssh uvaaland@tigercpu.princeton.edu

works without prompting me for a password or DUO, but when I run the example in the same terminal window it fails with the DUO issue. The full radical.saga.pty.log file is attached.

radical.saga.pty.log

andre-merzky commented 5 years ago

Thanks Uno - can you please also set (if you don't have this set yet)

export RADICAL_VERBOSE=DEBUG

and also attach a tarball with the session subdir and the logfiles therein? Thanks.
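
Something along these lines should do (the archive name is arbitrary; this assumes the client-side session directories live in your current working directory, and the session directory name will differ per run):

tar czf session_logs.tar.gz radical.saga.pty.log rp.session.*/    # bundle the saga log and the RP session sandboxes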

uvaaland commented 5 years ago

Great! I ran it again with the additional debug flag. Attached are the radical.saga.pty.log file and the RP session folder.

issue.tar.gz

vivek-bala commented 5 years ago
2019-05-16 08:54:58,155: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : read : [  194] [   43] (Permission denied (keyboard-interactive).\n)
2019-05-16 08:54:58,160: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : Traceback (most recent call last):
  File "/home/uvaaland/scratch/tmp/radical.pilot/ve/lib/python2.7/site-packages/radical/saga/utils/pty_process.py", line 793, in find
    data += self.read (timeout=_POLLDELAY)
  File "/home/uvaaland/scratch/tmp/radical.pilot/ve/lib/python2.7/site-packages/radical/saga/utils/pty_process.py", line 682, in read
    % (e, self.tail))
NoSuccess: read from process failed '[Errno 5] Input/output error' : (f the following options:

 1. Duo Push to XXX-XXX-3048
 2. Phone call to XXX-XXX-3048
 3. SMS passcodes to XXX-XXX-3048 (next code starts with: 1)

Passcode or option (1-3): 
Incorrect passcode. Please try again.

It seems like you don't have passwordless access from the machine you are using to the tiger cluster. Can you confirm if you do?

If you do, can you check whether your hostname (/bin/hostname) on the machine you are connecting from is the same as your user id/login name on tiger?

uvaaland commented 5 years ago

Agreed. What I was telling Andre is that the connection works from the command line: no password, no DUO. But for some reason that is not the case when the pilot tries to do the same. I checked the resource file for Tiger and saw that it uses the same address as I do from the command line (uvaaland@tigercpu.princeton.edu), and the user is always my Princeton username (uvaaland).

Andre was going to see if he could find the line that the Pilot uses to connect such that I could try and run the same line from the command line and check if that works.

vivek-bala commented 5 years ago

I see, okay. I think Andre will be able to debug this. FWIW, the ssh command executed seems to be:

/usr/bin/env TERM=vt100 /usr/bin/ssh -t -o ControlMaster=auto -o ControlPath=/tmp/saga_ssh_uvaaland_%h_%p.ctrl -o TCPKeepAlive=no -o ServerAliveInterval=10 -o ServerAliveCountMax=20 -o ConnectTimeout=10 tigercpu.princeton.edu
vivek-bala commented 5 years ago

Taking a shot in the dark, can you try after adding the following to $HOME/.ssh/config (if you don't already have this):

Host tigercpu.princeton.edu princeton.tigercpu
    Hostname tigercpu.princeton.edu
    User uvaaland
uvaaland commented 5 years ago

Yes, that is what I use, but had not set the user explicitly. I just set this and tried to run it again, but the error is the same.

I tried running the above ssh command from the command line, and indeed it stops at the DUO prompt; but once I complete DUO it does not prompt me for a password. So this is really a DUO issue.

vivek-bala commented 5 years ago

Yea, I would say so. The 'User' is essentially your username on Tiger (whoami on Tiger should confirm that). After adding the entry to the .ssh/config file, I would suggest trying

/usr/bin/env TERM=vt100 /usr/bin/ssh -t -o ControlMaster=auto -o ControlPath=/tmp/saga_ssh_uvaaland_%h_%p.ctrl -o TCPKeepAlive=no -o ServerAliveInterval=10 -o ServerAliveCountMax=20 -o ConnectTimeout=10 tigercpu.princeton.edu

one more time.

vivek-bala commented 5 years ago

So this is really a DUO issue.

Maybe some of the parameters are conflicting with the DUO settings? Let's see what Andre suggests.

uvaaland commented 5 years ago

This is very interesting, because when I run the same command a second time in the same terminal, it goes right through without a DUO prompt. That is, if I run the above command from the command line and complete DUO, and then run the Pilot with the example script ./examples/09_mpi_units.py princeton.tiger_cpu, it does not get stuck on DUO.

uvaaland commented 5 years ago

I still don't see any jobs submitted on Tiger, but the Pilot reaches the gather results stage.

So circumventing DUO may just be a matter of running the ssh command on the command line once before I launch an application with the Pilot. It is not a perfect solution, but at least it is something we can work with.
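
For what it's worth, one way to make that one-off step explicit (an untested sketch - it relies on the ControlPath used by the saga ssh command Vivek posted above, so the socket path must match exactly) would be to pre-establish a persistent master connection, authenticate DUO once, and let the pilot's ssh calls reuse it:

/usr/bin/ssh -fN -o ControlMaster=yes \
    -o ControlPath=/tmp/saga_ssh_uvaaland_%h_%p.ctrl \
    -o ControlPersist=8h \
    tigercpu.princeton.edu                 # answer the DUO prompt once; the master stays in the background
/usr/bin/ssh -O check \
    -o ControlPath=/tmp/saga_ssh_uvaaland_%h_%p.ctrl \
    tigercpu.princeton.edu                 # verify the master connection is alive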

vivek-bala commented 5 years ago

Yea, that sounds quite odd. Maybe there is a DUO timeout per session? You can keep the session persistent by using something like tmux: start a tmux session, set up DUO, and continue using EnTK/RP in that session. You will be able to connect to and disconnect from this session; check out this for a quick how-to on tmux if required.
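
A minimal version of that workflow (the session name is arbitrary):

tmux new -s rct                          # start a named session
ssh uvaaland@tigercpu.princeton.edu      # answer the DUO prompt once inside the session
# ... run EnTK/RP from within this tmux session ...
# detach with Ctrl-b d, re-attach later with:
tmux attach -t rct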

uvaaland commented 5 years ago

Yes, I am using tmux. The way I have DUO configured, only one DUO authentication is required per internet connection. That is, as long as my tmux session is open and the internet connection does not drop on my local desktop (which is very rare since it is wired), I should not have to deal with DUO again.

Now that we have found a way to get past this step, I don't expect it to be an issue, and I will focus on what else might be needed to get the pilot example to work when launching it from my desktop machine. Currently, it hangs at gather results and has done so for the past 15 minutes. I have also checked the queue on Tiger and have not seen any jobs being submitted.

vivek-bala commented 5 years ago

The way I have DUO configured is such that you require one DUO authentication per internet connection.

Is this based on your IP? RP underneath starts multiple ssh connections (3, I think). Anyway, good that this isn't a halting issue anymore.

uvaaland commented 5 years ago

I am not sure. But it means that any time I ssh from the same machine, I will not be prompted for DUO. So it should not be an issue that it opens several ssh connections, as long as DUO has been entered once.