Closed uvaaland closed 5 years ago
Hey @uvaaland , can you add any links to the user guide or documentation for the Princeton cluster? That will be helpful in planning.
The relevant cluster right now is the TigerGPU cluster: https://researchcomputing.princeton.edu/systems-and-services/available-systems/tiger
The Summit nodes will make up their own cluster. That is, it will be a mini-Summit and so the setup should be similar to what had to be done for Summit itself.
Both Andre and I now have accounts on the Tiger cluster. Thanks, Uno and Jeroen!
So I know my netid for Princeton (but I don't think I have ever seen or set a password anywhere).
Next step, according to the docs I could find, would be to enable shell access. The respective page says:
New Unix accounts (Undergraduate accounts excepted) will be set up with a default Unix shell of /bin/nologin that prohibits login access. If you wish to use your Unix account, simply change this option by using the Enable Unix Account page. Note: The Enable Unix Account page is only accessible from the Princeton Campus Network, not off-campus.
Any idea how to handle this, and/or whom to contact for some hand holding?
Hi, Andre!
It was my understanding from talking to Vivek that he was able to log in and do what needs to be done. Maybe @vivek-bala can clarify if we are missing a step?
Hi, Andre!
Yes, let's look at this together tomorrow. I will be at the office by 0930 EST and am free to connect with you anytime outside 1100-1130, when I have a meeting. Let me know if you can work with that, and I am confident that we should be able to work through this without too much effort :)
Uno
Hey @uvaaland , I think I would benefit from your support. I got my netid - but all the ways to access the machine require some initial password. I never set such an initial password anywhere, and I don't see any notification with an initial password.
The form (second page after entering the netid, at https://puaccess.princeton.edu/) points to 609-258-HELP as the support contact for lost passwords - but that does not really look like a phone number? Can you interpret that number? Any advice on how I can obtain that password?
Thanks!
I find it fascinating how much the process of getting access to some machine resembles an old-school role-playing game: heroic quests, discovering new pages, talking to different NPCs, false leads, some light lies (yes, this is my campus address...), lots of note taking and mapping...
Long story short, I got access now - thanks to @vivek-bala for some crucial cheat codes! :-D
Haha, how exciting! I am happy to hear that it worked out. And don't hesitate to reach out if there is anything else you should need :)
Uno
A basic setup is now functional on `tiger_cpu`. I still need to test MPI tasks and GPU support, but from what I can see this should work (tests are in the queue).
Using RCT on `tiger_[cg]pu` requires a manual configuration step - the pilot is not able to bootstrap itself on the compute nodes due to DNS setup issues. We may be able to work around this eventually; for now the user needs to run two commands to prepare the stack.
I will need to do some more testing, but in general I would assume that we can release this toward the end of the week - the code changes in RCT are minuscule (some fixes to rarely used code paths); it's basically just adding the config entries.
Please let me know if you have a specific application I should test.
Uh, MPI still fails. Do you happen to have a recommendation on what modules to load for MPI setup?
The one that I am currently using for my runs is:

module load openmpi/intel-17.0/2.1.0/64
great, let me try that one...
@uvaaland advised loading the intel compiler module first, which resolved the mpirun problem. We are up and running now. Some more tests are needed with the actual workload, and I want to look into some bootstrapping delays - but it looks like we are on track to releasing this soon.
PR for RP: https://github.com/radical-cybertools/radical.pilot/pull/1852
Note that as of today I see network problems on `tiger_cpu` (unrelated to RP).
For the record, regarding the network issues mentioned above:
(ve) tigercpu amerzky ~/radical/radical.pilot [feature/tiger *] $ ping github.com
PING github.com (192.30.253.112) 56(84) bytes of data.
^C
--- github.com ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 3999ms
(ve) tigercpu amerzky ~/radical/radical.pilot 1 [feature/tiger *] $ host github.com
github.com has address 192.30.253.112
github.com has address 192.30.253.113
github.com mail is handled by 5 ALT2.ASPMX.L.GOOGLE.com.
github.com mail is handled by 10 ALT4.ASPMX.L.GOOGLE.com.
github.com mail is handled by 1 ASPMX.L.GOOGLE.com.
github.com mail is handled by 5 ALT1.ASPMX.L.GOOGLE.com.
github.com mail is handled by 10 ALT3.ASPMX.L.GOOGLE.com.
(ve) tigercpu amerzky ~/radical/radical.pilot [feature/tiger *] $ ping github.com
PING github.com (192.30.253.113) 56(84) bytes of data.
64 bytes from lb-192-30-253-113-iad.github.com (192.30.253.113): icmp_seq=1 ttl=51 time=7.51 ms
64 bytes from lb-192-30-253-113-iad.github.com (192.30.253.113): icmp_seq=2 ttl=51 time=7.42 ms
^C
--- github.com ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 7.425/7.472/7.519/0.047 ms
It seems like the DNS resolver fails, but its cache can be repopulated by a manual lookup via `host`. That cache times out after some seconds. I did not attempt to work around this in the RP pilot bootstrapper, but it seems to stumble over this very problem.
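The manual-warming workaround described above could be sketched as follows - a hypothetical helper (not part of RP's bootstrapper) that simply retries the lookup until the resolver cache is repopulated:

```python
import socket
import time

def resolve_with_retry(hostname, attempts=3, delay=0.5):
    """Retry a DNS lookup until a flaky resolver cache is repopulated.

    Hypothetical helper illustrating the workaround above: an explicit
    lookup warms the node's resolver cache, so that an immediately
    following connection attempt can succeed before the cache expires.
    """
    last_err = None
    for _ in range(attempts):
        try:
            return socket.gethostbyname(hostname)
        except socket.gaierror as err:
            last_err = err
            time.sleep(delay)
    raise last_err

# Warm the cache right before the network call that needs it.
addr = resolve_with_retry("localhost")
```

The key point is the timing: because the cache expires after seconds, the lookup has to happen immediately before the connection attempt, not once at startup.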
From @uvaaland :
Hi, Andre! I have provided you access to the folder
/scratch/gpfs/uvaaland/research/radical_collaboration/specfem3d_globe on Tiger.
You can test that it is working by running the mesher from the above folder:
sbatch go_mesher_slurm_globe.pbs_GPU
If that works, you can inspect the "go_mesher_slurm_globe.pbs_GPU" script
to see which modules are being loaded and the executable being called.
If the pilot can run this, that would be a great stride towards what I need
to be able to do eventually. Let me know! :slightly_smiling_face:
To open a ticket with PU system administrators regarding cluster usage, send an email to cses@princeton.edu. They are usually quick to respond.
Hi @uvaaland,
I encounter troubles with OpenMPI (mpirun segfaults), but seem to have success with intel-mpi. The following modules seem to result in a sane setup:
module purge
module load intel
module load intel-mpi
module load intel-python/2.7
Your application is linked against OpenMPI. I don't seem to have enough disk quota to clone the specfem repo and recompile with intel-mpi to check if that's viable - do you know if that works? If not, would you mind giving it a try? Meanwhile I will continue to look into OpenMPI.
The `feature/tiger` branch now should work on `tiger`. It is possible to submit to both the `cpu` and the `gpu` queue, and our tests succeed.
Tiger has some connectivity constraints of a kind we haven't seen in a while, so I had to refresh/fix some tunneling code, but that is in place now. To give it a try, you would need to (after loading the modules above):
$ git clone git@github.com:radical-cybertools/radical.pilot.git
$ cd radical.pilot
$ git checkout feature/tiger
$ ./bin/radical-pilot-ve ve
$ source ve/bin/activate
(ve) $ pip install .
(ve) $ ./bin/radical-pilot-create-static-ve ~/radical.pilot.sandbox/ve.princeton.tiger/
(ve) $ ./examples/09_mpi_units.py princeton.tiger_cpu
(ve) $ ./examples/09_mpi_units.py princeton.tiger_gpu
If you find the opportunity to try that, would you let me know if that results in successful runs of the examples?
Thanks!
Hi, Andre!
I have tried the above commands and can confirm that both examples are working for me. The only additional step I needed to take in order for it to work was:
export RADICAL_PILOT_DBURL="mongodb://user:user123@ds043012.mlab.com:43012/princeton"
If you are having issues with the quota, use the command `checkquota`, which will list the storage you have available. My suspicion is that you are working in `/home/amerzky/`, which has only 10-20GB. But you should have storage on `/scratch/gpfs/amerzky/` and/or `/tigress/amerzky/`, which have 500GB each. I strongly recommend you work in one of the latter two folders.
@uvaaland : Ah, since `$HOME` was shared, I did not even consider looking for other filesystems... Duh! Will do.
We hit a SNAFU on tiger, in that the available OpenMPI modules seem to be unreliable. Specifically, their `mpirun` seems to segfault now and then, which makes task placement in RP unstable.
We have two immediate options: (a) recompile the application against `intel-mpi` (intel-mpi is based on MPICH, and the launcher seems stable AFAICT); (b) implement an `srun`-based launch method for RP. The second is not too difficult, but somewhat cumbersome, as `srun` needs us to trick it into placing tasks correctly (by modifying env settings per invocation).
I think there is something to be gained from both options, since we want to use the stack (both application and RCT) on other machines, too, and supporting intel-mpi and srun widens our options...
@uvaaland : can I ask you to check if specfem can easily be compiled with Intel-MPI? I would then look into `srun` again.
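The per-invocation env trick mentioned above can be illustrated with a small sketch. This is not RP's actual launch method - the function name and mechanism are illustrative - but `SLURM_HOSTFILE` together with `--distribution=arbitrary` is one documented way to make `srun` place tasks on specific hosts:

```python
def srun_task_command(executable, node, ntasks=1, hostfile="task.hosts"):
    """Build a per-task srun invocation pinned to a specific node.

    Illustrative sketch only (not RP's real launch method): write a
    dedicated hostfile for this one task, and point srun at it via a
    per-invocation environment setting instead of a shared global config.
    """
    with open(hostfile, "w") as f:
        for _ in range(ntasks):
            f.write(node + "\n")
    env = {"SLURM_HOSTFILE": hostfile}          # env tweak per invocation
    cmd = ["srun", "--distribution=arbitrary",  # honor the hostfile order
           "-n", str(ntasks), executable]
    return env, cmd

env, cmd = srun_task_command("./solver", "tiger-i19g1")
```

Each task launch would get its own hostfile and environment, which is exactly the "trick it into placing tasks correctly" overhead described above.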
The `srun` launch method is now implemented in radical-cybertools/radical.pilot/pull/1854
I am currently recompiling using `intel-mpi/intel/2018.3/64` and rerunning the specfem mesher and solver to check that it works. Will report back once the results are in.
I have successfully recompiled and run the specfem mesher and solver on `tigergpu` using the `intel/18.0` and `intel-mpi/intel/2018.3/64` modules.
Ah, cool - then let me switch the RP config to `intel-mpi`, and the mpirun problem should be solved, too!
This seems to work as expected, and is now configured in the `feature/tiger` branch of RP. Would you mind giving this a try? Thanks!
Are you referring to the following steps?
$ git clone git@github.com:radical-cybertools/radical.pilot.git
$ cd radical.pilot
$ git checkout feature/tiger
$ ./bin/radical-pilot-ve ve
$ source ve/bin/activate
(ve) $ pip install .
(ve) $ ./bin/radical-pilot-create-static-ve ~/radical.pilot.sandbox/ve.princeton.tiger/
(ve) $ ./examples/09_mpi_units.py princeton.tiger_cpu
(ve) $ ./examples/09_mpi_units.py princeton.tiger_gpu
That sounds correct to me.
Yes!
I have run the above steps again on Tiger and both examples run successfully.
Awesome!
Thanks!
I am still seeing the same issue trying to run EnTK from my desktop machine. I did a fresh install of the feature/tiger branch, but resource allocation still fails with the same complaint:
NoSuccess: Error finding SLURM tool squeue on remote server slurm://localhost/!
Then I tried running only the Pilot on Tigercpu from my desktop machine. I did so by following the steps:
$ git clone git@github.com:radical-cybertools/radical.pilot.git
$ cd radical.pilot
$ git checkout feature/tiger
$ ./bin/radical-pilot-ve ve
$ source ve/bin/activate
(ve) $ pip install .
(ve) $ ./bin/radical-pilot-create-static-ve ~/radical.pilot.sandbox/ve.princeton.tiger/
(ve) $ ./examples/09_mpi_units.py princeton.tiger_cpu
but with `./examples/09_mpi_units.py` modified to use `ssh` as the schema in the resource allocation. Then I get an error:
caught Exception: prompted for unknown password (Password: ) (/home/uvaaland/scratch/tmp/radical.pilot/ve/lib/python2.7/site-packages/radical/saga/utils/pty_shell_factory.py +321 (_initialize_pty) : % match))
However, I have passwordless ssh set up from my desktop machine to Tiger, which I tested separately on the command line.
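For reference, the schema change mentioned above would look roughly like this. The `access_schema` field is part of RADICAL-Pilot's pilot description, but the surrounding values are paraphrased for illustration, not copied from the example:

```python
# Paraphrased sketch of the edit to examples/09_mpi_units.py (not the
# actual example code): select the 'ssh' access schema in the pilot
# description instead of the resource's default schema.
pd_init = {
    'resource'      : 'princeton.tiger_cpu',
    'access_schema' : 'ssh',   # force ssh instead of the default schema
    'runtime'       : 15,      # minutes (illustrative value)
    'cores'         : 20,      # illustrative value
}
```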
Trying to run the example script again, I get a similar, but different message:
Permission denied (keyboard-interactive).
)) (/home/uvaaland/scratch/tmp/radical.pilot/ve/lib/python2.7/site-packages/radical/saga/utils/pty_exceptions.py +44 (translate_exception) : elif 'pass' in lmsg: e = se.AuthenticationFailed(cmsg))
Traceback (most recent call last):
File "/home/uvaaland/scratch/tmp/radical.pilot/ve/lib/python2.7/site-packages/radical/saga/utils/pty_shell_factory.py", line 313, in _initialize_pty
n, match = pty_shell.find (prompt_patterns, delay)
File "/home/uvaaland/scratch/tmp/radical.pilot/ve/lib/python2.7/site-packages/radical/saga/utils/pty_process.py", line 796, in find
raise ptye.translate_exception (e, "(%s)" % data)
AuthenticationFailed: read from process failed '[Errno 5] Input/output error' : (f the following options:
1. Duo Push to XXX-XXX-3048
2. Phone call to XXX-XXX-3048
3. SMS passcodes to XXX-XXX-3048 (next code starts with: 1)
It is running into an issue with DUO, but this is interesting because it is not an issue if I `ssh` onto Tiger from the command line. That is, the following command:

ssh uvaaland@tigercpu.princeton.edu

works without prompting me for a password or DUO, but when I run the example in the same terminal window it fails with the DUO issue. The full `radical.saga.pty.log` file is attached.
Thanks Uno - can you please also set (if you don't have those yet)
export RADICAL_VERBOSE=DEBUG
and also attach a tarball with the session subdir and the logfiles therein? Thanks.
Great! I ran it again with the additional debug flag. Attached are the `radical.saga.pty.log` file and the RP session folder.
2019-05-16 08:54:58,155: radical.saga.pty : MainProcess : MainThread : DEBUG : read : [ 194] [ 43] (Permission denied (keyboard-interactive).\n)
2019-05-16 08:54:58,160: radical.saga.pty : MainProcess : MainThread : DEBUG : Traceback (most recent call last):
File "/home/uvaaland/scratch/tmp/radical.pilot/ve/lib/python2.7/site-packages/radical/saga/utils/pty_process.py", line 793, in find
data += self.read (timeout=_POLLDELAY)
File "/home/uvaaland/scratch/tmp/radical.pilot/ve/lib/python2.7/site-packages/radical/saga/utils/pty_process.py", line 682, in read
% (e, self.tail))
NoSuccess: read from process failed '[Errno 5] Input/output error' : (f the following options:
1. Duo Push to XXX-XXX-3048
2. Phone call to XXX-XXX-3048
3. SMS passcodes to XXX-XXX-3048 (next code starts with: 1)
Passcode or option (1-3):
Incorrect passcode. Please try again.
It seems like you don't have passwordless access from the machine you are using to the tiger cluster. Can you confirm if you do?
If you do, can you check if your hostname (/bin/hostname), on the machine you are trying to access from, is the same as your user id/login name on tiger?
Agreed. What I was telling Andre is that the connection works from the command line: no password, no DUO. But for some reason that is not the case when the pilot tries to do the same. I checked the resource file for Tiger and saw that it uses the same address as I do from the command line (uvaaland@tigercpu.princeton.edu), and the user is always my Princeton username (uvaaland).
Andre was going to see if he could find the line that the Pilot uses to connect such that I could try and run the same line from the command line and check if that works.
I see, okay. I think Andre will be able to debug this. FWIW, the ssh command executed seems to be:
/usr/bin/env TERM=vt100 /usr/bin/ssh -t -o ControlMaster=auto -o ControlPath=/tmp/saga_ssh_uvaaland_%h_%p.ctrl -o TCPKeepAlive=no -o ServerAliveInterval=10 -o ServerAliveCountMax=20 -o ConnectTimeout=10 tigercpu.princeton.edu
Taking a shot in the dark, can you try after adding the following to $HOME/.ssh/config
(if you don't already have this):
Host tigercpu.princeton.edu princeton.tigercpu
Hostname tigercpu.princeton.edu
User uvaaland
Yes, that is what I use, but I had not set the user explicitly. I just set this and tried to run it again, but the error is the same.
I tried running the above `ssh` command from the command line and indeed it gets stuck on DUO, but once I complete the DUO prompt it does not ask me for a password. So this is really a DUO issue.
Yea, I would say so. The 'User' is essentially your username on Tiger (`whoami` on Tiger should confirm that). After adding the entry to the .ssh/config file, I would suggest trying

/usr/bin/env TERM=vt100 /usr/bin/ssh -t -o ControlMaster=auto -o ControlPath=/tmp/saga_ssh_uvaaland_%h_%p.ctrl -o TCPKeepAlive=no -o ServerAliveInterval=10 -o ServerAliveCountMax=20 -o ConnectTimeout=10 tigercpu.princeton.edu

one more time.
> So this is really a DUO issue.

Maybe some of the parameters are conflicting with the DUO settings? Let's see what Andre suggests.
This is very interesting, because when I run the same command a second time in the same terminal, it goes right through without a DUO issue. That is, if I run the above command from the command line and pass DUO, then run the Pilot with the example script `./examples/09_mpi_units.py princeton.tiger_cpu`, it does not get stuck on DUO.
I still don't see any jobs submitted on Tiger, but the Pilot reaches the `gather results` stage.
So circumventing DUO is something that I might just have to do by running the ssh command in the command line once before I try to launch an application with the Pilot. It is not a perfect solution, but something we can work with at least.
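A likely explanation for this once-per-session behaviour: the SAGA ssh command shown earlier uses OpenSSH connection multiplexing (ControlMaster/ControlPath), so the first connection, authenticated interactively via Duo, leaves a control socket that subsequent connections reuse without re-authenticating. One way to keep such a master connection alive longer is a ControlPersist entry in `~/.ssh/config` - a hedged sketch using standard OpenSSH options (whether this interacts cleanly with SAGA's explicit ControlPath setting is an assumption):

```
# Sketch only: keep a multiplexed master connection to Tiger alive,
# so Duo is only prompted when the master itself is (re)established.
Host tigercpu.princeton.edu
    User           uvaaland
    ControlMaster  auto
    ControlPath    ~/.ssh/ctrl-%r@%h:%p
    ControlPersist 8h
```

With such an entry, a single interactive `ssh tigercpu.princeton.edu` (answering the Duo prompt once) would leave a socket that later non-interactive connections can reuse.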
Yea, that sounds quite odd. Maybe there is a DUO timeout per session? You can keep the session persistent by using something like tmux: start a tmux session, set up DUO, and continue using EnTK/RP in that session. You will be able to connect to and disconnect from this session; check out a quick how-to on tmux if required.
Yes, I am using tmux. The way I have DUO configured is such that you require one DUO authentication per internet connection. That is, as long as my tmux session is open and the internet does not drop on my local desktop (which is very rare since it uses a cable), I should not have to deal with DUO again.
Now that we have found a way to get past this step, I don't expect it to be an issue, and I will focus on what else might be needed to get the pilot example to work when launching it from my desktop machine. Currently, it hangs at `gather results` and has done so for the past 15 minutes. I have also checked the queue on Tiger and have not seen any jobs being submitted.
> The way I have DUO configured is such that you require one DUO authentication per internet connection.

Is this based on your IP? RP underneath starts multiple ssh connections (3, I think). Anyway, good that this isn't a halting issue anymore.
I am not sure. But it means that any time I `ssh` from the same machine, I will not be prompted for DUO. So it should not be an issue that it opens several ssh connections, as long as DUO has been completed once.
@jeroentromp expressed interest in setting up the pilot + EnTK on one of the Princeton clusters to have a local testing ground for us to try things out before taking them to Summit.
Princeton will be getting a couple of Summit nodes in the near future, which might make the setup easier.