vsoch / forward

Port Forwarding Utility
https://vsoch.github.io/lessons/sherlock-singularity/
MIT License

Access denied by pam_slurm_adopt: you have no active jobs on this node #29

Closed: timydaley closed this issue 5 years ago

timydaley commented 5 years ago

Hi Vanessa, thank you for creating this tool for Sherlock. I was following the instructions at https://vsoch.github.io/lessons/sherlock-jupyter/ and I ran into the following issue; I'm hoping you can help me understand the problem. When I run start.sh I get the errors below, and when I try to open the notebook in my browser (using the address printed in the instructions) it fails, even though the job is running.

[tdaley@sh-ln08 login /scratch/PI/whwong/tdaley/programs/forward]$ bash start.sh py3-jupyter /scratch/PI/whwong/tdaley/sgRNA/CRISPRa-sgRNA-determinants/deepLearningMixtureRegression/
== Finding Script ==
Looking for sbatches/sherlock/py3-jupyter.sbatch
Script sbatches/sherlock/py3-jupyter.sbatch

== Checking for previous notebook ==
No existing py3-jupyter jobs found, continuing...

== Getting destination directory ==

== Uploading sbatch script ==
py3-jupyter.sbatch 100% 146 29.6KB/s 00:00

== Submitting sbatch ==
sbatch --job-name=py3-jupyter --partition=whwong --output=/home/users/tdaley/forward-util/py3-jupyter.sbatch.out --error=/home/users/tdaley/forward-util/py3-jupyter.sbatch.err --mem=20G --time=8:00:00 /home/users/tdaley/forward-util/py3-jupyter.sbatch 58668 "/scratch/PI/whwong/tdaley/sgRNA/CRISPRa-sgRNA-determinants/deepLearningMixtureRegression/"
Submitted batch job 34562816

== View logs in separate terminal ==
ssh sherlock cat /home/users/tdaley/forward-util/py3-jupyter.sbatch.out
ssh sherlock cat /home/users/tdaley/forward-util/py3-jupyter.sbatch.err

== Waiting for job to start, using exponential backoff ==
Attempt 0: not ready yet... retrying in 1..
Attempt 1: not ready yet... retrying in 2..
Attempt 2: not ready yet... retrying in 4..
Attempt 3: not ready yet... retrying in 8..
Attempt 4: not ready yet... retrying in 16..
Attempt 5: not ready yet... retrying in 32..
Attempt 6: resources allocated to sh-08-13!..
sh-08-13
sh-08-13
notebook running on sh-08-13

== Setting up port forwarding ==
ssh -L 58668:localhost:58668 sherlock ssh -L 58668:localhost:58668 -N sh-08-13 &
Access denied by pam_slurm_adopt: you have no active jobs on this node
Authentication failed.
== Connecting to notebook ==
[I 18:10:27.968 NotebookApp] Writing notebook server cookie secret to /tmp/jupyter/notebook_cookie_secret
[I 18:10:29.512 NotebookApp] Serving notebooks from local directory: /scratch/groups/whwong/tdaley/sgRNA/CRISPRa-sgRNA-determinants/deepLearningMixtureRegression
[I 18:10:29.512 NotebookApp] 0 active kernels
[I 18:10:29.512 NotebookApp] The Jupyter Notebook is running at: http://localhost:58667/
[I 18:10:29.512 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
slurmstepd: error: JOB 34562525 ON sh-08-13 CANCELLED AT 2018-12-22T18:14:40

== View logs in separate terminal ==
ssh sherlock cat /home/users/tdaley/forward-util/py3-jupyter.sbatch.out
ssh sherlock cat /home/users/tdaley/forward-util/py3-jupyter.sbatch.err

== Instructions ==

  1. Password, output, and error printed to this terminal? Look at logs (see instruction above)
  2. Browser: http://sh-02-21.int:58668/ -> http://localhost:58668/...
  3. To end session: bash end.sh py3-jupyter

[tdaley@sh-ln08 login /scratch/PI/whwong/tdaley/programs/forward]$ jobs
34562816 whwong py3-jupy tdaley R 2:51 1 sh-08-13

Thank you for your help and I apologize if I missed something super obvious.

vsoch commented 5 years ago

hey @timydaley ! The error message is hinting that the job submission failed (hence why you don't have an active job). Could you show me the output in the files that were printed / shown in the interface? They are on the cluster:

ssh sherlock cat /home/users/tdaley/forward-util/py3-jupyter.sbatch.out
ssh sherlock cat /home/users/tdaley/forward-util/py3-jupyter.sbatch.err
timydaley commented 5 years ago

Wow! Thank you for the quick reply.

[tdaley@sh-ln05 login ~]$ ssh sherlock cat /home/users/tdaley/forward-util/py3-jupyter.sbatch.out
[tdaley@sh-ln05 login ~]$ ssh sherlock cat /home/users/tdaley/forward-util/py3-jupyter.sbatch.err
[I 18:17:21.359 NotebookApp] Writing notebook server cookie secret to /tmp/jupyter/notebook_cookie_secret
[I 18:17:21.459 NotebookApp] Serving notebooks from local directory: /scratch/groups/whwong/tdaley/sgRNA/CRISPRa-sgRNA-determinants/deepLearningMixtureRegression
[I 18:17:21.459 NotebookApp] 0 active kernels
[I 18:17:21.459 NotebookApp] The Jupyter Notebook is running at: http://localhost:58668/
[I 18:17:21.459 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
slurmstepd: error: JOB 34562816 ON sh-08-13 CANCELLED AT 2018-12-22T18:41:25

vsoch commented 5 years ago

okay, so likely the job is being cancelled, per the last line of the error log:

slurmstepd: error: JOB 34562816 ON sh-08-13 CANCELLED AT 2018-12-22T18:41:25

So let's see if we can change those parameters to get a node that works. Next, please try running the sbatch command on the cluster directly, but remove the custom directory to start in and reduce the memory:

sbatch --job-name=py3-jupyter --partition=whwong --output=/home/users/tdaley/forward-util/py3-jupyter.sbatch.out --error=/home/users/tdaley/forward-util/py3-jupyter.sbatch.err --mem=8G --time=8:00:00 /home/users/tdaley/forward-util/py3-jupyter.sbatch 58668 
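To confirm that the job actually stays running (rather than being cancelled a few minutes in), you can watch its state with squeue; a quick check, assuming the same job name as above:

# 'R' means running; if the job disappears shortly after starting,
# Slurm cancelled it.
squeue -u $USER -n py3-jupyter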

If that works, you can add in each variable again to see what breaks it. If not, let's go a bit deeper. You can first get an interactive node with the same parameters:

srun --partition=whwong --mem=8G --time=8:00:00 --pty bash

and then walk through each of the commands in the sbatch script to see if your job is killed (cancelled).

PORT=58668
# Directory to serve notebooks from (the directory you passed to start.sh)
NOTEBOOK_DIR=/scratch/PI/whwong/tdaley/sgRNA/CRISPRa-sgRNA-determinants/deepLearningMixtureRegression/
cd $NOTEBOOK_DIR

module load py-jupyter/1.0.0_py36
jupyter notebook --no-browser --port=$PORT

Let me know if we learn anything! I think that Slurm is killing the job and it's not clear what the issue is!
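If it helps, Slurm's accounting database usually records how a job ended. A quick check, using the job ID from the log above (these sacct fields are standard, though availability depends on the cluster's accounting setup):

# Final state and exit code of the cancelled job; a state of
# CANCELLED, OUT_OF_MEMORY, or TIMEOUT narrows down the cause.
sacct -j 34562816 --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS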

nathankong commented 5 years ago

Hi Vanessa, I'm getting the exact same error. Running the sbatch command on the cluster directly works (the job is successfully up and running). What else should I do to troubleshoot? Thanks!

vsoch commented 5 years ago

If the error message reports no active jobs on the node, then likely there was no active job at the moment the ssh was attempted. You can test this by launching the job as you are doing (and reporting as working), and then issuing the command to connect with ssh from your host. If that works, it's a timing issue (Slurm is slow to allocate the job) and we can possibly add more delay. You can also try OnDemand at https://login.sherlock.stanford.edu, which has various notebooks, apps, etc.
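For reference, a manual version of that timing test might look like the following sketch (job name and ports taken from the logs above; the node lookup via squeue is my assumption, not something the tool does):

# Find the node the running job was allocated to...
NODE=$(squeue -u $USER -h -o %N -n py3-jupyter)
# ...then try the same two-hop tunnel that start.sh sets up.
ssh -L 58668:localhost:58668 sherlock ssh -L 58668:localhost:58668 -N $NODE

If this succeeds once the job shows as running, the original failure was a race between the job allocation and the ssh attempt, and adding more delay would help.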

timydaley commented 5 years ago

So, I ended up abandoning trying to get Jupyter to work because I found that Sherlock has a new service that makes using Jupyter notebooks easy.
https://www.sherlock.stanford.edu/docs/user-guide/ondemand/

vsoch commented 5 years ago

yep that's exactly what I just referenced :)

nathankong commented 5 years ago

Whoa. This is super simple. Thanks for this and for your quick responses!

vsoch commented 5 years ago

Awesome! Since you are both happy (and I'm very glad to deprecate this tool for the better solution) I'm going to close the issue. Happy OnDemand-ing!