Closed timydaley closed 5 years ago
hey @timydaley ! The error message is hinting that the job submission failed (hence why you don't have an active job). Could you show me the output in the files that were printed / shown in the interface? They are on the cluster:
ssh sherlock cat /home/users/tdaley/forward-util/py3-jupyter.sbatch.out
ssh sherlock cat /home/users/tdaley/forward-util/py3-jupyter.sbatch.err
Wow! Thank you for the quick reply.
[tdaley@sh-ln05 login ~]$ ssh sherlock cat /home/users/tdaley/forward-util/py3-jupyter.sbatch.out
[tdaley@sh-ln05 login ~]$ ssh sherlock cat /home/users/tdaley/forward-util/py3-jupyter.sbatch.err
[I 18:17:21.359 NotebookApp] Writing notebook server cookie secret to /tmp/jupyter/notebook_cookie_secret
[I 18:17:21.459 NotebookApp] Serving notebooks from local directory: /scratch/groups/whwong/tdaley/sgRNA/CRISPRa-sgRNA-determinants/deepLearningMixtureRegression
[I 18:17:21.459 NotebookApp] 0 active kernels
[I 18:17:21.459 NotebookApp] The Jupyter Notebook is running at: http://localhost:58668/
[I 18:17:21.459 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
slurmstepd: error: JOB 34562816 ON sh-08-13 CANCELLED AT 2018-12-22T18:41:25
okay, so likely the job is being cancelled, per the slurmstepd error at the end of the log above.
So let's see if we can change those parameters to get a node that is working. Next, please try running the sbatch command on the cluster directly, but remove the custom directory to start in, and reduce the memory and time:
sbatch --job-name=py3-jupyter --partition=whwong --output=/home/users/tdaley/forward-util/py3-jupyter.sbatch.out --error=/home/users/tdaley/forward-util/py3-jupyter.sbatch.err --mem=8G --time=8:00:00 /home/users/tdaley/forward-util/py3-jupyter.sbatch 58668
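It may also help to ask Slurm why the earlier job ended: `sacct` can report a job's final state and exit code. A minimal sketch (the job ID is taken from the log above, and the `sacct` output below is a stubbed sample for illustration, since the real output depends on the cluster):

```shell
# On the cluster you would run something like:
#   sacct -j 34562816 --format=JobID,State,ExitCode --noheader
# Below, that output is stubbed so the parsing step can be shown:
sacct_out='34562816     CANCELLED      0:0
34562816.ba+ CANCELLED     0:15'

# Pull the State column of the top-level job record (first line)
state=$(printf '%s\n' "$sacct_out" | awk 'NR==1 {print $2}')
echo "Job state: $state"
```

A state of CANCELLED (versus, say, OUT_OF_MEMORY or TIMEOUT) narrows down which limit to adjust.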
If that works, you can add in each variable again to see what breaks it. If not, let's go a bit deeper. You can first get an interactive node with the same parameters:
srun --partition=whwong --mem=8G --time=8:00:00 --pty bash
and then walk through each of the commands in the sbatch script, and see if your job is killed (cancelled).
PORT=58668
NOTEBOOK_DIR=/scratch/PI/whwong/tdaley/sgRNA/CRISPRa-sgRNA-determinants/deepLearningMixtureRegression
cd $NOTEBOOK_DIR
module load py-jupyter/1.0.0_py36
jupyter notebook --no-browser --port=$PORT
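On the "add in each variable again to see what breaks it" suggestion: that bisection can be sketched as a loop that re-adds options one at a time and reports the first addition that makes submission fail. Here the function name and the stubbed submit command are illustrative, not part of the forward tool:

```shell
# find_breaking_option: re-add options one at a time to a submit
# command and print the first option whose addition makes it fail.
# On the cluster, "$submit" would wrap the real sbatch call.
find_breaking_option() {
  local submit=$1; shift
  local opts=""
  for opt in "$@"; do
    opts="$opts $opt"
    # Try submitting with the options accumulated so far
    if ! eval "$submit $opts"; then
      echo "$opt"        # this addition broke the submission
      return 0
    fi
  done
  echo "none"            # all options were fine
}
```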
Let me know if we learn anything! I think that Slurm is killing the job and it's not clear what the issue is!
Hi Vanessa, I'm getting the exact same error. Running the sbatch command on the cluster directly works (the job is successfully up and running). What else should I do to troubleshoot? Thanks!
If the error message reports no active jobs on the node, then likely there wasn't an active job at the moment the ssh was attempted. You can test this by launching the job as you are doing (which you report is working) and then issuing the ssh connection command from your host. If that works, then it's a timing issue (slurm is slow to allocate the job) and we can possibly add more delay. You can also try OnDemand https://login.sherlock.stanford.edu which has various notebooks, apps, etc.
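If it does turn out to be a timing issue, the added delay is just a retry loop. A minimal sketch of such an exponential-backoff wait, where the function name is illustrative and the `squeue` check in the comment is one possible readiness test, not the tool's actual code:

```shell
# wait_ready: retry a check command with exponential backoff.
# On the cluster you might pass a check like:
#   wait_ready 'squeue --job "$JOBID" -h -t RUNNING | grep -q .'
wait_ready() {
  local cmd=$1 max=${2:-7} delay=${3:-1} attempt=0
  while [ "$attempt" -lt "$max" ]; do
    if eval "$cmd"; then
      return 0            # check passed; safe to attempt ssh now
    fi
    sleep "$delay"
    delay=$((delay * 2))  # back off: 1, 2, 4, 8, ... seconds
    attempt=$((attempt + 1))
  done
  return 1                # gave up after $max attempts
}
```

Waiting for the job to reach the RUNNING state before forwarding the port would avoid the pam_slurm_adopt rejection.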
So, I ended up abandoning trying to get Jupyter to work because I found that Sherlock has a new service that makes using Jupyter notebooks easy.
https://www.sherlock.stanford.edu/docs/user-guide/ondemand/
yep that's exactly what I just referenced :)
Whoa. This is super simple. Thanks for this and for your quick responses!
Awesome! Since you are both happy (and I'm very glad to deprecate this tool for the better solution) I'm going to close the issue. Happy OnDemand-ing!
Hi Vanessa, Thank you for creating this tool on sherlock. I was following the instructions at https://vsoch.github.io/lessons/sherlock-jupyter/ and I ran into the following issue. I'm hoping you can help me understand the problem. When I run start.sh I get the following errors, and when I try to open the notebook in my browser (using the following address) it fails. But the job is running.
[tdaley@sh-ln08 login /scratch/PI/whwong/tdaley/programs/forward]$ bash start.sh py3-jupyter /scratch/PI/whwong/tdaley/sgRNA/CRISPRa-sgRNA-determinants/deepLearningMixtureRegression/
== Finding Script ==
Looking for sbatches/sherlock/py3-jupyter.sbatch
Script sbatches/sherlock/py3-jupyter.sbatch
== Checking for previous notebook ==
No existing py3-jupyter jobs found, continuing...
== Getting destination directory ==
== Uploading sbatch script ==
py3-jupyter.sbatch 100% 146 29.6KB/s 00:00
== Submitting sbatch ==
sbatch --job-name=py3-jupyter --partition=whwong --output=/home/users/tdaley/forward-util/py3-jupyter.sbatch.out --error=/home/users/tdaley/forward-util/py3-jupyter.sbatch.err --mem=20G --time=8:00:00 /home/users/tdaley/forward-util/py3-jupyter.sbatch 58668 "/scratch/PI/whwong/tdaley/sgRNA/CRISPRa-sgRNA-determinants/deepLearningMixtureRegression/"
Submitted batch job 34562816
== View logs in separate terminal ==
ssh sherlock cat /home/users/tdaley/forward-util/py3-jupyter.sbatch.out
ssh sherlock cat /home/users/tdaley/forward-util/py3-jupyter.sbatch.err
== Waiting for job to start, using exponential backoff ==
Attempt 0: not ready yet... retrying in 1..
Attempt 1: not ready yet... retrying in 2..
Attempt 2: not ready yet... retrying in 4..
Attempt 3: not ready yet... retrying in 8..
Attempt 4: not ready yet... retrying in 16..
Attempt 5: not ready yet... retrying in 32..
Attempt 6: resources allocated to sh-08-13!..
sh-08-13
sh-08-13
notebook running on sh-08-13
== Setting up port forwarding ==
ssh -L 58668:localhost:58668 sherlock ssh -L 58668:localhost:58668 -N sh-08-13 &
Access denied by pam_slurm_adopt: you have no active jobs on this node
Authentication failed.
== Connecting to notebook ==
[I 18:10:27.968 NotebookApp] Writing notebook server cookie secret to /tmp/jupyter/notebook_cookie_secret
[I 18:10:29.512 NotebookApp] Serving notebooks from local directory: /scratch/groups/whwong/tdaley/sgRNA/CRISPRa-sgRNA-determinants/deepLearningMixtureRegression
[I 18:10:29.512 NotebookApp] 0 active kernels
[I 18:10:29.512 NotebookApp] The Jupyter Notebook is running at: http://localhost:58667/
[I 18:10:29.512 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
slurmstepd: error: JOB 34562525 ON sh-08-13 CANCELLED AT 2018-12-22T18:14:40
== View logs in separate terminal ==
ssh sherlock cat /home/users/tdaley/forward-util/py3-jupyter.sbatch.out
ssh sherlock cat /home/users/tdaley/forward-util/py3-jupyter.sbatch.err
== Instructions ==
[tdaley@sh-ln08 login /scratch/PI/whwong/tdaley/programs/forward]$ jobs
34562816 whwong py3-jupy tdaley R 2:51 1 sh-08-13
Thank you for your help and I apologize if I missed something super obvious.