hytest-org / hytest

https://hytest-org.github.io/hytest/

Problem using start script on the HPC #400

Closed amsnyder closed 11 months ago

amsnyder commented 11 months ago

Discussed in https://github.com/hytest-org/hytest/discussions/398

Originally posted by **mirizarry-ortiz** October 27, 2023

Hi! I followed the instructions here: https://hytest-org.github.io/hytest/environment_set_up/StartScript.html to start the Jupyter server on Denali, and I am having some trouble. First, I do not get a 127.0.0.1 URL when I run jupyter-start.sh; instead I get this error:

![image](https://github.com/hytest-org/hytest/assets/69687785/31f97117-27ca-4ebd-b5a7-28718225bd46)

My local terminal says this:

![image](https://github.com/hytest-org/hytest/assets/69687785/5d21eb8d-8c0f-4a32-827d-9d368ae8b76d)

I was able to start a Jupyter server via an older version of the start_jupyter script, but it is quite different from the current script, and I cannot get the Dask functionality to work in my notebooks when using the old script. Can anyone help? Any ideas of what might be going on? The only modifications I made to jupyter-start.sh were these:

![image](https://github.com/hytest-org/hytest/assets/69687785/70cbbdd9-7852-4d44-97ae-d5397daae03a)

I have been trying to get this working for a few weeks now, but I always get the same error message for srun. I've been trying to understand the error message, and all I've come up with is that it happens when one is trying to run a job within a job. I wonder if it has to do with the fact that we are doing two calls to srun (`srun --pty bash` within the salloc command) and then at the end of jupyter-start.sh we call srun again: `srun jupyter lab --ip '*' --no-browser --port $JPORT --notebook-dir $PWD`

amsnyder commented 11 months ago

Hi @mirizarry-ortiz - I am transferring this from a discussion to an issue because it was a bug we needed to fix. And I'll copy my response here for good record-keeping:

Thank you for raising this and doing some investigative work! You were right that the issue was caused by the double srun commands: the srun inside our start script needed to be removed. I fixed this with https://github.com/hytest-org/hytest/pull/399. Could you pull the latest copy of the repository onto Denali and try again? I think it will work for you now.

Could you let me know if this is working for you once you get a chance to test?
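
For anyone hitting the same nested-srun error, here is a minimal sketch of how a script could guard against it. This is not the change made in the pull request above, which simply removed the srun from the start script. The helper name `launch_prefix` is hypothetical, and the sketch assumes that Slurm sets `SLURM_JOB_ID` inside an allocation and `SLURM_STEP_ID` inside a job step (such as a shell started with `srun --pty bash`). The fallback port 8888 is also only an assumption for illustration. The script prints the command it would run instead of running it.

```shell
#!/bin/bash
# Hypothetical helper (not part of the hytest start script): print the
# launcher prefix to put in front of a command.
# Assumption: Slurm sets SLURM_JOB_ID inside an allocation and SLURM_STEP_ID
# inside a job step (for example, a shell started with `srun --pty bash`).
launch_prefix() {
    if [ -n "${SLURM_JOB_ID:-}" ] && [ -z "${SLURM_STEP_ID:-}" ]; then
        # In an allocation but not yet inside a step: srun is appropriate.
        echo "srun"
    else
        # Already inside a step (or not under Slurm at all): run directly.
        echo ""
    fi
}

# JPORT is the port variable used in jupyter-start.sh; 8888 is an assumed default.
JPORT="${JPORT:-8888}"

# Print the command that would be run, rather than running it.
echo "$(launch_prefix) jupyter lab --ip '*' --no-browser --port $JPORT --notebook-dir $PWD"
```

Inside a shell opened with `salloc ... srun --pty bash`, the prefix comes out empty under these assumptions, so `jupyter lab` would be started directly and the second srun is avoided.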