ncar-xdev / jupyter-forward

Launch Jupyter Lab over SSH with a single command!
https://jupyter-forward.readthedocs.io/
Apache License 2.0
31 stars 12 forks source link

Hanging on parsing #148

Closed rsignell-usgs closed 2 years ago

rsignell-usgs commented 2 years ago

Tried jupyter-forward for the first time on our USGS HPC machine. Was looking good, but then hung on the line:

 Parsing $HOME/.jupyter_forward/log_2022-03-21T07-51-39.txt log file on denali.cr.usgs.gov for jupyter information

Full output:

$ jupyter-forward rsignell@denali --launch-command "srun -A woodshole -N 1 -t 00:30:00" --conda-env pangeo --port 8889
**************************************************** Authenticating ****************************************************Authenticating user (rsignell) from client (IGSAGIEGLTRSI10) to remote host (denali.cr.usgs.gov)
✅ The client is authenticated successfully
*********************************************** Verifying shell location ***********************************************✅ Using shell: /bin/bash
******************************************** Running jupyter sanity checks *********************************************/bin/bash -lc "source activate pangeo && which jupyter"
/home/rsignell/miniconda3/bin/jupyter
***************************************  Creating log file on denali.cr.usgs.gov ***************************************/bin/bash -lc "printenv TMPDIR"
/bin/bash -lc "printenv HOME"
/home/rsignell
/bin/bash -lc "touch $HOME/foobar && rm -rf $HOME/foobar && echo '$HOME is WRITABLE' || echo '$HOME is NOT WRITABLE'"
/home/rsignell is WRITABLE
/bin/bash -lc "mkdir -p $HOME/.jupyter_forward"
✅ Log directory is set to $HOME/.jupyter_forward
/bin/bash -lc "touch $HOME/.jupyter_forward/log_2022-03-21T07-51-39.txt"
✅ Log file is set to $HOME/.jupyter_forward/log_2022-03-21T07-51-39.txt
********************************************** Preparing Batch Job script **********************************************/bin/bash -lc "echo -n '#!' > $HOME/.jupyter_forward/batch_job_script_2022-03-21T07-51-39"
/bin/bash -lc "echo /bin/bash -l >> $HOME/.jupyter_forward/batch_job_script_2022-03-21T07-51-39"
/bin/bash -lc "echo 'source activate pangeo && jupyter lab --no-browser --ip=\$(hostname -f) > $HOME/.jupyter_forward/log_2022-03-21T07-51-39.txt 2>&1' >> $HOME/.jupyter_forward/batch_job_script_2022-03-21T07-51-39"
/bin/bash -lc "chmod +x $HOME/.jupyter_forward/batch_job_script_2022-03-21T07-51-39"
✅ Batch Job script resides in $HOME/.jupyter_forward/batch_job_script_2022-03-21T07-51-39
************************************************ Launching Jupyter Lab *************************************************/bin/bash -c "srun -p compute -N 1 -t 00:30:00 $HOME/.jupyter_forward/batch_job_script_2022-03-21T07-51-39"
🌨   Parsing $HOME/.jupyter_forward/log_2022-03-21T07-51-39.txt log file on denali.cr.usgs.gov for jupyter information   **************************** ❌ Terminated the network 📡 connection to denali.cr.usgs.gov *****************************

The log file looks like:


#!/bin/bash -l
source activate pangeo && jupyter lab --no-browser --ip=$(hostname -f) > /home/rsignell/.jupyter_forward/log_2022-03-21T
08-27-02.txt 2>&1
kmpaul commented 2 years ago

Hey, @rsignell-usgs! What's in the $HOME/.jupyter_forward/batch_job_script_2022-03-21T07-51-39 file on denali?

kmpaul commented 2 years ago

And what about $HOME/.jupyter_forward/log_2022-03-21T07-51-39.txt?

rsignell-usgs commented 2 years ago

I just tried running again with:

 jupyter-forward rsignell@denali --launch-command "srun -A woodshole  srun -A woodshole -N 4 --mem=32G -N 1 -t 00:30:00" --conda-env pangeo --port 8889

on our USGS HPC system "denali" and got same hanging-on-parsing problem.

To answer your questions, the log file is empty, and batch file is:

(pangeo) rsignell@nid00415:~/.jupyter_forward> ls -sailrt
total 60
180144136509293350 4 drwxr-x--- 34 rsignell users 4096 Mar 22 09:34 ..
180144560469448371 0 -rw-r--r--  1 rsignell users    0 Mar 22 09:34 log_2022-03-22T10-34-24.txt
180144314163144374 4 drwxr-xr-x  2 rsignell users 4096 Mar 22 09:34 .
180144560469448372 4 -rwxr-xr-x  1 rsignell users  153 Mar 22 09:34 batch_job_script_2022-03-22T10-34-24

(pangeo) rsignell@nid00415:~/.jupyter_forward> more batch_job_script_2022-03-22T10-34-24
#!/bin/bash -l
source activate pangeo && jupyter lab --no-browser --ip=$(hostname -f) > /home/rsignell/.jupyter_forward/log_2022-03-22T
10-34-24.txt 2>&1
(pangeo) rsignell@nid00415:~/.jupyter_forward>
andersy005 commented 2 years ago

I'm not 💯 sure but something (on the compute node) could be going wrong here

source activate pangeo && jupyter lab --no-browser --ip=$(hostname -f)

Can you confirm that everything works when launching the server on the login node?

 jupyter-forward rsignell@denali  --conda-env pangeo --port 8889
rsignell-usgs commented 2 years ago

@andersy005, I tried it -- no go:

$  jupyter-forward rsignell@denali  --conda-env pangeo --port 8889
**************************************************** Authenticating ****************************************************
Authenticating user (rsignell) from client (IGSAGIEGLTRSI10) to remote host (denali.cr.usgs.gov)
✅ The client is authenticated successfully
*********************************************** Verifying shell location ***********************************************
✅ Using shell: /bin/bash
******************************************** Running jupyter sanity checks *********************************************
/bin/bash -lc "source activate pangeo && which jupyter"
/home/rsignell/miniconda3/bin/jupyter
***************************************  Creating log file on denali.cr.usgs.gov ***************************************
/bin/bash -lc "printenv TMPDIR"
/bin/bash -lc "printenv HOME"
/home/rsignell
/bin/bash -lc "touch $HOME/foobar && rm -rf $HOME/foobar && echo '$HOME is WRITABLE' || echo '$HOME is NOT WRITABLE'"
/home/rsignell is WRITABLE
/bin/bash -lc "mkdir -p $HOME/.jupyter_forward"
✅ Log directory is set to $HOME/.jupyter_forward
/bin/bash -lc "touch $HOME/.jupyter_forward/log_2022-03-22T12-40-09.txt"
✅ Log file is set to $HOME/.jupyter_forward/log_2022-03-22T12-40-09.txt
nid00038
************************************************ Launching Jupyter Lab *************************************************
/bin/bash -c "source activate pangeo && jupyter lab --no-browser --ip=nid00038 > $HOME/.jupyter_forward/log_2022-03-22T12-40-09.txt 2>&1"

And then it says it's parsing the log file, but the log file is empty.

I'd be happy to screenshare/debug this.

andersy005 commented 2 years ago

Thanks for trying!

I'd be happy to screenshare/debug this.

I am available in the afternoon (2pm - 3pm Mountain Time).

In the meantime, do you mind downgrading to jupyter-forward v2022.1.15 and trying again? I'm starting to think that the log redirection changes introduced in #138 may have inadvertently broken other things.

rsignell-usgs commented 2 years ago

Tried the 2022.1.15 version, but still hanging on: ⛈ Parsing $HOME/.jupyter_forward/log.2022-03-22T13-12-21 log file on denali.cr.usgs.gov for jupyter information

rsignell-usgs commented 2 years ago

I could meet at 4pm ET = 2pm Mountain. I'll send you a calendar invite.

mnlevy1981 commented 2 years ago

I'd be happy to join the call as well, if that would be useful.

andersy005 commented 2 years ago

@rsignell-usgs

I could meet at 4pm ET = 2pm Mountain. I'll send you a calendar invite.

Did you send the invite to my UCAR address? I'm no longer employed at NCAR and that email account has been disabled

andersy005 commented 2 years ago

We could use https://whereby.com/pangeo

kmpaul commented 2 years ago

How did this go?

andersy005 commented 2 years ago

Unfortunately, we didn't make any progress (neither v2022.1.15 nor v2022.2.25 worked). It's worth testing this on cheyenne/casper or some other machine to see if one observes the same behavior.

kmpaul commented 2 years ago

I'll try it in a few minutes. My daughter is sick today, though, so I may not have much time.

kmpaul commented 2 years ago

I can verify that it hangs when trying to launch a JLab on Cheyenne.

kmpaul commented 2 years ago

I am going to try looking more closely at this today, assuming I have time.

andersy005 commented 2 years ago

I'll try it in a few minutes. My daughter is sick today, though, so I may not have much time.

Thank you for looking into this, @kmpaul! I hope she is feeling better

kmpaul commented 2 years ago

Thanks, @andersy005! My daughter bounced back to life last night at 3:30am. 😄 So, I'm a bit tired, but everyone is well.

rsignell-usgs commented 2 years ago

I have some time now (and for the next few hours) if we want to try live debug

kmpaul commented 2 years ago

I'm running to get some lunch. Let me ping you in about 10 minutes.

kmpaul commented 2 years ago

Ok. I'm now looking at this. I'm not sure I know how to diagnose the problem, yet. Hold on.

kmpaul commented 2 years ago

I've reproduced the problem on my end, and it appears to be caused by jupyter lab --no-browser command ending early. So, when the parsing of the log file starts, the log file is empty and remains empty.

rsignell-usgs commented 2 years ago

@kmpaul that's great news for me -- means it's not just a weird problem with our system! So is there anything I can do to help, or just wait?

kmpaul commented 2 years ago

Well, at this point, I think it's a "just wait" thing. I can't seem to figure out why the jupyter lab command is exiting early. I'm trying to reproduce the entire workflow with straight SSH right now.

andersy005 commented 2 years ago

@rsignell-usgs and @kmpaul, which version of jupyterlab are you running on the remote machine? I'm asking because I am unable to reproduce the issue using the following versions:

$ jupyter --version
Selected Jupyter core packages...
IPython          : 8.0.1
ipykernel        : 6.9.1
ipywidgets       : 7.6.5
jupyter_client   : 7.1.2
jupyter_core     : 4.9.2
jupyter_server   : 1.13.5
jupyterlab       : 3.2.9
nbclient         : 0.5.11
nbconvert        : 6.4.2
nbformat         : 5.1.3
notebook         : 6.4.8
qtconsole        : not installed
traitlets        : 5.1.1
kmpaul commented 2 years ago

I am using:

(pangeo) kpaul@cheyenne2:~> jupyter --version
Selected Jupyter core packages...
IPython          : 8.1.1
ipykernel        : 6.9.2
ipywidgets       : not installed
jupyter_client   : 7.1.2
jupyter_core     : 4.9.2
jupyter_server   : 1.15.6
jupyterlab       : 3.3.2
nbclient         : 0.5.13
nbconvert        : 6.4.4
nbformat         : 5.2.0
notebook         : 6.4.10
qtconsole        : not installed
traitlets        : 5.1.1
kmpaul commented 2 years ago

What version of paramiko and fabric are you using, @andersy005?

kmpaul commented 2 years ago

I am using fabric 2.6.0 and paramiko 2.10.3.

andersy005 commented 2 years ago
# Name                    Version                   Build  Channel
fabric                    2.6.0              pyhd8ed1ab_1    conda-forge
paramiko                  2.10.3             pyhd8ed1ab_0    conda-forge
kmpaul commented 2 years ago

What shell are you using on the remote system, @andersy005?

kmpaul commented 2 years ago

For me, it seems to be due to the Jupyter launch command being run with /bin/bash -c instead of /bin/bash -lc.

kmpaul commented 2 years ago

If I run the full command:

/bin/bash -c "conda activate pangeo && jupyter lab --no-browser --ip=cheyenne2.cheyenne.ucar.edu > $TMPDIR/.jupyter_forward/log_2022-03-24T13-28-45.txt 2>&1"

with vanilla SSH, it errors out saying that it can't find conda. But if I change the command to:

/bin/bash -lc "conda activate pangeo && jupyter lab --no-browser --ip=cheyenne2.cheyenne.ucar.edu > $TMPDIR/.jupyter_forward/log_2022-03-24T13-28-45.txt 2>&1"

It works fine. Makes sense. You need a login shell to activate conda, right?

andersy005 commented 2 years ago

I am using bash. I upgraded my environment on the remote machine to the latest versions, and I started running into the same issue.

$ jupyter --version
Selected Jupyter core packages...
IPython          : 8.1.1
ipykernel        : 6.9.2
ipywidgets       : 7.7.0
jupyter_client   : 7.1.2
jupyter_core     : 4.9.2
jupyter_server   : 1.15.6
jupyterlab       : 3.3.2
nbclient         : 0.5.13
nbconvert        : 6.4.4
nbformat         : 5.2.0
notebook         : 6.4.10
qtconsole        : not installed
traitlets        : 5.1.1

It appears that there may have been some recent changes in jupyterlab or jupyter_server or some other package that are causing this failure.

kmpaul commented 2 years ago

That's very weird. How is it that running /bin/bash -c "conda activate" works for you without the -lc flag?

mnlevy1981 commented 2 years ago

That's very weird. How is it that running /bin/bash -c "conda activate" works for you without the -lc flag?

I thought

  1. The -l flag controls whether or not .bash_profile is sourced, but .bashrc is always sourced
  2. conda was initialized in .bashrc

For me, I know (2) is true -- the following is in .bashrc:

# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/glade/work/mlevy/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/glade/work/mlevy/miniconda3/etc/profile.d/conda.sh" ]; then
        . "/glade/work/mlevy/miniconda3/etc/profile.d/conda.sh"
    else
        export PATH="/glade/work/mlevy/miniconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<
kmpaul commented 2 years ago

Okay. I just reverted my versions of jupyter to the following:

Selected Jupyter core packages...
IPython          : 8.1.1
ipykernel        : 6.9.2
ipywidgets       : not installed
jupyter_client   : 7.1.2
jupyter_core     : 4.9.2
jupyter_server   : 1.13.5
jupyterlab       : 3.2.9
nbclient         : 0.5.13
nbconvert        : 6.4.4
nbformat         : 5.2.0
notebook         : 6.4.10
qtconsole        : not installed
traitlets        : 5.1.1

And it still fails for me. So, I don't think it has anything to do with the versions of jupyter on the remote system.

kmpaul commented 2 years ago

@mnlevy1981: For me, I am using the system conda module, but the module load conda is being done in .bashrc. So, I think we are both initializing conda in the same place.

However, it still doesn't explain my experiment. Try running (from your laptop) the following command:

ssh user@cheyenne.ucar.edu '/bin/bash -c "conda activate"'

and then run

ssh user@cheyenne.ucar.edu '/bin/bash -lc "conda activate"'

The first version (without the -l) fails for me with:

CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
To initialize your shell, run

    $ conda init <SHELL_NAME>

Currently supported shells are:
  - bash
  - fish
  - tcsh
  - xonsh
  - zsh
  - powershell

See 'conda init --help' for more information and options.

IMPORTANT: You may need to close and restart your shell after running 'conda init'.
kmpaul commented 2 years ago

Okay. Just so we can verify what SSH is doing, I did an experiment to see what files are being executed on the remote system. I added echo "READING BASHRC FILE" to my ~/.bashrc and echo "READING PROFILE" to my ~/.profile. Then I ran the previous two commands:

~> ssh kpaul@cheyenne.ucar.edu '/bin/bash -c "conda activate"'
Token_Response: 
READING BASHRC FILE

CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
To initialize your shell, run

    $ conda init <SHELL_NAME>

Currently supported shells are:
  - bash
  - fish
  - tcsh
  - xonsh
  - zsh
  - powershell

See 'conda init --help' for more information and options.

IMPORTANT: You may need to close and restart your shell after running 'conda init'.

And then:

~> ssh kpaul@cheyenne.ucar.edu '/bin/bash -lc "conda activate"'
Token_Response: 
READING BASHRC FILE
READING BASHRC FILE
READING PROFILE
kmpaul commented 2 years ago

So, it is loading the .bashrc file both times. (And, just for clarification, my .profile file is basically empty except for the echo command.) With the -l option, it loads it twice.

So, something tells me that this must have something to do with using LMOD instead of a hand-installed miniconda.

mnlevy1981 commented 2 years ago

@mnlevy1981: For me, I am using the system conda module, but the module load conda is being done in .bashrc. So, I think we are both initializing conda in the same place.

However, it still doesn't explain my experiment. Try running (from your laptop) the following command:

ssh user@cheyenne.ucar.edu '/bin/bash -c "conda activate"'

and then run

ssh user@cheyenne.ucar.edu '/bin/bash -lc "conda activate"'

Oh, weird - I get the errors with -c but not with -lc as well... and I get the same behavior (-lc works, -c does not) on casper, which is really bizarre because I definitely tested the latest jupyter-forward code by running it on casper.

Regardless, isn't the latest version using -lc for everything except csh and tcsh? https://github.com/ncar-xdev/jupyter-forward/blob/main/jupyter_forward/core.py#L127

        if 'csh' in self.shell:
            command = f'''{self.shell} -c "{command}"'''
        else:
            command = f'''{self.shell} -lc "{command}"'''

Unless the initial conda call isn't using run_command...

kmpaul commented 2 years ago

No. There is a separate call on line 198:

https://github.com/ncar-xdev/jupyter-forward/blob/8e34462600699cae6e4a4bacbd31bfd35ab7af29/jupyter_forward/core.py#L198

kmpaul commented 2 years ago

@mnlevy1981: Unless the initial conda call isn't using run_command...

Yep. You guessed it. 😄

kmpaul commented 2 years ago

Seems like line 198 should be refactored to use run_command with the appropriate kwargs.

mnlevy1981 commented 2 years ago

No. There is a separate call on line 198:

https://github.com/ncar-xdev/jupyter-forward/blob/8e34462600699cae6e4a4bacbd31bfd35ab7af29/jupyter_forward/core.py#L198

So it looks like the solution is to modify that line

-  self.session.run(f'{self.shell} -c "{command}"', asynchronous=True, pty=True, echo=True)
+  self.run_command(command, asynchronous=True, pty=True, echo=True)

?

kmpaul commented 2 years ago

Yeah. 😄

kmpaul commented 2 years ago

I'm glad you were able to reproduce the error without using LMOD. I was starting to think it had to have something to do with that... Phew!

kmpaul commented 2 years ago

@mnlevy1981: I don't think that change will work because of the asynchronous=True argument. I think you cannot use run_command in this case.

kmpaul commented 2 years ago

Or, rather, you need to refactor the run_command to work with asynchronous execution.

mnlevy1981 commented 2 years ago

Another option is to do something like

        if 'csh' in self.shell:
            self.run_shell = f"{self.shell} -c"
        else:
            self.run_shell = f"{self.shell} -lc"

in _check_shell() and then both run_command() and _launch_jupyter() could rely on f'{self.run_shell} "{command}"'

mnlevy1981 commented 2 years ago

(or a self._get_run_command() function that returns the proper string)