Closed rsignell-usgs closed 2 years ago
Hey, @rsignell-usgs! What's in the $HOME/.jupyter_forward/batch_job_script_2022-03-21T07-51-39
file on denali?
And what about $HOME/.jupyter_forward/log_2022-03-21T07-51-39.txt
?
I just tried running again with:
jupyter-forward rsignell@denali --launch-command "srun -A woodshole srun -A woodshole -N 4 --mem=32G -N 1 -t 00:30:00" --conda-env pangeo --port 8889
on our USGS HPC system "denali" and got same hanging-on-parsing problem.
To answer your questions, the log file is empty, and batch file is:
(pangeo) rsignell@nid00415:~/.jupyter_forward> ls -sailrt
total 60
180144136509293350 4 drwxr-x--- 34 rsignell users 4096 Mar 22 09:34 ..
180144560469448371 0 -rw-r--r-- 1 rsignell users 0 Mar 22 09:34 log_2022-03-22T10-34-24.txt
180144314163144374 4 drwxr-xr-x 2 rsignell users 4096 Mar 22 09:34 .
180144560469448372 4 -rwxr-xr-x 1 rsignell users 153 Mar 22 09:34 batch_job_script_2022-03-22T10-34-24
(pangeo) rsignell@nid00415:~/.jupyter_forward> more batch_job_script_2022-03-22T10-34-24
#!/bin/bash -l
source activate pangeo && jupyter lab --no-browser --ip=$(hostname -f) > /home/rsignell/.jupyter_forward/log_2022-03-22T
10-34-24.txt 2>&1
(pangeo) rsignell@nid00415:~/.jupyter_forward>
I'm not 💯 sure but something (on the compute node) could be going wrong here
source activate pangeo && jupyter lab --no-browser --ip=$(hostname -f)
Can you confirm that everything works when launching the server on the login node?
jupyter-forward rsignell@denali --conda-env pangeo --port 8889
@andersy005, I tried it -- no go:
$ jupyter-forward rsignell@denali --conda-env pangeo --port 8889
**************************************************** Authenticating ****************************************************
Authenticating user (rsignell) from client (IGSAGIEGLTRSI10) to remote host (denali.cr.usgs.gov)
✅ The client is authenticated successfully
*********************************************** Verifying shell location ***********************************************
✅ Using shell: /bin/bash
******************************************** Running jupyter sanity checks *********************************************
/bin/bash -lc "source activate pangeo && which jupyter"
/home/rsignell/miniconda3/bin/jupyter
*************************************** Creating log file on denali.cr.usgs.gov ***************************************
/bin/bash -lc "printenv TMPDIR"
/bin/bash -lc "printenv HOME"
/home/rsignell
/bin/bash -lc "touch $HOME/foobar && rm -rf $HOME/foobar && echo '$HOME is WRITABLE' || echo '$HOME is NOT WRITABLE'"
/home/rsignell is WRITABLE
/bin/bash -lc "mkdir -p $HOME/.jupyter_forward"
✅ Log directory is set to $HOME/.jupyter_forward
/bin/bash -lc "touch $HOME/.jupyter_forward/log_2022-03-22T12-40-09.txt"
✅ Log file is set to $HOME/.jupyter_forward/log_2022-03-22T12-40-09.txt
nid00038
************************************************ Launching Jupyter Lab *************************************************
/bin/bash -c "source activate pangeo && jupyter lab --no-browser --ip=nid00038 > $HOME/.jupyter_forward/log_2022-03-22T12-40-09.txt 2>&1"
And then it says it's parsing the log file, but the log file is empty.
I'd be happy to screenshare/debug this.
Thanks for trying!
I'd be happy to screenshare/debug this.
I am available in the afternoon (2pm - 3pm Mountain Time).
In the meantime, do you mind downgrading to jupyter-forward v2022.1.15
and trying again? I'm starting to think that the log redirection changes introduced in #138 may have inadvertently broken other things.
Tried the 2022.1.15 version, but still hanging on: ⛈ Parsing $HOME/.jupyter_forward/log.2022-03-22T13-12-21 log file on denali.cr.usgs.gov for jupyter information
I could meet at 4pm ET = 2pm Mountain. I'll send you a calendar invite.
I'd be happy to join the call as well, if that would be useful.
@rsignell-usgs
I could meet at 4pm ET = 2pm Mountain. I'll send you a calendar invite.
Did you send the invite to my UCAR address? I'm no longer employed at NCAR and that email account has been disabled
We could use https://whereby.com/pangeo
How did this go?
Unfortunately, we didn't make any progress (neither v2022.1.15
nor v2022.2.25
worked). It's worth testing this on cheyenne/casper or some other machine to see if one observes the same behavior.
I'll try it in a few minutes. My daughter is sick today, though, so I may not have much time.
I can verify that it hangs when trying to launch a JLab on Cheyenne.
I am going to try looking more closely at this today, assuming I have time.
I'll try it in a few minutes. My daughter is sick today, though, so I may not have much time.
Thank you for looking into this, @kmpaul! I hope she is feeling better
Thanks, @andersy005! My daughter bounced back to life last night at 3:30am. 😄 So, I'm a bit tired, but everyone is well.
I have some time now (and for the next few hours) if we want to try live debug
I'm running to get some lunch. Let me ping you in about 10 minutes.
Ok. I'm now looking at this. I'm not sure I know how to diagnose the problem, yet. Hold on.
I've reproduced the problem on my end, and it appears to be caused by jupyter lab --no-browser
command ending early. So, when the parsing of the log file starts, the log file is empty and remains empty.
@kmpaul that's great news for me -- means it's not just a weird problem with our system! So is there anything I can do to help, or just wait?
Well, at this point, I think it's a "just wait" thing. I can't seem to figure out why the jupyter lab
command is exiting early. I'm trying to reproduce the entire workflow with straight SSH right now.
@rsignell-usgs and @kmpaul, which version of jupyterlab
are you running on the remote machine? I'm asking because I am unable to reproduce the issue using the following versions:
$ jupyter --version
Selected Jupyter core packages...
IPython : 8.0.1
ipykernel : 6.9.1
ipywidgets : 7.6.5
jupyter_client : 7.1.2
jupyter_core : 4.9.2
jupyter_server : 1.13.5
jupyterlab : 3.2.9
nbclient : 0.5.11
nbconvert : 6.4.2
nbformat : 5.1.3
notebook : 6.4.8
qtconsole : not installed
traitlets : 5.1.1
I am using:
(pangeo) kpaul@cheyenne2:~> jupyter --version
Selected Jupyter core packages...
IPython : 8.1.1
ipykernel : 6.9.2
ipywidgets : not installed
jupyter_client : 7.1.2
jupyter_core : 4.9.2
jupyter_server : 1.15.6
jupyterlab : 3.3.2
nbclient : 0.5.13
nbconvert : 6.4.4
nbformat : 5.2.0
notebook : 6.4.10
qtconsole : not installed
traitlets : 5.1.1
What version of paramiko and fabric are you using, @andersy005?
I am using fabric 2.6.0
and paramiko 2.10.3
.
# Name Version Build Channel
fabric 2.6.0 pyhd8ed1ab_1 conda-forge
paramiko 2.10.3 pyhd8ed1ab_0 conda-forge
What shell are you using on the remote system, @andersy005?
For me, it seems to be due to the Jupyter launch command being run with /bin/bash -c
instead of /bin/bash -lc
.
If I run the full command:
/bin/bash -c "conda activate pangeo && jupyter lab --no-browser --ip=cheyenne2.cheyenne.ucar.edu > $TMPDIR/.jupyter_forward/log_2022-03-24T13-28-45.txt 2>&1"
with vanilla SSH, it errors out saying that it can't find conda
. But if I change the command to:
/bin/bash -lc "conda activate pangeo && jupyter lab --no-browser --ip=cheyenne2.cheyenne.ucar.edu > $TMPDIR/.jupyter_forward/log_2022-03-24T13-28-45.txt 2>&1"
It works fine. Makes sense. You need a login shell to activate conda, right?
I am using bash
. I upgraded my environment on the remote machine to the latest versions, and I started running into the same issue.
$ jupyter --version
Selected Jupyter core packages...
IPython : 8.1.1
ipykernel : 6.9.2
ipywidgets : 7.7.0
jupyter_client : 7.1.2
jupyter_core : 4.9.2
jupyter_server : 1.15.6
jupyterlab : 3.3.2
nbclient : 0.5.13
nbconvert : 6.4.4
nbformat : 5.2.0
notebook : 6.4.10
qtconsole : not installed
traitlets : 5.1.1
It appears that there may have been some recent changes in jupyterlab
or jupyter_server
or some other package that are causing this failure.
That's very weird. How is it that running /bin/bash -c "conda activate"
works for you without the -lc
flag?
That's very weird. How is it that running
/bin/bash -c "conda activate"
works for you without the-lc
flag?
I thought
-l
flag controls whether or not .bash_profile
is sourced, but .bashrc
is always sourcedconda
was initialized in .bashrc
For me, I know (2) is true -- the following is in .bashrc
:
# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/glade/work/mlevy/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
eval "$__conda_setup"
else
if [ -f "/glade/work/mlevy/miniconda3/etc/profile.d/conda.sh" ]; then
. "/glade/work/mlevy/miniconda3/etc/profile.d/conda.sh"
else
export PATH="/glade/work/mlevy/miniconda3/bin:$PATH"
fi
fi
unset __conda_setup
# <<< conda initialize <<<
Okay. I just reverted my versions of jupyter
to the following:
Selected Jupyter core packages...
IPython : 8.1.1
ipykernel : 6.9.2
ipywidgets : not installed
jupyter_client : 7.1.2
jupyter_core : 4.9.2
jupyter_server : 1.13.5
jupyterlab : 3.2.9
nbclient : 0.5.13
nbconvert : 6.4.4
nbformat : 5.2.0
notebook : 6.4.10
qtconsole : not installed
traitlets : 5.1.1
And it still fails for me. So, I don't think it has anything to do with the versions of jupyter on the remote system.
@mnlevy1981: For me, I am using the system conda module, but the module load conda
is being done in .bashrc
. So, I think we are both initializing conda in the same place.
However, it still doesn't explain my experiment. Try running (from your laptop) the following command:
ssh user@cheyenne.ucar.edu '/bin/bash -c "conda activate"'
and then run
ssh user@cheyenne.ucar.edu '/bin/bash -lc "conda activate"'
The first version (without the -l
) fails for me with:
CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
To initialize your shell, run
$ conda init <SHELL_NAME>
Currently supported shells are:
- bash
- fish
- tcsh
- xonsh
- zsh
- powershell
See 'conda init --help' for more information and options.
IMPORTANT: You may need to close and restart your shell after running 'conda init'.
Okay. Just so we can verify what SSH is doing, I did an experiment to see what files are being executed on the remote system. I added echo "READING BASHRC FILE"
to my ~/.bashrc
and echo "READING PROFILE"
to my ~/.profile
. Then I ran the previous two commands:
~> ssh kpaul@cheyenne.ucar.edu '/bin/bash -c "conda activate"'
Token_Response:
READING BASHRC FILE
CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
To initialize your shell, run
$ conda init <SHELL_NAME>
Currently supported shells are:
- bash
- fish
- tcsh
- xonsh
- zsh
- powershell
See 'conda init --help' for more information and options.
IMPORTANT: You may need to close and restart your shell after running 'conda init'.
And then:
~> ssh kpaul@cheyenne.ucar.edu '/bin/bash -lc "conda activate"'
Token_Response:
READING BASHRC FILE
READING BASHRC FILE
READING PROFILE
So, it is loading the .bashrc
file both times. (And, just for clarification, my .profile
file is basically empty except for the echo
command.) With the -l
option, it loads it twice.
So, something tells me that this must have something to do with using LMOD instead of a hand-installed miniconda
.
@mnlevy1981: For me, I am using the system conda module, but the
module load conda
is being done in.bashrc
. So, I think we are both initializing conda in the same place.However, it still doesn't explain my experiment. Try running (from your laptop) the following command:
ssh user@cheyenne.ucar.edu '/bin/bash -c "conda activate"'
and then run
ssh user@cheyenne.ucar.edu '/bin/bash -lc "conda activate"'
Oh, weird - I get the errors with -c
but not with -lc
as well... and I get the same behavior (-lc
works, -c
does not) on casper, which is really bizarre because I definitely tested the latest jupyter-forward
code by running it on casper
.
Regardless, isn't the latest version using -lc
for everything except csh
and tcsh
? https://github.com/ncar-xdev/jupyter-forward/blob/main/jupyter_forward/core.py#L127
if 'csh' in self.shell:
command = f'''{self.shell} -c "{command}"'''
else:
command = f'''{self.shell} -lc "{command}"'''
Unless the initial conda
call isn't using run_command
...
No. There is a separate call on line 198:
@mnlevy1981: Unless the initial
conda
call isn't usingrun_command
...
Yep. You guessed it. 😄
Seems like line 198 should be refactored to use run_command
with the appropriate kwargs
.
No. There is a separate call on line 198:
So it looks like the solution is to modify that line
- self.session.run(f'{self.shell} -c "{command}"', asynchronous=True, pty=True, echo=True)
+ self.run_command(command, asynchronous=True, pty=True, echo=True)
?
Yeah. 😄
I'm glad you were able to reproduce the error without using LMOD. I was starting to think it had to have something to do with that... Phew!
@mnlevy1981: I don't think that change will work because of the asynchronous=True
argument. I think you cannot use run_command
in this case.
Or, rather, you need to refactor the run_command
to work with asynchronous execution.
Another option is to do something like
if 'csh' in self.shell:
self.run_shell = f"{self.shell} -c"
else:
self.run_shell = f"{self.shell} -lc"
in _check_shell()
and then both run_command()
and _launch_jupyter()
could rely on f'{self.run_shell} "{command}"'
(or a self._get_run_command()
function that returns the proper string)
Tried
jupyter-forward
for the first time on our USGS HPC machine. Was looking good, but then hung on the line:Full output:
The log file looks like: