Closed by fperez 3 years ago
To add some info: while the notebook does save for me on our hub, I get a rather unresponsive UI when I open it. The Lab UI lags, menus don't always open, icon hover effects (like background shading) don't reliably update as I move the mouse, and I've sometimes had Firefox pop up its warning that "this page is slowing down Firefox, click here to stop it".
I don't see that behavior when I access the same gist either locally or on a Binder run (by using the binder button on the gist).
So it seems that this particular notebook, on our hub, creates significant pressure on the client. I'm wondering if it then comes down to a difference between our machines: mine, having more RAM/cores, manages to squeeze through, while for @tsnow03 (8 GB RAM) it's slow enough for something to time out completely.
That could explain what we're seeing, though it still requires a fix: the fact that it works for @tsnow03 on Binder shows that it can work, and I still see this extra pressure on the client. So something is still happening on our backend with this notebook.
@consideRatio does this trigger any ideas? Filesystem performance in home directory storage?
Thank you @fperez and @tsnow03 for debugging this, this is very helpful!
These are my suspicions at the moment:

- The `--collaborative=true` flag we have enabled could perhaps have degraded performance related to large files?
- Networking: the `autohttps` pod in our k8s cluster, and `configurable-http-proxy` running in the `proxy` pod of our k8s cluster. Is some part of this networking cutting connections from the browser that wants to save a large .ipynb file? Intermittent success there could also cause something to block in the JupyterLab UI, I presume.

I'm quite confident that it isn't related to us having an NFS filesystem or it being slow, etc.
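For concreteness, this is roughly how such a flag usually reaches users' servers hub-side; a sketch of a `jupyterhub_config.py` fragment, assuming the generic `Spawner.args` mechanism rather than our actual deployment's config:

```python
# jupyterhub_config.py -- hedged sketch, not necessarily how our hub sets it.
# Spawner.args is appended to the command line of every user's single-user
# server, so this one line turns RTC on hub-wide:
c.Spawner.args = ["--collaborative"]
```

Because a setting like this applies to every user's server at once, it fits the symptom being hub-specific rather than notebook- or machine-specific.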
I've tried debugging the networking, but I fail to draw a conclusion. There are several components to consider:

- the `autohttps` pod in our k8s cluster, running Traefik (Z2JH's automated TLS termination system)
- the `proxy` pod in our k8s cluster, running `configurable-http-proxy`

Out of these, I suspect the issue stems from the AWS load balancer or the `autohttps` pod, if something is problematic.
I'll start applying some configuration on the AWS load balancer for now to see if that can help.
Thanks @consideRatio for the debugging effort! Your point about the collaborative flag is an interesting one - we're not yet relying on that feature all the time, so it might be worth turning it off temporarily to test whether it plays a role. If it does show an impact, that would also be valuable knowledge to communicate to the JLab team...
Interesting input from @tsnow03 after further debugging - she is seeing very slow saves even on a 2nd computer with 192GB of RAM, so memory pressure is certainly not an issue on that system: "I just tried this on my desktop (Chrome) with 192 GB of RAM and the first save took about 5 min to initiate once I started clicking on it to do so. It took about a min to save. Now I've tried saving again and I waited 15 min for it to start saving (pushing save intermittently) and it didn't. I'm experiencing the same delay and slow save on Safari and my tab shutdown and restarted when it did save. All other notebooks start saving and finish saving nearly instantaneously."
I am quite puzzled by this one...
Okay the current status is now that:
Note that for the latter point (collaborative) to take effect, one needs to restart one's server via https://hub.jupytearth.org/hub/home
Let's see if this makes a difference.
Thanks @consideRatio! Too early to tell, but it seems more responsive to me. I was getting successful saves with less lag than @tsnow03, but still with quite a bit of lag and the occasional Firefox high usage warning. This time it was much, much faster.
@fperez aha, nice! A relevant follow-up would then be to try starting JupyterLab locally with and without `--collaborative` and see if that makes a difference as well; then we would have excluded the JupyterHub networking complexity too.
I think that's it!! I just tested JLab 3.1.10 side by side, with and without `--collaborative`, and the version with it gets extremely laggy with that gist and sometimes fails to save.
If @tsnow03 can confirm that it also works for her without pain now that the collaborative feature is off on our hub, we can then report this to the JLab team.
Actually, I'm going to open a companion issue in Lab right away - the behavior I'm seeing locally is clearly a problem, and it's pretty evident the problem is RTC: this is a 100% local run, no JupyterHub, on a fast machine with gobs of RAM. Might as well report it now.
Let's then leave RTC off for now - we can explore turning it on selectively in our spawner later so we only use it when absolutely necessary. Hopefully the Lab team will find the reproducible example enough to make progress.
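The "turn it on selectively in our spawner" idea could look roughly like this; a sketch assuming JupyterHub's standard `pre_spawn_hook` and a hypothetical opt-in list (`RTC_OPT_IN` and its members are made up for illustration):

```python
# jupyterhub_config.py -- hypothetical sketch, not our actual config.
# Leave RTC off by default and append --collaborative only for users
# who have explicitly opted in.
RTC_OPT_IN = {"alice", "bob"}  # hypothetical opt-in list

def pre_spawn_hook(spawner):
    """Add --collaborative to the single-user server args for opted-in users."""
    if spawner.user.name in RTC_OPT_IN and "--collaborative" not in spawner.args:
        spawner.args = list(spawner.args) + ["--collaborative"]

c.Spawner.pre_spawn_hook = pre_spawn_hook  # `c` is JupyterHub's config object
```

As noted above, a changed flag only takes effect after the affected user restarts their server.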
@tsnow03 just confirmed that saving worked for her too, so we're done here. Passing the ball over to the Lab team :)
Wieeeeeeeeeeeeeeeeeeeeeeeeeeeeee!!! Nice work narrowing this down @fperez @tsnow03!!
I'm very happy this is no longer a "It could be anything really..." kind of situation :D
Yup, this was a hard one to debug - great job @consideRatio, and amazing patience from @tsnow03, who dealt with this for weeks without complaint, using painful manual workarounds. Very sorry to have put you through this!
Yes, great job @consideRatio! And thanks to both of you for your help in addressing this. No worries on my end - I'm happy it led to some interesting finds with our setup!
Closing this since the problem is really in Lab. Good job everyone!
For reference, the Lab team now has PR #11003 that should address this issue, we can test it once it gets merged and goes into the next release (likely 3.1.11).
I think it has been released now, time for JMTE to try RTC again? :)
Yes! Let's :)
@fperez @yuvipanda this was enabled! It now works and with jupyterlab-link-share as well!
Totally awesome, thx @consideRatio!! I just tested it this morning with some collaborators and it worked very smoothly. Thank you so much!!
This gist contains a notebook that often (for @tsnow03, most of the time) fails to save on our hub. The behavior is strange, though, as there seem to be interactions with client-side issues. Some of the things we've learned: