Large notebook fails to save some of the time

fperez commented 3 years ago

This Gist contains a notebook that, especially for @tsnow03, often/most of the time fails to save on our hub. But the behavior is strange, as there seem to be interactions with client-side issues. Some of the things we've learned:

For @tsnow03, the Binder link at the gist above (identical notebook) does save normally. This seems to indicate the problem comes from something in our own hub.
For @tsnow03, it does work fine when saving it through Classic; the failure is only seen on Lab.
But for @fperez, it does also save on our hub, though sometimes very slowly and with poor UI responsiveness.
For @fperez, running the gist locally at home (Mac Mini, 64 GB RAM) works perfectly and saving is very fast.
With this notebook open, JupyterLab gets very sluggish for all of us.

fperez commented 3 years ago

To add some info - while for me it saves on our hub, I do get a rather poorly responsive UI when I open this notebook. The Lab UI lags, menus don't always open, the icon hovers (like background shading) don't necessarily update as I move the mouse, and I've sometimes gotten Firefox popping up its warning about "this page is slowing down firefox, click here to stop it".

I don't see that behavior when I access the same gist either locally or on a Binder run (by using the binder button on the gist).

So it seems that this particular notebook, on our hub, creates significant pressure on the client. I'm wondering if it then becomes a difference of home systems, where my machine due to having more RAM/cores manages to squeeze through, while for @tsnow03 (8GB RAM) it's slow enough for something to completely time out.

That could explain what we're seeing, though it still requires a fix: the fact it works for @tsnow03 on Binder shows that it can work, and I still see this extra pressure on the client. So there's still something happening on our backend with this notebook.

@consideRatio does this trigger any ideas? Filesystem performance in home directory storage?

consideRatio commented 3 years ago

Thank you @fperez and @tsnow03 for debugging this, this is very helpful!

These are my suspicions at the moment.

The .ipynb format is JSON, and to read JSON you need to load all of the JSON, making a large .ipynb file a bit troublesome to work with.
The --collaborative=true flag we have enabled could perhaps have degraded the performance related to large files?
The networking, which in our setup involves: An AWS load balancer, Traefik v2 in the autohttps pod in our k8s cluster, and configurable-http-proxy running in the proxy pod of our k8s cluster. Is some part of this networking cutting connections from the browser that wants to save a large .ipynb file? Intermittent success in this could also cause something to block in the JupyterLab UI I presume.
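The first suspicion can be checked away from the hub. The following is a minimal sketch, assuming only that the notebook is available as a local file: it times a full JSON parse and re-serialization and lists the cells with the largest stored outputs. The function name `profile_notebook` and the returned keys are illustrative and not part of any Jupyter API.

```python
import json
import time
from pathlib import Path


def profile_notebook(path):
    """Return size and JSON round-trip timings for an .ipynb file."""
    raw = Path(path).read_text(encoding="utf-8")

    start = time.perf_counter()
    nb = json.loads(raw)  # the whole document has to be parsed in one go
    load_seconds = time.perf_counter() - start

    start = time.perf_counter()
    json.dumps(nb)  # a save has to serialize the whole document again
    dump_seconds = time.perf_counter() - start

    # (serialized output size in bytes, cell index), largest first
    output_sizes = sorted(
        (
            (len(json.dumps(cell.get("outputs", []))), index)
            for index, cell in enumerate(nb.get("cells", []))
        ),
        reverse=True,
    )
    return {
        "file_bytes": len(raw.encode("utf-8")),
        "n_cells": len(nb.get("cells", [])),
        "load_seconds": load_seconds,
        "dump_seconds": dump_seconds,
        "largest_outputs": output_sizes[:5],
    }
```

If parsing and serializing take well under a second locally, the JSON format alone is unlikely to explain saves that take minutes.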

I'm quite confident that it isn't related to us having an NFS filesystem or it being slow etc.

I've tried debugging the networking, but I fail to draw a conclusion. There are several components to consider.

The AWS load balancer we use. There may be a relevant annotation to use on our k8s Service that makes AWS provision a load balancer for us: https://kubernetes.io/docs/concepts/services-networking/service/#connection-draining-on-aws
The autohttps pod in our k8s cluster running Traefik (Z2JH's automated TLS termination system).
The proxy pod in our k8s cluster running configurable-http-proxy

Out of these, I suspect that the issue, if it is network related, stems from the AWS load balancer or the autohttps pod.

I'll start applying some configuration on the AWS load balancer for now to see if that can help.
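One way to separate the networking path from the browser is to time a save made directly against the Jupyter Server contents API, which accepts a PUT to `/api/contents/<path>`. This is a rough sketch under stated assumptions: the base URL, token, and notebook path passed in are placeholders to be supplied by the person testing, and the helper names `build_save_request` and `timed_save` are made up for illustration.

```python
import json
import time
import urllib.parse
import urllib.request


def build_save_request(base_url, api_path, notebook, token):
    """Build a PUT request that saves `notebook` through the contents API."""
    body = json.dumps(
        {"type": "notebook", "format": "json", "content": notebook}
    ).encode("utf-8")
    url = f"{base_url.rstrip('/')}/api/contents/{urllib.parse.quote(api_path)}"
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={
            "Authorization": f"token {token}",
            "Content-Type": "application/json",
        },
    )


def timed_save(request, timeout=600):
    """Send the request and return (HTTP status, elapsed seconds)."""
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=timeout) as response:
        response.read()
        status = response.status
    return status, time.perf_counter() - start
```

If a save made this way is consistently fast through the load balancer and proxies while the same save from the Lab UI stalls, the networking components become a less likely culprit.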

fperez commented 3 years ago

Thanks @consideRatio for the debugging effort! Your point about the collaborative flag is an interesting one - we're not yet using that feature all the time, it might be worth testing whether it plays a role by turning it temporarily off. If it does show an impact that would also be valuable knowledge to communicate to the JLab team...

fperez commented 3 years ago

Interesting input from @tsnow03 after further debugging - she is seeing very slow saves even on a 2nd computer with 192GB of RAM, so memory pressure is certainly not an issue on that system: "I just tried this on my desktop (Chrome) with 192 GB of RAM and the first save took about 5 min to initiate once I started clicking on it to do so. It took about a min to save. Now I've tried saving again and I waited 15 min for it to start saving (pushing save intermittently) and it didn't. I'm experiencing the same delay and slow save on Safari and my tab shutdown and restarted when it did save. All other notebooks start saving and finish saving nearly instantaneously."

I am quite puzzled by this one...

consideRatio commented 3 years ago

Okay the current status is now that:

I've applied a configuration to the AWS load balancer that will make idle connections remain open for even longer. I'm not at all sure if this is relevant though.
I've disabled the collaborative feature

To notice a difference from the latter change (disabling collaborative), one needs to restart one's server via https://hub.jupytearth.org/hub/home

Let's see if this makes a difference.

fperez commented 3 years ago

Thanks @consideRatio! Too early to tell, but it seems more responsive to me. I was getting successful saves with less lag than @tsnow03, but still with quite a bit of lag and the occasional Firefox high usage warning. This time it was much, much faster.

consideRatio commented 3 years ago

@fperez aha nice! A relevant next step would then perhaps be to try starting jupyterlab locally with and without --collaborative and see if that seems to make a difference as well; then we will have excluded the JupyterHub networking complexity too.
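The local comparison could be set up roughly as follows. This is a sketch, assuming a JupyterLab 3.1.x install where the `--collaborative` flag is available; the ports and the helper names `lab_command` and `launch_pair` are arbitrary choices for illustration.

```python
import shlex
import subprocess


def lab_command(port, collaborative):
    """Return the argv for one JupyterLab instance."""
    argv = ["jupyter", "lab", "--no-browser", f"--port={port}"]
    if collaborative:
        argv.append("--collaborative")
    return argv


def launch_pair(base_port=8888):
    """Start both servers; the caller is responsible for terminating them."""
    return [
        subprocess.Popen(lab_command(base_port, collaborative=False)),
        subprocess.Popen(lab_command(base_port + 1, collaborative=True)),
    ]


if __name__ == "__main__":
    # Only print the two commands; call launch_pair() to actually start them.
    for offset, collaborative in enumerate((False, True)):
        print(shlex.join(lab_command(8888 + offset, collaborative)))
```

Opening the same notebook in both instances and saving it in each gives a side-by-side comparison on one machine, with no hub or proxies involved.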

fperez commented 3 years ago

I think that's it!! I just tested, side-by-side, JLab 3.1.10 with and without --collaborative, and the version with it gets extremely laggy with that gist, and sometimes fails to save.

If @tsnow03 can confirm that it also works for her without pain now that the collaborative feature is off on our hub, we can then report this over to the JLab team.

fperez commented 3 years ago

Actually I'm going to open a companion issue right away in Lab - the behavior I'm seeing locally is clearly a problem and it's pretty evident the problem is RTC: this is a 100% local run, no JupyterHub, fast machine with gobs of RAM. Might as well report it now.

fperez commented 3 years ago

Let's then leave RTC off for now - we can explore turning it on selectively in our spawner later so we only use it when absolutely necessary. Hopefully the Lab team will find the reproducible example enough to make progress.
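Selective enabling could look something like the sketch below, which uses JupyterHub's `pre_spawn_hook` to add the flag only for chosen users. This is an assumption about how it might be done, not the hub's actual configuration: the usernames in `RTC_USERS` are hypothetical, and the right mechanism depends on how the single-user server command is assembled in a given deployment.

```python
RTC_FLAG = "--collaborative"
RTC_USERS = {"alice", "bob"}  # hypothetical usernames, for illustration only


def rtc_pre_spawn_hook(spawner):
    """Add or remove the RTC flag depending on who is spawning."""
    # Drop any existing copy first so repeated spawns never duplicate the flag.
    args = [arg for arg in spawner.args if arg != RTC_FLAG]
    if spawner.user.name in RTC_USERS:
        args.append(RTC_FLAG)
    spawner.args = args


# In jupyterhub_config.py, where `c` is provided by JupyterHub:
# c.Spawner.pre_spawn_hook = rtc_pre_spawn_hook
```

Everyone outside the set keeps RTC off, which matches the idea of using it only when absolutely necessary.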

@tsnow03 just confirmed that saving worked for her too, so we're done here. Passing the ball over to the Lab team :)

consideRatio commented 3 years ago

Wieeeeeeeeeeeeeeeeeeeeeeeeeeeeee!!! Nice work narrowing this down @fperez @tsnow03!!

I'm very happy this is no longer a "It could be anything really..." kind of situation :D

fperez commented 3 years ago

Yup, was a hard one to debug, great job @consideRatio and amazing patience by @tsnow03, who dealt with this and with painful manual workarounds for weeks without complaint. Very sorry to have put you through this!

tsnow03 commented 3 years ago

Yes great job @consideRatio! And thanks to both of you for your help in addressing this. No worries on my end. I'm happy it led to some interesting finds with our setup!

fperez commented 3 years ago

Closing this since the problem is really in Lab. Good job everyone!

fperez commented 3 years ago

For reference, the Lab team now has PR #11003 that should address this issue, we can test it once it gets merged and goes into the next release (likely 3.1.11).

yuvipanda commented 2 years ago

I think it has been released now, time for JMTE to try RTC again? :)

fperez commented 2 years ago

Yes! Let's :)

consideRatio commented 2 years ago

@fperez @yuvipanda this was enabled! It now works and with jupyterlab-link-share as well!

fperez commented 2 years ago

Totally awesome, thx @consideRatio!! I just tested it this morning with some collaborators and it worked very smoothly. Thank you so much!!

pangeo-data / jupyter-earth

Large notebook fails to save some of the time #78