pangeo-data / pangeo

Pangeo website + discussion of general issues related to the project.
http://pangeo.io

Stability of Jupyter on GCP #246

Closed kaipak closed 6 years ago

kaipak commented 6 years ago

This isn't a specific issue per se, but general UX as I've been using Pangeo/GCP on a regular basis for the past few months.

Predictability with Jupyter over long periods continues to be an issue for me. I'll often leave a Jupyter session open crunching on some data and come back to find various errors in my browser (unexpected line reading JSON, network connection interrupted, unresponsive UI, etc.). Sometimes refreshing my browser works; other times the pod is completely toast and I have no recourse other than to restart the process. I find that the terminal sessions are the flakiest of all. I do a light amount of coding in the environment, and at least several times a day vi will become completely unresponsive and I'll have to refresh, which can leave my session in a funky state. Long-running scripts can be a frustrating proposition.

Fortunately, I have noticed that Jupyter in general seems more stable (fewer crashes, better performance, etc.), and connection issues seem to be less of an affliction in notebooks. I wish I had some logs to share, but I'm still fairly green to how everything works and don't really know where to look. I've tried looking at logs in the Google console for the pod, but it's not obvious to me there's any indication of the above issues---they seem like fairly prosaic system messages to me.

I've been switching back and forth between the latest Firefox and Chrome and neither seems to do much better than the other. I know cycles are hard to come by around here, so I'd be happy to take a look if someone could point me in the right general direction.

mrocklin commented 6 years ago

In order to conserve resources we have set up pangeo.pydata.org to destroy pods for which the notebook server has been idle for twenty minutes.

For doing tests I recommend instead using your own deployment with the Dask helm chart (I'm happy to help walk you through this).

kaipak commented 6 years ago

That would be awesome @mrocklin! Thanks much.

mrocklin commented 6 years ago

If you get a chance I recommend trying to walk through this documentation: http://dask.pydata.org/en/latest/setup/kubernetes-helm.html

And then maybe we can chat afterwards, regardless of how it goes

rabernat commented 6 years ago

It would be great if we could get the pangeo-specific cloud setup guide in shape (#230).

mrocklin commented 6 years ago

To be clear, I'm just suggesting setting up the Dask helm chart, not any sort of JupyterHub thing. This should be quite a bit simpler, but only appropriate for a single user.

jgerardsimcock commented 6 years ago

@mrocklin, in terms of stability in the user experience, is the single-user Dask deployment more stable than the JupyterHub multi-user deployment?

mrocklin commented 6 years ago

There are fewer moving pieces, so yes, probably.

tjcrone commented 6 years ago

@kaipak, I have also been experiencing problems like you describe, on Azure, and especially with the terminal. I'm pretty certain that this is a jupyter/nginx thing, not a pangeo thing. A long time ago in a different context, I remember improving things by changing some of the nginx settings, which in some cases are very conservative. I have not had a chance to dig back in here, but it might be one place to start looking for potential solutions.

mrocklin commented 6 years ago

There is a configuration setting in JupyterHub to control when to cull idle users:

    jupyterhub:
      cull:
        enabled: true
        users: false
        timeout: 1200
        every: 600
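To make the interaction of `timeout: 1200` and `every: 600` concrete, here is a minimal Python sketch. It is not JupyterHub's actual culler. It assumes the culler wakes up every `every` seconds and removes any server that has been idle for at least `timeout` seconds; `should_cull` and `max_idle_seconds` are hypothetical helper names used only for illustration.

```python
from datetime import datetime, timedelta, timezone


def should_cull(last_activity, now, timeout_seconds=1200):
    """Return True if a server idle since `last_activity` has reached the timeout.

    Assumption: "idle" is measured as time since the last recorded activity.
    """
    return (now - last_activity) >= timedelta(seconds=timeout_seconds)


def max_idle_seconds(timeout_seconds=1200, every_seconds=600):
    """Rough upper bound on how long an idle server can survive.

    Assumption: the check only runs once per `every_seconds`, so a server
    that crosses the timeout just after a check waits for the next one.
    """
    return timeout_seconds + every_seconds


if __name__ == "__main__":
    now = datetime(2018, 5, 10, 12, 0, tzinfo=timezone.utc)
    # Idle for 25 minutes: past the 20-minute timeout.
    print(should_cull(now - timedelta(minutes=25), now))
    # Idle for 5 minutes: still within the timeout.
    print(should_cull(now - timedelta(minutes=5), now))
    print(max_idle_seconds())
```

Under these assumptions, a notebook server left idle with the settings above would be removed somewhere between 20 and 30 minutes after its last activity.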

kaipak commented 6 years ago

If you get a chance I recommend trying to walk through this documentation: http://dask.pydata.org/en/latest/setup/kubernetes-helm.html And then maybe we can chat afterwards, regardless of how it goes

@mrocklin, really appreciate the help. Will look this over this week and circle back with you.

mrocklin commented 6 years ago

Sounds good. Please don't hesitate to reach out if you get stuck. I'm happy to walk you through things to accelerate progress after you've had a chance to look things over.

jgerardsimcock commented 6 years ago

@kaipak are you still experiencing stability issues? We've been getting a large number of 504 nginx errors during heavy IO operations. @tjcrone Can you point to where you made changes in nginx settings?

These look like they may be relevant issues: https://github.com/jupyterhub/jupyterhub/issues/1785 https://github.com/jupyter/notebook/issues/3537

kaipak commented 6 years ago

@kaipak are you still experiencing stability issues? We've been getting a large number of 504 nginx errors during heavy IO operations. @tjcrone Can you point to where you made changes in nginx settings?

These look like they may be relevant issues: jupyterhub/jupyterhub#1785 jupyter/notebook#3537

Nice find!

Yes, experiencing similar issues. Most of the operations I've been doing lately are heavy IO as I've been working primarily on storage benchmarks and doing read/writes on 600+ GB datasets.

jgerardsimcock commented 6 years ago

Hello @kaipak We've been getting so many 504 errors recently. Are you currently still experiencing these issues?

kaipak commented 6 years ago

Hello @kaipak We've been getting so many 504 errors recently. Are you currently still experiencing these issues?

Yesterday was pretty bad, but that was related to an event where something like 70 simultaneous users spun up instances and several quotas were exceeded. I'm wondering if the past problems we've experienced are related to us hitting these quotas on GCP.

jgerardsimcock commented 6 years ago

This is a common source of instability. We had to extend all the quotas on our deployment to avoid this. Also, we've found that if a user's storage space is approaching its limit, that user may be unable to log in or may find the server unresponsive.
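One way to check for the storage condition described above from inside a notebook or terminal is sketched below in Python. It uses the standard library's `shutil.disk_usage`, which reports on the filesystem containing the given path; whether that matches a user's quota depends on how the home directory is mounted in a particular deployment, which is an assumption here. The helper names and the 90% threshold are hypothetical.

```python
import shutil


def storage_fraction_used(path):
    """Fraction (0.0 to 1.0) of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some virtual filesystems report a zero size; treat them as empty.
        return 0.0
    return usage.used / usage.total


def is_near_limit(path, threshold=0.9):
    """True if usage is at or above `threshold` (hypothetical 90% default)."""
    return storage_fraction_used(path) >= threshold
```

For example, calling `is_near_limit("/home/jovyan")` would check the conventional Jupyter home directory, assuming that path exists in your image.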

stale[bot] commented 6 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 6 years ago

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.