pangeo-data / jupyter-earth

Jupyter meets the Earth: combining research use cases in geosciences with technical developments within the Jupyter and Pangeo ecosystems.
https://jupytearth.org

Unstable servers #111

Open JordiBolibar opened 2 years ago

JordiBolibar commented 2 years ago

Over the last few days I've been having issues with my connection to the JupyterHub servers. Today my server was stopped twice while running simulations. All of a sudden my connection dropped and I had to start a new server. This is particularly annoying when working with Julia, since one then needs to download and re-compile all the packages.

Is there a reason behind this unstable behaviour? Thanks again for your help!

consideRatio commented 2 years ago

Today I've had my server stopped twice while running simulations. All of the sudden my connection dropped and I had to start again a server.

So while you were actively using your server via your browser (not just leaving a script running in a terminal overnight), your server was shut down, and shortly afterwards JupyterHub showed you a "start server" button / a choice of what server type to start?

What kind of server have you chosen? Was it a shared 1/16th, shared 1/4th, or dedicated server of some size?

JordiBolibar commented 2 years ago

Yes, while running simulations directly from VSCode (i.e. active development and launching simulations). First I see that the VSCode connection has dropped, then I go to the Launcher or to a terminal and it asks me to restart the server.

I'm using a dedicated "massive" machine.

consideRatio commented 2 years ago

@JordiBolibar are you using SSH or VSCode from your local computer, or are you doing everything by browsing hub.jupytearth.org (where you can have a terminal and use VSCode)?

JordiBolibar commented 2 years ago

I'm running everything directly on the browser (including VSCode). So no SSH.

consideRatio commented 2 years ago

@JordiBolibar thanks, investigating further - about when was the last server shutdown event? That can help me find the relevant log entries in a sea of logs.

JordiBolibar commented 2 years ago

This morning, between 9 and 11 am CET, I had 2 shutdowns.

consideRatio commented 2 years ago

@JordiBolibar is this the named server called ODINN, or something else?

JordiBolibar commented 2 years ago

Yes, I'm using ODINN and ODINN-2. I think this morning I only used ODINN.

consideRatio commented 2 years ago

I observe this in the JupyterHub logs, so it seems that your server is getting culled because it is considered to be inactive. All times below are in UTC, so one hour behind CET.

# march 1st, 13:11 CET time
[I 220301 12:11:26 __init__:190] Culling server jordibolibar/ODINN (inactive for 01:08:54)
# march 1st, 23:21 CET time
[I 220301 22:21:26 __init__:190] Culling server jordibolibar (inactive for 01:03:12)
# march 2nd, 21:01 CET time
[I 220302 20:01:26 __init__:190] Culling server jordibolibar (inactive for 01:07:32)
# march 3rd, 17:11 CET time
[I 220303 16:11:26 __init__:190] Culling server jordibolibar (inactive for 01:00:35)

In #105 we discussed a workaround. Are these shutdowns happening even after applying that workaround, or without it?

JordiBolibar commented 2 years ago

Well, that was for leaving simulations running overnight. Here I am actively using the session. The thing is that I had zero issues with the same working routine until a few days ago.

And BTW, the Jupyter notebook with the sleep command only did the trick sometimes. In the last month it stopped working and my server was getting stopped anyway.
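
(For context, here is a minimal sketch of the kind of keep-alive cell referred to above. What the #105 notebook did exactly is an assumption on my part: a cell that periodically produces a little kernel output so the server keeps registering activity.)

import time

# Keep-alive cell: generate low-cost kernel activity at regular intervals so the
# idle culler keeps considering this server active. Interval and message are arbitrary.
while True:
    print("keep-alive ping", flush=True)
    time.sleep(5 * 60)  # every 5 minutes, well below the ~1 hour cull threshold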

JordiBolibar commented 2 years ago

Hi @consideRatio. Friday everything worked super smoothly, but today things are extremely unstable again. I've had 4 shutdowns already, even while actively coding and launching simulations. The problems also included issues loading VSCode from code-server.

consideRatio commented 2 years ago

Hmmm okay, so it seems you have exceeded the available memory in a way that made you get kicked out by k8s, which is quite a harsh stop compared to, for example, having a single process shut down.

Events:
  Type     Reason         Age    From                 Message
  ----     ------         ----   ----                 -------
  Warning  Evicted        6m14s  kubelet              The node was low on resource: memory.
  Normal   Killing        6m14s  kubelet              Stopping container notebook
  Warning  FailedKillPod  6m11s  kubelet              error killing pod: failed to "KillPodSandbox" for "ab66948f-92aa-418d-b4d8-00c017b5fd90" with KillPodSandboxError: "rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod \"jupyter-jordibolibar---4f-44-49-4e-4e-2d2_prod\" network: del cmd: error received from DelNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused\""

The following is also relevant. I can see one culling event due to "inactivity" according to JupyterHub since 2022-03-08 20:58:05.022, but I don't have access to logs further back in time.

kubectl logs deploy/hub -c hub | grep Culling

[I 220309 14:31:26 __init__:190] Culling server jordibolibar (inactive for 01:08:46)

Whenever you are culled by JupyterHub, it is because JupyterHub considers you inactive. By using code-server, you may be bypassing the mechanism that lets JupyterHub see you as active. I want to try to follow that up, but there isn't a quick fix for it besides the workaround ideas suggested in #105.
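
One possible sketch of a workaround for that: the single-user server's own credentials can be used to tell the hub explicitly that the server is active. This assumes the standard JUPYTERHUB_API_TOKEN, JUPYTERHUB_API_URL, JUPYTERHUB_USER and JUPYTERHUB_SERVER_NAME environment variables are set inside the container and that the token is allowed to post to the hub's activity endpoint; treat it as an untested sketch, not something configured on this hub.

import os
import time
from datetime import datetime, timezone

import requests

# Standard environment variables set in a JupyterHub single-user server.
api_url = os.environ["JUPYTERHUB_API_URL"]            # e.g. http://hub:8081/hub/api
token = os.environ["JUPYTERHUB_API_TOKEN"]
user = os.environ["JUPYTERHUB_USER"]
server_name = os.environ.get("JUPYTERHUB_SERVER_NAME", "")

while True:
    now = datetime.now(timezone.utc).isoformat()
    # Report activity for this named server so the idle culler sees it as active.
    requests.post(
        f"{api_url}/users/{user}/activity",
        headers={"Authorization": f"token {token}"},
        json={"servers": {server_name: {"last_activity": now}}},
        timeout=10,
    )
    time.sleep(10 * 60)  # every 10 minutes, well under the ~1 hour cull threshold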

Besides being culled, you can get terminated for using too much memory. That can happen both via Kubernetes, which is a higher-level kind of control, and via Linux itself inside the Docker container you control, which is a lower-level kind of termination.

To work effectively without getting blocked by these issues I suggest:

  1. Try to monitor your memory usage so you notice when you are about to hit a limit (a small sketch of doing this from inside the container follows below).
  2. Use workarounds like the ones in #105 whenever you are not working with your server through the JupyterLab interface or running Jupyter notebooks.
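
As a rough starting point for item 1, here is a sketch of checking the container's memory usage from a notebook or terminal. It assumes a cgroup v1 layout (under cgroup v2 the files are memory.current and memory.max instead) and is not specific to this hub:

from pathlib import Path

def read_int(path):
    # cgroup files contain a single integer (followed by a newline)
    return int(Path(path).read_text())

# cgroup v1 memory accounting for this container; an effectively unlimited
# cgroup shows up as an extremely large limit value.
usage = read_int("/sys/fs/cgroup/memory/memory.usage_in_bytes")
limit = read_int("/sys/fs/cgroup/memory/memory.limit_in_bytes")

print(f"memory: {usage / 2**30:.1f} GiB used of {limit / 2**30:.1f} GiB limit "
      f"({100 * usage / limit:.0f}%)")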

Thank you for reporting these experiences; it helps drive development towards resolving these kinds of issues long term, even though it is hard to fix them quickly without upstream changes.

consideRatio commented 2 years ago

On the 64 CPU node you are running on now, you should have at least 224 GB of memory. I entered the container and observed with top that you were using roughly 10 GB at the moment, which should be absolutely fine.

 6558 jovyan    20   0 4755164   1.0g 224596 R 100.3   0.4   1:24.18 julia                                                                                                                                                                                                    
 6560 jovyan    20   0 4771004   1.0g 225488 R 100.3   0.4   1:23.37 julia                                                                                                                                                                                                    
 6563 jovyan    20   0 4782064   1.0g 224796 R 100.3   0.4   1:27.98 julia                                                                                                                                                                                                    
 6565 jovyan    20   0 4842488   1.0g 224684 R 100.3   0.4   1:26.68 julia                                                                                                                                                                                                    
 1344 jovyan    20   0 5497596   1.1g 228344 R 100.0   0.5   2:04.66 julia                                                                                                                                                                                                    
 6557 jovyan    20   0 4838400   1.0g 225828 R 100.0   0.4   1:23.53 julia                                                                                                                                                                                                    
 6559 jovyan    20   0 4771592   1.0g 224236 R 100.0   0.4   1:24.42 julia                                                                                                                                                                                                    
 6561 jovyan    20   0 4838436   1.0g 225548 R 100.0   0.4   1:26.19 julia                                                                                                                                                                                                    
 6562 jovyan    20   0 4905796   1.0g 225748 R 100.0   0.4   1:23.80 julia                                                                                                                                                                                                    
 6564 jovyan    20   0 4842768   1.0g 225536 R 100.0   0.4   1:24.76 julia

JordiBolibar commented 2 years ago

Yes, I've been trying to understand what happened yesterday. I think I was running a very memory-expensive task on several nodes, so I basically blew through the server's memory. I'm now keeping memory usage under control and using the notebook sleep trick to keep the kernel alive, and so far it is working better. I still get the server shut down sometimes, despite these measures, but it is way less frequent.