raphaeldussin / example.pangeo.io-deploy

Deployment automation for example.pangeo.io
0 stars 1 forks source link

500 server error #4

Closed rabernat closed 5 years ago

rabernat commented 5 years ago

I am now getting 500 server errors when trying to start a notebook. Here's how I would debug.

# connect to cluster
$ gcloud container clusters get-credentials ocean-pangeo-io-cluster --zone us-central1-b --project pangeo-181919

# list pods
$ kubectl -n staging get pods
NAME                         READY     STATUS    RESTARTS   AGE
autohttps-6ffdcc94ff-pz77h   2/2       Running   0          5d
hub-7d9c6c8465-7lh8d         1/1       Running   0          14h
proxy-7b4fb46447-rgjms       1/1       Running   0          5d

# get logs from hub
$ kubectl -n staging logs hub-7d9c6c8465-7lh8d 
# skip to relevant part
[I 2018-11-27 13:17:49.311 JupyterHub log:158] 302 GET / -> /hub (@10.128.0.26) 0.78ms
[I 2018-11-27 13:17:49.352 JupyterHub log:158] 302 GET /hub -> /hub/ (@10.128.0.26) 0.69ms
[I 2018-11-27 13:17:49.434 JupyterHub log:158] 302 GET /hub/ -> /user/rabernat/ (rabernat@10.128.0.26) 40.93ms
[I 2018-11-27 13:17:49.488 JupyterHub log:158] 302 GET /user/rabernat/ -> /hub/user/rabernat/ (@10.128.0.26) 0.84ms
[W 2018-11-27 13:17:49.580 JupyterHub base:714] User rabernat is slow to start (timeout=0)
[I 2018-11-27 13:17:49.582 JupyterHub base:1056] rabernat is pending spawn
[I 2018-11-27 13:17:49.584 JupyterHub log:158] 200 GET /hub/user/rabernat/ (rabernat@10.128.0.26) 53.93ms
[I 2018-11-27 13:17:49.590 JupyterHub spawner:1671] PVC claim-rabernat already exists, so did not create new pvc.
[I 2018-11-27 13:18:13.545 JupyterHub proxy:301] Checking routes
[W 181127 13:18:13 cull_idle_servers:128] Not culling server rabernat with pending spawn
[I 2018-11-27 13:18:13.731 JupyterHub log:158] 200 GET /hub/api/users (cull-idle@127.0.0.1) 13.38ms
[I 2018-11-27 13:18:39.255 JupyterHub spawner:1770] Deleting pod jupyter-rabernat
[W 2018-11-27 13:18:49.773 JupyterHub user:504] rabernat's server never showed up at http://10.40.47.10:8888/user/rabernat/ after 30 seconds. Giving up
[W 2018-11-27 13:18:49.827 JupyterHub base:689] 4 consecutive spawns failed.  Hub will exit if failure count reaches 5 before succeeding
[E 2018-11-27 13:18:49.828 JupyterHub gen:974] Exception in Future <Task finished coro=<BaseHandler.spawn_single_user.<locals>.finish_user_spawn() done, defined at /usr/local/lib/python3.6/dist-packages/jupyterhub/handlers/base.py:619> exception=TimeoutError("Server at http://10.40.47.10:8888/user/rabernat/ didn't respond in 30 seconds",)> after timeout
    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 970, in error_callback
        future.result()
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/handlers/base.py", line 626, in finish_user_spawn
        await spawn_future
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/user.py", line 522, in spawn
        raise e
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/user.py", line 496, in spawn
        resp = await server.wait_up(http=True, timeout=spawner.http_timeout)
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/utils.py", line 197, in wait_for_http_server
        timeout=timeout
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/utils.py", line 155, in exponential_backoff
        raise TimeoutError(fail_message)
    TimeoutError: Server at http://10.40.47.10:8888/user/rabernat/ didn't respond in 30 seconds
rabernat commented 5 years ago

Then I monitored the notebook pod startup process and saw this

$ kubectl -n staging logs jupyter-rabernat
cp: preserving times for '/home/jovyan': Operation not permitted
Traceback (most recent call last):
  File "/srv/conda/lib/python3.6/site-packages/jupyterlab/labhubapp.py", line 5, in <module>
    from jupyterhub.singleuser import SingleUserNotebookApp
ModuleNotFoundError: No module named 'jupyterhub'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/srv/conda/bin/jupyter-labhub", line 7, in <module>
    from jupyterlab.labhubapp import main
  File "/srv/conda/lib/python3.6/site-packages/jupyterlab/labhubapp.py", line 8, in <module>
    raise ImportError('You must have jupyterhub installed for this to work.')
ImportError: You must have jupyterhub installed for this to work.
Ryans-MacBook-Pro:xarray rpa$ kubectl -n staging logs jupyter-rabernat
cp: preserving times for '/home/jovyan': Operation not permitted
Traceback (most recent call last):
  File "/srv/conda/lib/python3.6/site-packages/jupyterlab/labhubapp.py", line 5, in <module>
    from jupyterhub.singleuser import SingleUserNotebookApp
ModuleNotFoundError: No module named 'jupyterhub'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/srv/conda/bin/jupyter-labhub", line 7, in <module>
    from jupyterlab.labhubapp import main
  File "/srv/conda/lib/python3.6/site-packages/jupyterlab/labhubapp.py", line 8, in <module>
    raise ImportError('You must have jupyterhub installed for this to work.')
ImportError: You must have jupyterhub installed for this to work.
Ryans-MacBook-Pro:xarray rpa$ kubectl -n staging logs jupyter-rabernat
cp: preserving times for '/home/jovyan': Operation not permitted
Traceback (most recent call last):
  File "/srv/conda/lib/python3.6/site-packages/jupyterlab/labhubapp.py", line 5, in <module>
    from jupyterhub.singleuser import SingleUserNotebookApp
ModuleNotFoundError: No module named 'jupyterhub'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/srv/conda/bin/jupyter-labhub", line 7, in <module>
    from jupyterlab.labhubapp import main
  File "/srv/conda/lib/python3.6/site-packages/jupyterlab/labhubapp.py", line 8, in <module>
    raise ImportError('You must have jupyterhub installed for this to work.')
ImportError: You must have jupyterhub installed for this to work.
rabernat commented 5 years ago

So it looks like my suggestion to "just use my binder image" was not 100% a good idea. Certain important packages are missing. I recommend a side-by-side comparison of our environment with https://github.com/pangeo-data/dev.pangeo.io-deploy/tree/staging/deployments/dev.pangeo.io/image/binder

That should resolve things.

raphaeldussin commented 5 years ago

I think your image is fine. I tried to kill my pod and I broke it. Gonna redo the cluster from scratch.

rabernat commented 5 years ago

I think your image is fine

I disagree. My environment.yml lacks jupyterhub, which is necessary for this to work. That explains the error I got in my pod.

Redoing the cluster from scratch seems like overkill!

raphaeldussin commented 5 years ago

cluster is rebuild, it was quick. I tried first with your image but it fails at the creation of the cluster, not on the 500 error anymore. I am reverting to the previous environment, tests it works then move to the dev.pangeo environment.

raphaeldussin commented 5 years ago

the dev environment does not work for me. It fixes the #2 issue but the compute cells don't execute properly and just stall. I am gonna revert to the original environment and build a new cluster for testing.