pangeo-data / pangeo-cloud-federation

Deployment automation for Pangeo JupyterHubs on AWS, Google, and Azure
https://pangeo.io/cloud.html

Google Cloud Filestore out of space #601

Open rabernat opened 4 years ago

rabernat commented 4 years ago

We are out of space again on our shared NFS filestore. Jupyter pods can't start:

[I 2020-05-09 01:51:00.440 SingleUserNotebookApp notebookapp:1924] http://jupyter-0000-2d0001-2d5999-2d4917:8888/user/0000-0001-5999-4917/
[I 2020-05-09 01:51:00.440 SingleUserNotebookApp notebookapp:1925] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[E 2020-05-09 01:51:00.446 SingleUserNotebookApp notebookapp:1821] Failed to write server-info to /home/jovyan/.local/share/jupyter/runtime/nbserver-1.json: [Errno 28] No space left on device
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/bin/jupyterhub-singleuser", line 12, in <module>
    sys.exit(main())
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/jupyterhub/singleuser.py", line 660, in main
    return SingleUserNotebookApp.launch_instance(argv)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/jupyter_core/application.py", line 270, in launch_instance
    return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/traitlets/config/application.py", line 664, in launch_instance
    app.start()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/jupyterhub/singleuser.py", line 565, in start
    super(SingleUserNotebookApp, self).start()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/notebook/notebookapp.py", line 1933, in start
    self.write_browser_open_file()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/notebook/notebookapp.py", line 1843, in write_browser_open_file
    self._write_browser_open_file(open_url, f)
OSError: [Errno 28] No space left on device

@jhamman - can you remind us how you diagnosed the disk usage by user?
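Something like this is what I have in mind (rough sketch, assuming the filestore is mounted at /mnt/pangeo-filestore with one home directory per user):

# Summarize disk usage per user home directory, largest first
sudo du -sh /mnt/pangeo-filestore/* | sort -rh | head -n 20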

Happy Friday night everyone! 🙃

rabernat commented 4 years ago

I tried examining the filesystem myself, following these instructions to mount a filestore, but I got stuck:

sudo mkdir -p /mnt/pangeo-filestore
sudo mount 171.161.186:/test /mnt/pangeo-filestore

The mount command eventually timed out with

mount.nfs: Network is unreachable

I must be doing something wrong, but can't figure out what.

rabernat commented 4 years ago

I fixed this very temporarily by increasing the filestore capacity to 2.2 TB. But we really need to sort this out and figure out a better long-term solution for home directories.
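For anyone who needs to repeat the resize: something along these lines should work (sketch only; the instance name is a placeholder, the share name and zone are taken from the mount commands and zone note elsewhere in this thread, and the exact gcloud flags may differ by version), followed by a df check from a client that has the share mounted:

# Placeholder instance name "pangeo-filestore"; exact flags may vary across gcloud versions
gcloud filestore instances update pangeo-filestore --zone=us-central1-b --file-share=name=test,capacity=2.2TB

# Verify the new capacity from a client with the share mounted
df -h /mnt/pangeo-filestore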

rabernat commented 4 years ago

> I must be doing something wrong, but can't figure out what.

Note to self: the compute instance must be in the same zone as the filestore. The first time I tried it, my compute instance was in us-central1-a but the filestore is in us-central1-b. With both in the same zone, things work.
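For anyone retracing this, a quick way to compare zones before mounting (sketch; assumes gcloud is configured for the Pangeo project, and the client VM name is just an example):

# Zone/location of the Filestore instance(s)
gcloud filestore instances list
# Zones of existing compute instances
gcloud compute instances list
# Create the client VM in the matching zone, e.g.:
gcloud compute instances create filestore-client --zone=us-central1-b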

yuvipanda commented 4 years ago

I wrote a storage retention policy for the UC Berkeley hubs: https://docs.datahub.berkeley.edu/en/latest/topic/storage-retention.html. I'm going to implement some code for it soon. Maybe Pangeo could adopt a similar policy?
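For Pangeo, a minimal sketch of the kind of check such a policy needs (assumes the filestore is mounted at /mnt/pangeo-filestore with one home directory per user; nothing here is decided policy):

# For each user home directory, print the mtime (epoch seconds) of its newest file,
# oldest-first, to surface the least-recently-active homes as candidates for archiving.
for d in /mnt/pangeo-filestore/*/; do
  newest=$(find "$d" -type f -printf '%T@\n' 2>/dev/null | sort -n | tail -n 1)
  echo "${newest:-0} $d"
done | sort -n | head -n 20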