neurohackademy / nh2020-jupyterhub

hub.neurohackademy.org: Deployment config, docker image, documentation.

storage: cache NFS storage with rsync #60

Closed: consideRatio closed this issue 4 years ago

consideRatio commented 4 years ago

I want to ensure that the data we provide in ~/data can handle the load of hundreds of simultaneous users accessing it. It may or may not become a problem, but I want to make sure it doesn't.

Because of this, I plan to provide participants with a cached copy of the NFS server's content under ~/data. They will then read from a node-local disk rather than from the NFS server, which avoids the risk of overloading the NFS server if, for example, too much data is read at the same time.

  1. Create a k8s DaemonSet (DS) that automatically creates a pod on each node that has users.
  2. Let this DS pod mount the NFS server's /data in read-only mode and rsync all content to a hostPath volume every minute (see the sketch below this list).
  3. Mount this hostPath read-only at /nh/data if the user is a participant, and mount the NFS share directly if it's an instructor.
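
A minimal sketch of steps 1 and 2 could look like the DaemonSet below. The image, resource names, NFS server address/export, hostPath location, and the zero-to-jupyterhub node label/taint are placeholders or assumptions, not values taken from this deployment.

```yaml
# Sketch only: names, the NFS server address/export, and the hostPath
# directory are placeholders that need to match the actual deployment.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: data-cache
spec:
  selector:
    matchLabels:
      app: data-cache
  template:
    metadata:
      labels:
        app: data-cache
    spec:
      # Only run on user nodes (assuming zero-to-jupyterhub's node label/taint).
      nodeSelector:
        hub.jupyter.org/node-purpose: user
      tolerations:
        - key: hub.jupyter.org/dedicated
          operator: Equal
          value: user
          effect: NoSchedule
      containers:
        - name: rsync
          image: instrumentisto/rsync-ssh  # any small image with rsync works
          command:
            - /bin/sh
            - -c
            # Copy the read-only NFS content to the node-local disk every minute.
            - "while true; do rsync -a --delete /nfs-data/ /cache-data/; sleep 60; done"
          volumeMounts:
            - name: nfs-data
              mountPath: /nfs-data
              readOnly: true
            - name: cache-data
              mountPath: /cache-data
      volumes:
        - name: nfs-data
          nfs:
            server: 10.0.0.2   # placeholder: the Filestore instance's IP
            path: /data        # placeholder: the Filestore export / subpath
            readOnly: true
        - name: cache-data
          hostPath:
            path: /var/nh-data-cache   # placeholder: node-local cache directory
            type: DirectoryOrCreate
```

For step 3, participant pods would then mount that hostPath directory read-only at /nh/data, while instructor pods keep a direct NFS mount so they can write. In a zero-to-jupyterhub setup that selection would likely live in the hub's spawner configuration (e.g. extra volumes chosen per user group), but the exact wiring depends on how this deployment spawns user pods.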

Motivation

The NFS server we use is Filestore, Google Cloud's managed NFS service. We have a BASIC_HDD Filestore instance with a size of 1TB. According to this documentation we can expect a throughput of 100MB/s. That means that if 100 participants each want to read 1GB, that is 10 seconds of server throughput per participant, adding up to roughly 1000 seconds of waiting, which is too much.
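
Spelled out, the back-of-the-envelope estimate (assuming all participants share the full 100 MB/s and the reads are effectively serialized) is:

$$
\frac{100 \times 1\,\mathrm{GB}}{100\,\mathrm{MB/s}} = \frac{100{,}000\,\mathrm{MB}}{100\,\mathrm{MB/s}} = 1000\,\mathrm{s} \approx 17\ \text{minutes}
$$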

Also, according to the documentation about persistent disk throughput, a persistent disk's throughput scales with its size and is roughly 4 times higher for SSD than for HDD at a given size; the specific numbers are in that documentation.

yuvipanda commented 4 years ago

I think this might not be a problem, and if it is, switching to SSD Filestore might be an easier solution. If you mount NFS once per node, I think you can then think of 100MB/s as per-node, rather than per-user, since it's gonna get cached locally.