nebari-dev / nebari

🪴 Nebari - your open source data science platform
https://nebari.dev

Conda store worker scaling issues #2505

Open Adam-D-Lewis opened 3 weeks ago

Adam-D-Lewis commented 3 weeks ago

Context

We'd like to scale conda store workers up to allow many solves at the same time. See here for more info. However, the conda store worker pod also acts as an NFS server. I believe this is because writing the environment files after the solve would be slow over the network (to an NFS drive), so we co-locate the NFS server with the conda-store worker so files aren't going over the network when saving. However, this prevents scaling conda store workers beyond a single node (the node the NFS server is on).

Options:

  1. We could separate the conda store workers from the NFS server, but then the conda store workers become NFS clients and have to write many small files (~40k to ~150k files per env) over the network. That is slow and likely adds on the order of 10 minutes to the conda store env creation process (I want to benchmark this, but haven't yet; see the sketch after this list), which is not ideal and likely won't scale with many concurrent conda solves anyway. With this option, we'd probably also want to improve NFS performance. This guide or this guide may have some useful tips.

  2. We could try a distributed file system like CephFS, but it comes with a learning curve and additional operational complexity.

  3. We could tar the environments, write them to object storage, and add logic to pull them down when a user boots up a pod. This would require significant changes to our current process, so it would be a lot of work, and it may have issues we haven't considered yet (making sure users get new envs without restarting their pod, etc.).

  4. We could try EFS, Google Filestore, and Azure Files (not sure if DO has an NFS equivalent), but that requires more maintenance since it's a separate solution for each cloud provider. These are NFS or SMB drives anyway, so unless the cloud providers have optimized them somehow (very possible), they may not be any better than what we currently have.

  5. I'm really not sure if this is a good idea, but I'll throw it out there anyway. We could mount object storage as a filesystem: store the conda envs in object storage (not zipped or tarred) and use it like local file storage. Again, I'm not sure what drawbacks this might have. Latency on object storage is high, but maybe that's okay if it's just loading the Python binary and installed libraries from there. Not sure. Because latency is high, we may hit the same problems as the NFS drive, though (slow writing of environment files after the solve). FYI, I see that AWS S3 latency is ~100-200 milliseconds if we want to compare with our NFS performance later.
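To get real numbers for option 1, something like the sketch below could time writing many small files to a local path vs. an NFS mount. This is my own rough benchmark, not part of conda-store; the paths, file count (~40k), and 1 KiB file size are assumptions meant to roughly mimic a conda env layout.

```python
# Rough benchmark sketch: write many small files to a target directory and time it,
# so we can compare a local disk path against an NFS-mounted path.
import os
import sys
import time


def write_small_files(target_dir: str, num_files: int = 40_000, size_bytes: int = 1024) -> float:
    """Write num_files files of size_bytes each under target_dir; return elapsed seconds."""
    os.makedirs(target_dir, exist_ok=True)
    payload = b"x" * size_bytes
    start = time.perf_counter()
    for i in range(num_files):
        # Spread files across subdirectories, roughly like a conda env layout.
        subdir = os.path.join(target_dir, f"pkg_{i % 100}")
        os.makedirs(subdir, exist_ok=True)
        with open(os.path.join(subdir, f"file_{i}.py"), "wb") as f:
            f.write(payload)
    return time.perf_counter() - start


if __name__ == "__main__":
    # e.g. python bench_small_files.py /tmp/bench_local /mnt/nfs/bench_nfs
    for target in sys.argv[1:]:
        elapsed = write_small_files(target)
        print(f"{target}: {elapsed:.1f}s for 40k x 1 KiB files")
```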

Anything else?

I looked at how large envs were and how many files they consisted of on a Nebari deployment, and found the following.

[image: histogram of file counts per env]

You can see that many envs have ~10k files, but it's not uncommon to have ~100k, and some even have ~140k files.


[image: histogram of total env sizes]

Many envs are <1 GiB in total size, most envs are <10 GiB, but some are >20 GiB.


[image: histogram of individual file sizes]

This graph shows the distribution of individual file sizes. The x axis is log2 of the size in bytes, so 10, 20, and 30 correspond to 1 KiB, 1 MiB, and 1 GiB. The distribution is roughly lognormal with the peak around 1 KiB and few files > 1 MiB in size.
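For reference, stats like the above can be collected with a quick walk over the env directories. The root path and the namespace/envs layout below are assumptions about the deployment; adjust to wherever conda-store envs are mounted.

```python
# Sketch for gathering per-env file counts and total sizes.
import os
from pathlib import Path

ENV_ROOT = Path("/home/conda")  # assumption: where conda-store envs live on this deployment


def env_stats(env_dir: Path) -> tuple[int, int]:
    """Return (file_count, total_bytes) for one environment directory."""
    count, total = 0, 0
    for root, _dirs, files in os.walk(env_dir):
        for name in files:
            path = Path(root) / name
            try:
                total += path.stat().st_size
                count += 1
            except OSError:
                pass  # skip broken symlinks, permission errors, etc.
    return count, total


if __name__ == "__main__":
    # assumption: envs laid out as <namespace>/envs/<env_name>
    for env_dir in sorted(p for p in ENV_ROOT.glob("*/envs/*") if p.is_dir()):
        files, size = env_stats(env_dir)
        print(f"{env_dir}: {files} files, {size / 2**30:.2f} GiB")
```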

dharhas commented 3 weeks ago

We used to use EFS, including the high-performance EFS option. It still had the same read/write slowdown issues.

CephFS might be worth looking into. We have had clients who have used it before, and it is a mature solution.

Adam-D-Lewis commented 3 weeks ago

Older discussion on the topic: https://github.com/orgs/nebari-dev/discussions/1150. Chris recommends using docker images or tarballs of the environments.

Docker Images

I'm not sure how docker images would work. Conda store envs would need to include jupyterlab in every environment, along with any desired jupyterlab extensions, I guess? It'd also need to be compatible with whatever version of jupyterhub we have running. Then you'd only have one environment available in your user pod? You'd have to select the conda env you want to boot up along with the instance size?

Tarballs

I think tarballs would be similar to option 3 in my original description. I'll throw out some possible implementation details here. The conda-store worker would run as a sidecar container to the jupyterlab user server and be notified anytime an environment pertaining to that user (user or user groups) is created or modified, or its active version changes, as already occurs with conda-store workers. The sidecar container would tar and upload the saved environment to conda store and update the symlinks, similar to what the conda-store worker already does. I think the only other change would be that on startup, the sidecar would need to download all the tarballs from conda store and unpack them in the appropriate locations (see the sketch below). There is the potential for temporary weird file system bugs (fixed by booting up a new jupyterlab user pod) since there isn't a shared file system anymore. I wonder if we could display kernels and only download and unpack them when the user goes to use them (e.g. open a notebook, run conda activate, etc.).
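A minimal sketch of that startup step might look like this, assuming envs are stored as tarballs in S3-compatible object storage. The bucket name, key layout, and target path are hypothetical; conda-store's actual storage layout may differ.

```python
# Sketch of the "download and unpack on startup" step for the sidecar idea.
import tarfile
from pathlib import Path

import boto3

BUCKET = "conda-store-envs"            # assumption: bucket holding env tarballs
TARGET_ROOT = Path("/opt/conda-envs")  # assumption: where envs get unpacked in the user pod


def fetch_and_unpack(s3_client, namespace: str, env_name: str) -> Path:
    """Download <namespace>/<env_name>.tar.gz and unpack it under TARGET_ROOT."""
    key = f"{namespace}/{env_name}.tar.gz"
    local_tar = Path(f"/tmp/{env_name}.tar.gz")
    s3_client.download_file(BUCKET, key, str(local_tar))
    dest = TARGET_ROOT / namespace / env_name
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(local_tar) as tar:
        # Note: conda-pack'd envs also need `conda-unpack` run inside the env afterwards.
        tar.extractall(dest)
    local_tar.unlink()
    return dest


if __name__ == "__main__":
    s3 = boto3.client("s3")  # endpoint/credentials would come from the pod's config
    fetch_and_unpack(s3, "analyst", "my-env")
```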

This conda-store worker sidecar introduces other issues. The user pod would need to be big enough to build that particular conda env, and it would be a hassle to spin up a larger server just to do a conda build and then spin up your smaller server again. Maybe this memory issue is solved if every user pod has access to some high speed disk that can be used as swap as suggested here, but we need to test how much RAM conda solves use and how long they take with and without swap, to get a better idea of how big of an issue this is and whether swap will solve it satisfactorily. A rough measurement sketch is below.
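Peak solve memory could be measured with something like the following psutil-based sketch. The package list is just an example standing in for a real env spec, and `--dry-run` is used so conda performs the solve without installing anything.

```python
# Sketch: measure wall time and peak RSS of a conda solve (whole process tree).
import subprocess
import time

import psutil


def measure(cmd: list[str]) -> tuple[float, int]:
    """Run cmd, polling total RSS of the process tree; return (elapsed_s, peak_rss_bytes)."""
    start = time.perf_counter()
    proc = subprocess.Popen(cmd)
    parent = psutil.Process(proc.pid)
    peak = 0
    while proc.poll() is None:
        rss = 0
        try:
            procs = [parent] + parent.children(recursive=True)
        except psutil.NoSuchProcess:
            break
        for p in procs:
            try:
                rss += p.memory_info().rss
            except psutil.NoSuchProcess:
                continue  # child exited between listing and sampling
        peak = max(peak, rss)
        time.sleep(0.5)
    return time.perf_counter() - start, peak


if __name__ == "__main__":
    cmd = ["conda", "create", "--dry-run", "-n", "memtest", "numpy", "pandas", "scipy"]
    elapsed, peak = measure(cmd)
    print(f"solve took {elapsed:.0f}s, peak RSS {peak / 2**30:.2f} GiB")
```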

It might also introduce some security concerns since the conda store worker pod could modify the tarball prior to upload and affect other users (in the case of a group environment), but I'm not sure this is any worse than what we currently have. Users still won't have write access to that directory so they can't modify it prior to upload and we always require trust from the admin who chooses which docker image to use.

dcmcand commented 3 weeks ago

I have run a Ceph cluster in the past, and it is not trivial. However, as @dharhas says, it is a very mature solution. It looks like https://rook.io/ provides an easy on-ramp for Ceph on k8s. Looking at https://www.digitalocean.com/community/tutorials/how-to-set-up-a-ceph-cluster-within-kubernetes-using-rook#step-3-adding-block-storage, it seems like that could be a good way to improve our storage within Nebari.

dcmcand commented 3 weeks ago

It also provides object storage with an S3-compatible API, so we could replace MinIO here.
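For what it's worth, swapping MinIO for Ceph's RGW would mostly be a matter of changing the endpoint that existing S3 clients point at. A hedged sketch, with a placeholder in-cluster service URL, credentials, and bucket name:

```python
# Sketch: point S3-style code at a Ceph RGW endpoint instead of MinIO.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://rook-ceph-rgw-my-store.rook-ceph.svc:80",  # hypothetical RGW service
    aws_access_key_id="ACCESS_KEY",        # placeholder
    aws_secret_access_key="SECRET_KEY",    # placeholder
)

# The same S3 API calls work whether the backend is AWS S3, MinIO, or Ceph RGW.
s3.create_bucket(Bucket="conda-store")
s3.upload_file("environment.tar.gz", "conda-store", "analyst/my-env.tar.gz")
```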

Adam-D-Lewis commented 2 weeks ago

I'm looking into Ceph a bit in this issue