nebari-dev / nebari

🪴 Nebari - your open source data science platform
https://nebari.dev
BSD 3-Clause "New" or "Revised" License

Investigate Ceph for use in Nebari #2534

Closed Adam-D-Lewis closed 1 month ago

Adam-D-Lewis commented 3 months ago

Context

Rook Ceph might help with the conda-store worker scaling issues caused by slow NFS, as mentioned in https://github.com/nebari-dev/nebari/issues/2505. It might also make the user home directory more performant and give us the option of replacing the MinIO pods with Ceph's object storage as well.

I've successfully deployed the rook operator helm chart on an AKS cluster with Nebari and created a shared file system between 2 pods on different nodes. The steps were roughly as follows:

We'll need at least 3 nodes running for Ceph to take advantage of its high-availability and resiliency features. We should be able to make them smaller than the current general node; we may want to create a separate node group of 3 small-ish nodes just for the Ceph storage. I think the minimum node size for Ceph would be 2 vCPUs and 4 GiB RAM per node.

The storage class names we'll use to back Ceph storage will likely be different on each cloud provider. This should also be tested on the local deployment.
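To make that point concrete, a per-provider lookup might look like the sketch below. The StorageClass names here are assumptions based on common provider defaults, not values Nebari currently uses, and should be verified per cluster with `kubectl get storageclass`:

```python
# Illustrative only: assumed default StorageClass names that could back the
# Ceph OSDs on each provider. Verify on a real cluster before relying on these.
DEFAULT_STORAGE_CLASSES = {
    "aws": "gp2",            # EKS in-tree default; newer clusters often use gp3 via CSI
    "gcp": "standard-rwo",   # GKE CSI default
    "azure": "managed-csi",  # AKS CSI default
    "local": "standard",     # e.g. a kind/minikube local-path provisioner class
}

def storage_class_for(provider: str) -> str:
    """Look up the assumed default StorageClass name for a provider."""
    return DEFAULT_STORAGE_CLASSES[provider]

print(storage_class_for("azure"))
```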

Eventually, it'd be nice to set up automatic expansion of the storage as space fills up. It seems to be possible on at least some cloud providers, according to these docs.
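Mechanically, growing a PVC comes down to patching its storage request (the StorageClass must have `allowVolumeExpansion: true`). A minimal sketch, where the resource names in the commented-out call are placeholders rather than Nebari's actual resources:

```python
# Sketch: build the JSON merge patch that grows a PVC to a new size.
def build_pvc_expand_patch(new_size: str) -> dict:
    """Return the patch body that sets spec.resources.requests.storage (e.g. '200Gi')."""
    return {"spec": {"resources": {"requests": {"storage": new_size}}}}

patch = build_pvc_expand_patch("200Gi")
print(patch)

# Applying it with the official Kubernetes Python client would look roughly like
# (names are hypothetical):
# from kubernetes import client, config
# config.load_kube_config()
# client.CoreV1Api().patch_namespaced_persistent_volume_claim(
#     name="some-osd-claim", namespace="rook-ceph", body=patch)
```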

Adam-D-Lewis commented 3 months ago

A few issues encountered:

Adam-D-Lewis commented 3 months ago

I tested the PR linked to this issue and ran the following script. It took 10-17 seconds on that Nebari deployment, while it took ~60 s on a standard GCP deployment (no Ceph). It took 0.8 s against /tmp on each deployment as well. So it seems like Ceph is making writing small files 3-6x faster!

import os

# Define the directory where the files will be created
dir_path = "/home/ad/fio-test"
# dir_path = "/tmp/fio-test"

# Create the directory if it doesn't exist
os.makedirs(dir_path, exist_ok=True)

%%time
# `%%time` is an IPython cell magic, so the code below runs as a separate
# notebook cell and the whole cell is timed.
# Define the size of each file in bytes (1 KiB = 1024 bytes)
file_size = 1024

# Create 10,000 files
for i in range(10000):
    # Define the file path
    file_path = os.path.join(dir_path, f"file{i:04d}.txt")

    # Create the file and write random data to it
    with open(file_path, "wb") as f:
        f.write(os.urandom(file_size))

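For reference, the same benchmark can be run outside a notebook (where `%%time` isn't available) by timing with `time.perf_counter`; this is just a sketch, writing to a temporary directory rather than the home-directory path above:

```python
import os
import tempfile
import time

def benchmark_small_writes(dir_path: str, n_files: int = 10000, file_size: int = 1024) -> float:
    """Write n_files files of file_size random bytes into dir_path; return elapsed seconds."""
    os.makedirs(dir_path, exist_ok=True)
    start = time.perf_counter()
    for i in range(n_files):
        with open(os.path.join(dir_path, f"file{i:04d}.txt"), "wb") as f:
            f.write(os.urandom(file_size))
    return time.perf_counter() - start

with tempfile.TemporaryDirectory() as tmp:
    # Smaller run for a quick sanity check; point dir_path at the shared
    # filesystem and use the full 10,000 files to reproduce the numbers above.
    elapsed = benchmark_small_writes(tmp, n_files=1000)
    print(f"wrote 1000 files in {elapsed:.2f}s")
```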
Adam-D-Lewis commented 3 months ago

I tested a single-node Ceph setup, and the script above took 5-10 s. I assume the additional speedup over the multi-node setup is due to the lack of replication in a single-node setup.