nebius / soperator

Run Slurm in Kubernetes
Apache License 2.0

Performance of Jailed space #111

Open thien-lm opened 1 month ago

thien-lm commented 1 month ago

In theory, it seems that the jailed space will have poor performance. Did anyone face that issue when the number of workers in the Slurm cluster increases?

rdjjke commented 1 month ago

Hello thien-lm,

Thank you for the question!

The impact on performance largely depends on the storage solution you're using.

In general, shared storage tends to have higher latencies for I/O operations compared to non-shared options, though it can offer much higher overall throughput.

We tested three shared storage solutions in practice:

  1. NFS share
  2. Nebius shared filesystem
  3. GlusterFS

Here’s a breakdown based on two common usage scenarios (a rough way to try both access patterns on your own storage is sketched right after the list):

  1. Single-threaded access to many small files (< several MiB) with frequent metadata operations. Examples: installing apt packages, running executables, cloning small git repositories, editing files, etc. While there's no severe lag with any of these solutions, these operations tend to be slower than on non-shared storage.
    • NFS share: Provides the lowest latency; in small clusters (up to ~20 nodes), the experience is almost identical to local disk usage. However, latency increases linearly with larger clusters.
    • Nebius shared filesystem: Latency is noticeably higher, especially when installing large packages or cloning sizable repositories, but it remains usable. Importantly, performance stays consistent regardless of cluster size. Basic tasks like editing files or executing Slurm commands are handled smoothly.
    • GlusterFS: Similar to the Nebius shared filesystem, though it generally offers slightly lower latency and faster metadata operations.
  2. Multi-threaded access to large files (at least several MiB in size). Examples: downloading and reading datasets, writing checkpoints, and loading from checkpoints.
    • NFS share: Performance is poor when handling concurrent reads/writes on the same files from multiple processes or hosts. Clusters with over 50-100 nodes may struggle to maintain stability.
    • Nebius shared filesystem: Excels in handling parallel access, providing strong throughput (up to ~24Gb/s per host and up to 800Gb/s overall).
    • GlusterFS: Similar to Nebius shared filesystem, though with somewhat lower throughput (sorry, I don't know exact numbers).
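
If you want a rough feel for how a particular storage backend behaves on your own cluster, a quick single-host microbenchmark can help. The sketch below (plain Python, nothing Soperator-specific; it doesn't drop the page cache or coordinate across hosts, so treat the numbers as indicative only) exercises both patterns above: many small files with metadata operations, and multi-threaded reads of a single large file. The file counts and sizes are arbitrary defaults.

```python
#!/usr/bin/env python3
"""Rough microbenchmark for the two access patterns described above."""
import os
import sys
import time
import tempfile
from concurrent.futures import ThreadPoolExecutor

def small_files_pattern(base_dir, n_files=500, size=4096):
    """Scenario 1: many small files with frequent metadata operations."""
    start = time.perf_counter()
    with tempfile.TemporaryDirectory(dir=base_dir) as d:
        for i in range(n_files):
            path = os.path.join(d, f"f{i}")
            with open(path, "wb") as f:
                f.write(os.urandom(size))
            os.stat(path)                       # metadata operation
            with open(path, "rb") as f:
                f.read()
    elapsed = time.perf_counter() - start
    print(f"small files: {n_files} create/stat/read cycles in {elapsed:.2f}s "
          f"({n_files / elapsed:.0f} ops/s)")

def large_file_pattern(base_dir, size_mib=256, n_threads=8, chunk=8 << 20):
    """Scenario 2: multi-threaded reads of one large file (dataset/checkpoint-like)."""
    with tempfile.NamedTemporaryFile(dir=base_dir, delete=False) as f:
        path = f.name
        for _ in range(size_mib):
            f.write(os.urandom(1 << 20))        # write 1 MiB at a time
    try:
        def read_range(tid):
            # Each thread reads its own contiguous slice of the file.
            length = (size_mib << 20) // n_threads
            with open(path, "rb") as f:
                f.seek(tid * length)
                remaining = length
                while remaining > 0:
                    data = f.read(min(chunk, remaining))
                    if not data:
                        break
                    remaining -= len(data)
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=n_threads) as pool:
            list(pool.map(read_range, range(n_threads)))
        elapsed = time.perf_counter() - start
        gbit = size_mib * 8 / 1024 / elapsed
        print(f"large file: read {size_mib} MiB with {n_threads} threads "
              f"in {elapsed:.2f}s (~{gbit:.1f} Gbit/s)")
    finally:
        os.unlink(path)

if __name__ == "__main__":
    # Pass a directory on the filesystem you want to probe, e.g. a path inside the jail.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    small_files_pattern(target)
    large_file_pattern(target)
```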

Since Soperator is primarily designed for ML model training, distributed filesystems like GlusterFS or the Nebius shared filesystem are well-suited for the most demanding tasks such as checkpointing and dataset loading. These operations benefit the most from high throughput, while tasks like installing software are typically less frequent, so it's not a big deal if they take 2-3 times longer. If you use PyTorch, you can also raise num_workers and prefetch_factor on its DataLoader so that data loading overlaps with computation and the extra latency is hidden.
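
For the PyTorch part, the relevant DataLoader knobs look roughly like this; the dataset and file names are placeholders, and the right values for num_workers and prefetch_factor depend on your CPU count and storage latency:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class FileBackedDataset(Dataset):
    """Toy dataset standing in for one that reads samples from shared storage."""
    def __init__(self, paths):
        self.paths = paths

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # In a real job this would read/decode a sample from the shared FS;
        # here we just return a dummy tensor of a fixed shape.
        return torch.zeros(3, 224, 224)

paths = [f"sample_{i}.bin" for i in range(10_000)]  # hypothetical sample list
loader = DataLoader(
    FileBackedDataset(paths),
    batch_size=64,
    num_workers=8,            # several worker processes issue reads in parallel
    prefetch_factor=4,        # each worker keeps 4 batches in flight
    persistent_workers=True,  # avoid respawning workers every epoch
    pin_memory=True,          # faster host-to-GPU copies
)

for batch in loader:
    pass  # training step would go here
```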

Additionally, Soperator allows for flexible storage customization. For example, the "Jail" storage can be an NFS share, while "Jail submounts" can be backed by distributed filesystems. These submounts can leverage any storage type supported by your Kubernetes cluster (e.g., ephemeral or persistent, local or shared, disk-based or in-memory, S3 or OCI) to meet specific use cases.

Some files and directories are non-shared by default: all virtual filesystems (including /tmp and /var/run), some low-level GPU libraries, and some config files (though those are identical across nodes).
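
If you're unsure whether a given path inside the jail ends up on shared or node-local storage, one quick check from within a job is to look at which mount backs it. A minimal sketch that just parses /proc/mounts (standard Linux, nothing Soperator-specific):

```python
import os

def backing_mount(path):
    """Return (mount_point, fs_type) for the filesystem that backs `path`,
    by picking the longest matching mount point from /proc/mounts."""
    path = os.path.realpath(path)
    best = ("", "unknown")
    with open("/proc/mounts") as f:
        for line in f:
            _dev, mnt, fstype, *_ = line.split()
            mnt = mnt.replace("\\040", " ")  # /proc/mounts escapes spaces as \040
            if (path == mnt or path.startswith(mnt.rstrip("/") + "/")) and len(mnt) > len(best[0]):
                best = (mnt, fstype)
    return best

for p in ("/tmp", "/var/run", os.path.expanduser("~")):
    mnt, fstype = backing_mount(p)
    print(f"{p}: mounted at {mnt} ({fstype})")
```

Shared mounts typically show up with an nfs, fuse.glusterfs, or virtiofs-style filesystem type, while node-local paths show tmpfs or a local block filesystem.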

Some links: