thien-lm opened this issue 1 month ago
Hello thien-lm,
Thank you for the question!
The impact on performance largely depends on the storage solution you're using.
In general, shared storage tends to have higher latencies for I/O operations compared to non-shared options, though it can offer much higher overall throughput.
We have tested three shared storage solutions in practice.
Here’s a breakdown based on two common usage scenarios:
Since Soperator is primarily designed for ML model training, distributed filesystems like GlusterFS or the Nebius shared filesystem are well suited for the most demanding tasks, such as checkpointing and dataset loading. These operations benefit the most from high throughput, while tasks like installing software are infrequent enough that taking 2-3 times longer is usually acceptable. If you use PyTorch, you can also raise `num_workers` and `prefetch_factor` on its `DataLoader` to keep data loading fast on higher-latency storage.
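For illustration, here is a minimal sketch of such a `DataLoader` configuration. The dataset is a synthetic in-memory placeholder, and the specific values for `batch_size`, `num_workers`, and `prefetch_factor` are illustrative assumptions, not recommendations; tune them for your CPU count and storage:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder in-memory dataset standing in for data read from shared storage.
dataset = TensorDataset(
    torch.randn(10_000, 128),
    torch.randint(0, 10, (10_000,)),
)

# More workers and a larger prefetch_factor overlap I/O with compute, which
# helps hide the higher per-request latency of shared filesystems.
loader = DataLoader(
    dataset,
    batch_size=256,
    num_workers=8,            # parallel reader processes; tune to CPU count
    prefetch_factor=4,        # batches prefetched per worker (needs num_workers > 0)
    persistent_workers=True,  # keep workers alive between epochs
    pin_memory=True,          # faster host-to-GPU transfers
)

if __name__ == "__main__":
    for inputs, labels in loader:
        pass  # training step goes here
```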
Additionally, Soperator allows for flexible storage customization. For example, the "Jail" storage can be an NFS share, while "Jail submounts" can be backed by distributed filesystems. These submounts can leverage any storage type supported by your Kubernetes cluster (e.g., ephemeral or persistent, local or shared, disk-based or in-memory, S3 or OCI) to meet specific use cases.
Some files and directories are non-shared by default: all virtual filesystems (including `/tmp` and `/var/run`), some low-level GPU libraries, and some config files (though these are identical across nodes).
Some links:
- Performance on NFS vs. distributed filesystems as the cluster size scales: https://www.semianalysis.com/p/ai-neocloud-playbook-and-anatomy?open=false#§drivers-user-experience-and-software
In theory, it seems that the jailed space will have poor performance. Did anyone face that issue as the number of workers in the Slurm cluster increases?