substratusai / runbooks

Finetune LLMs on K8s by using Runbooks
https://www.substratus.ai

Support for AWS #12

Open nstogner opened 1 year ago

nstogner commented 1 year ago

To support AWS we would need to:

brandonjbjelland commented 1 year ago

I wanted to get a start on understanding GCSFuse equivalents. It turns out AWS either makes it difficult to use S3 as a file system or charges for the privilege! Notably, S3 doesn't appear on the k8s-csi supported drivers list. Here's what I came away with:

  1. s3fs-fuse seems to have a good amount of mindshare but doesn't support IRSA. This linked project scratches at how one might use it in EKS but the production-readiness is def not there.
  2. yandex-cloud produces a geesefs-based S3 filesystem with a CSI driver that is close to POSIX-compliant. There's some support for it here on SO. The geesefs readme (if it's to be believed) has a nice table comparing performance and POSIX compatibility against other projects here. Of those, I don't think rclone fits the bill, though it potentially could.
  3. goofys looks compelling from a performance and cross-platform standpoint (it also has a fan in an HN post) but appears to have shortcomings compared to the two above. I wonder if the first non-POSIX behavior outlined is a dealbreaker or not:

Checkpointing models, logging notebook outputs, saving datasets: I believe these all constitute sequential write operations, but it's hard to know for sure. This blog post outlines how to use goofys within an EKS context: https://dev.to/otomato_io/mount-s3-objects-to-kubernetes-pods-12f5
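One way to stop guessing whether a given writer is sequential would be to wrap the file object and record any write that does not start where the previous one ended. The sketch below is only an illustration: `SequentialWriteProbe` and `demo` are hypothetical names, it uses in-memory buffers rather than a real mount, and it says nothing about how goofys or any other driver reacts to a non-sequential write.

```python
import io


class SequentialWriteProbe:
    """Wrap a seekable binary file and record writes that are not append-only.

    Hypothetical helper for illustration; not part of any S3 FUSE project.
    """

    def __init__(self, raw):
        self._raw = raw
        self._next_offset = raw.tell()  # where a sequential write must start
        self.violations = []  # (expected_offset, actual_offset) pairs

    def write(self, data):
        offset = self._raw.tell()
        if offset != self._next_offset:
            self.violations.append((self._next_offset, offset))
        written = self._raw.write(data)
        self._next_offset = offset + written
        return written

    def seek(self, pos, whence=io.SEEK_SET):
        return self._raw.seek(pos, whence)

    def tell(self):
        return self._raw.tell()

    @property
    def sequential(self):
        return not self.violations


def demo():
    # Append-only pattern, e.g. streaming a checkpoint chunk by chunk.
    streamed = SequentialWriteProbe(io.BytesIO())
    for chunk in (b"header", b"weights", b"footer"):
        streamed.write(chunk)

    # Placeholder header, payload, then seek back to patch the header.
    patched = SequentialWriteProbe(io.BytesIO())
    patched.write(b"\x00" * 8)
    patched.write(b"payload")
    patched.seek(0)
    patched.write(b"\x07" * 8)

    return streamed.sequential, patched.violations


if __name__ == "__main__":
    print(demo())
```

Passing a probe like this to whatever saves checkpoints or datasets (anything that accepts a file-like object) would show whether the writer ever seeks backwards, which is the pattern a sequential-write-only filesystem could not serve.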

Users of s3fs-fuse seem to have in part migrated toward goofys due to lack of support.

On the AWS-supported side, the EKS module makes adding the fsx-lustre-csi-driver easy. This seems like the closest thing to gcsfuse, but a closer look at the pricing page makes me think it runs counter to our project goals:

File system storage: You pay for the average amount of storage provisioned for your file systems per month, measured in gigabyte-months "GB-months," as shown in the pricing examples.

Paying for provisioned capacity rather than for what is actually stored seems like a non-starter for substratus.


I come away thinking yandex-cloud/k8s-csi-s3 and kahing/goofys are the best contenders and warrant a closer look. Rclone has no officially supported CSI driver (there is an unofficial one here), which may be sufficient to give it a test in the bake-off.

Worth mentioning: as a totally different option that I all but discarded, CSI drivers for EBS and EFS exist, but I think we should stick to blob storage.

nstogner commented 1 year ago

I wonder if the first non-POSIX behavior outlined is a dealbreaker or not

GCS Fuse is non-POSIX: https://cloud.google.com/storage/docs/gcs-fuse#expandable-1

One more (abandoned) CSI driver project for reference: https://github.com/ctrox/csi-s3/tree/master

brandonjbjelland commented 1 year ago

And yet another: https://github.com/gaul/s3proxy (I think I closed it instantly when I saw it was Java)

samos123 commented 1 year ago

Something else to consider, which would unblock any environment, is including a K8s-native storage provider such as OpenEBS with Maya. If OpenEBS + Maya performs better, it might let us get rid of GCSfuse as well. It seems Maya doesn't depend on iSCSI and relies on pure TCP, so it would work in any environment.

The downside of such an approach is that there is no way to get a signed URL or anything similar, so we would have to find another way to upload tars to remote containers.

https://openebs.io/docs/introduction/usecases#building-scalable-websites-and-ml-pipelines