iterative / terraform-provider-iterative

☁️ Terraform plugin for machine learning workloads: spot instance recovery & auto-termination | AWS, GCP, Azure, Kubernetes
https://registry.terraform.io/providers/iterative/iterative/latest/docs
Apache License 2.0

readme intro suggestions #558

Open dberenbaum opened 2 years ago

dberenbaum commented 2 years ago

TPI is a Terraform plugin built with machine learning in mind. This CLI tool offers full lifecycle management of computing resources (including GPUs and respawning spot instances) from several cloud vendors (AWS, Azure, GCP, K8s)... without needing to be a cloud expert.

  • Lower cost with spot recovery: transparent data checkpoint/restore & auto-respawning of low-cost spot/preemptible instances
  • No cloud vendor lock-in: switch between clouds with just one line thanks to unified abstraction
  • No waste: auto-cleanup unused resources (terminate compute instances upon task completion/failure & remove storage upon download of results), pay only for what you use
  • Developer-first experience: one-command data sync & code execution with no external server, making the cloud feel like a laptop

Sorry for not doing this earlier, but hopefully it's still valuable to discuss.

I think the first sentence should focus on user workflow and not tools (Terraform). What about something like "Run your ML training in the cloud... without needing to be a cloud expert"?

The individual bullets feel more like they belong under "Why TPI?" They describe TPI's particular benefits over other solutions more than they explain what it does or when to use it, and I wouldn't understand enough to care until I knew more about what it does. Some ideas for bullets here (the ones in brackets are probably less essential to the basic workflow even though they provide major benefits):

Also consider embedding https://www.youtube.com/watch?v=2fEgO8SazSE. I think this gives a great succinct description of how to use TPI to scale up your ML training.