TPI is a Terraform plugin built with machine learning in mind. This CLI tool offers full lifecycle management of computing resources (including GPUs and respawning spot instances) from several cloud vendors (AWS, Azure, GCP, K8s)... without needing to be a cloud expert.
Lower cost with spot recovery: transparent data checkpoint/restore & auto-respawning of low-cost spot/preemptible instances
No cloud vendor lock-in: switch between clouds with just one line thanks to unified abstraction
No waste: auto-cleanup unused resources (terminate compute instances upon task completion/failure & remove storage upon download of results), pay only for what you use
Developer-first experience: one-command data sync & code execution with no external server, making the cloud feel like a laptop
Sorry for not doing this earlier, but hopefully it's still valuable to discuss.
I think the first sentence should focus on user workflow and not tools (Terraform). What about something like "Run your ML training in the cloud... without needing to be a cloud expert"?
The individual bullets feel like they belong under "Why TPI?" They describe its benefits over other solutions more than they explain what TPI does or when to use it. I wouldn't understand enough to care until I knew more about what it does. Some ideas for bullets here (the ones in brackets are probably less essential to the basic workflow, even though they provide major benefits):
Configure everything (commands, data to sync, cloud resource requirements) in a single file.
Upload the data and run the job in the cloud with a single command.
[Get live logs and outputs from your local machine.]
[Keep the job running even if it's interrupted or your local machine shuts down.]
Automatically download the results and tear down the cloud resources when complete.
Also consider embedding https://www.youtube.com/watch?v=2fEgO8SazSE. I think it gives a great, succinct description of how to use TPI to scale up your ML training.