TPI is a Terraform plugin built with machine learning in mind. This CLI tool offers full lifecycle management of computing resources (including GPUs and respawning spot instances) from several cloud vendors (AWS, Azure, GCP, K8s)... without needing to be a cloud expert.
Supported cloud vendors include AWS, Azure, GCP, and Kubernetes.
There are several reasons to use TPI instead of other related solutions (custom scripts and/or cloud orchestrators):
[^scalers]: AWS Auto Scaling Groups, Azure VM Scale Sets, GCP managed instance groups, and Kubernetes Jobs.
TPI is used to power CML, bringing cloud providers to existing GitHub, GitLab & Bitbucket CI/CD workflows (repository).
brew tap hashicorp/tap && brew install hashicorp/tap/terraform
choco install terraform
conda install -c conda-forge terraform
sudo apt-get update && sudo apt-get install -y gnupg software-properties-common curl
curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo apt-key add -
sudo apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs) main"
sudo apt-get update && sudo apt-get install terraform
In a project root directory, create a file named main.tf with the following contents:
terraform {
  required_providers { iterative = { source = "iterative/iterative" } }
}
provider "iterative" {}
resource "iterative_task" "example" {
  cloud     = "aws" # or any of: gcp, az, k8s
  machine   = "m"   # medium. Or any of: l, xl, m+k80, xl+v100, ...
  spot      = 0     # auto-price. Default -1 to disable, or >0 for hourly USD limit
  disk_size = -1    # GB. Default -1 for automatic
  storage {
    workdir = "."       # default blank (don't upload)
    output  = "results" # default blank (don't download). Relative to workdir
  }
  script = <<-END
    #!/bin/bash
    # create output directory if needed
    mkdir -p results
    # read last result (in case of spot/preemptible instance recovery)
    if test -f results/epoch.txt; then EPOCH="$(cat results/epoch.txt)"; fi
    EPOCH=$${EPOCH:-1} # start from 1 if last result not found
    echo "(re)starting training loop from $EPOCH up to 1337 epochs"
    for epoch in $(seq $EPOCH 1337); do
      sleep 1
      echo "$epoch" | tee results/epoch.txt
    done
  END
}
See the reference for the full list of options for main.tf, including more information on machine types with and without GPUs.
Run this once (in the directory containing main.tf) to download the required_providers:
terraform init
export TF_LOG_PROVIDER=INFO
terraform apply
This launches a machine in the cloud, uploads workdir, and runs the script. Upon completion (or error), the machine is terminated.
With spot/preemptible instances (spot >= 0), auto-recovery logic and persistent (disk_size) storage will be used to relaunch interrupted tasks.
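Relaunching only helps if the script can pick up from whatever it last wrote to its output directory. The sketch below replays the checkpoint pattern from the example script locally, with no cloud resources involved. The function name `train` and its second argument (the epoch at which an interruption is simulated) are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# Local simulation of the resume-from-checkpoint pattern in the example script.
# train LAST STOP: run epochs up to LAST, but return early (status 1) right
# after writing epoch STOP, mimicking a spot instance being reclaimed.
# Both the helper and its arguments are hypothetical, not part of TPI.
train() {
  last="$1"
  stop="$2"
  mkdir -p results
  # read the last result, starting from 1 if no checkpoint exists
  epoch=1
  if test -f results/epoch.txt; then epoch="$(cat results/epoch.txt)"; fi
  while [ "$epoch" -le "$last" ]; do
    echo "$epoch" > results/epoch.txt          # checkpoint the current epoch
    if [ "$epoch" -eq "$stop" ]; then return 1; fi  # simulated interruption
    epoch=$((epoch + 1))
  done
  return 0
}
```

Calling `train 10 4` stops after epoch 4; a second call such as `train 10 0` then resumes at epoch 4 instead of starting over, because the checkpoint file survived the interruption. As in the example script, the interrupted epoch itself is repeated.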
Results and logs are periodically synced to persistent cloud storage. To query this status and view logs:
terraform refresh
terraform show
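Because results and logs are synced periodically, checking on a task usually means repeating the two commands above. A minimal sketch of a polling helper follows. `watch_task` and its interval and count arguments are hypothetical conveniences, not part of TPI or Terraform, and the helper makes no attempt to detect when the task has finished.

```shell
#!/bin/sh
# Hypothetical helper: repeat `terraform refresh` and `terraform show`
# a fixed number of times. Run it from the directory containing main.tf.
watch_task() {
  interval="${1:-30}"  # seconds to wait between checks
  count="${2:-10}"     # number of checks before returning
  i=0
  while [ "$i" -lt "$count" ]; do
    terraform refresh > /dev/null || return 1  # sync the latest state
    terraform show                             # print status and logs
    i=$((i + 1))
    if [ "$i" -lt "$count" ]; then sleep "$interval"; fi
  done
  return 0
}
```

For example, `watch_task 60 5` would check once a minute for five minutes and then return, leaving the decision of when to stop the task to you.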
terraform destroy
This terminates the machine (if still running), downloads output, and removes the persistent disk_size storage.
This diagram may help to show what TPI does under the hood:
flowchart LR
subgraph tpi [what TPI manages]
direction LR
subgraph you [what you manage]
direction LR
A([Personal Computer])
end
B[("Cloud Storage (low cost)")]
C{{"Cloud instance scaler (zero cost)"}}
D[["Cloud (spot) Instance"]]
A ---> |2. create cloud storage| B
A --> |1. create cloud instance scaler| C
A ==> |3. upload script & workdir| B
A -.-> |"4. offline (lunch break)"| A
C -.-> |"5. (re)provision instance"| D
D ==> |7. run script| D
B <-.-> |6. persistent workdir cache| D
D ==> |8. script end,\nshutdown instance| B
D -.-> |outage| C
B ==> |9. download output| A
end
style you fill:#FFFFFF00,stroke:#13ADC7
style tpi fill:#FFFFFF00,stroke:#FFFFFF00,stroke-width:0px
style A fill:#13ADC7,stroke:#333333,color:#000000
style B fill:#945DD5,stroke:#333333,color:#000000
style D fill:#F46737,stroke:#333333,color:#000000
style C fill:#7B61FF,stroke:#333333,color:#000000
TPI is a CLI tool bringing the power of bare-metal cloud to a bare-metal local laptop. We're working on more featureful and visual interfaces. We'd also like to have more native support for distributed (multi-instance) training, more data sync optimisations & options, and tighter ecosystem integration with tools such as DVC. Plus of course more examples for Data Scientists and Machine Learning Engineers - from Jupyter, VSCode, and Codespaces to improving the live logging/monitoring/reporting experience.
The getting started guide has some more information. In case of errors, extra debugging information is available using TF_LOG_PROVIDER=DEBUG instead of INFO.
Feature requests and bugs can be reported via GitHub issues, while general questions and feedback are very welcome on our active Discord server.
Instead of using the latest stable release, use a local copy of the repository:
git clone https://github.com/iterative/terraform-provider-iterative
cd terraform-provider-iterative
make install
Use source = "github.com/iterative/iterative" in your main.tf to use the local repository (source = "iterative/iterative" will download the latest release instead), and run terraform init --upgrade.
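Editing the source line by hand works fine; for repeated switching, the change could also be scripted. This is only a sketch: `use_local_provider` is a hypothetical helper, and it assumes the source line in main.tf is written exactly as in the example configuration.

```shell
#!/bin/sh
# Hypothetical helper: point main.tf at the locally built provider by
# rewriting the release source to the local one. Lines that already use
# the local source are left unchanged.
use_local_provider() {
  file="${1:-main.tf}"
  sed 's|source *= *"iterative/iterative"|source = "github.com/iterative/iterative"|' \
    "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}
```

After calling `use_local_provider main.tf`, run `terraform init --upgrade` so that Terraform picks up the local build.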
This project and all contributions to it are distributed under