iterative / terraform-provider-iterative

☁️ Terraform plugin for machine learning workloads: spot instance recovery & auto-termination | AWS, GCP, Azure, Kubernetes
https://registry.terraform.io/providers/iterative/iterative/latest/docs
Apache License 2.0

Terraform Provider Iterative (TPI)

TPI is a Terraform plugin built with machine learning in mind. This CLI tool offers full lifecycle management of computing resources (including GPUs and respawning spot instances) from several cloud vendors (AWS, Azure, GCP, K8s), without requiring you to be a cloud expert.

Supported cloud vendors include:

  - Amazon Web Services (AWS)
  - Microsoft Azure
  - Google Cloud Platform (GCP)
  - Kubernetes (K8s)

Why TPI?

There are several reasons to use TPI instead of other related solutions (custom scripts and/or cloud orchestrators):

  1. Reduced management overhead and infrastructure cost: TPI is a CLI tool, not a running service. It requires no additional orchestrating machine (control plane/head nodes) to schedule/recover/terminate instances. Instead, TPI runs (spot) instances via cloud-native scaling groups[^scalers], taking care of recovery and termination automatically on the cloud provider's side. This design reduces management overhead & infrastructure costs. You can close your laptop while cloud tasks are running — auto-recovery happens even if you are offline.
  2. Unified tool for data science and software development teams: TPI provides consistent tooling for both data scientists and DevOps engineers, improving cross-team collaboration. This simplifies compute management to a single config file, and reduces time to deliver ML models into production.
  3. Reproducible, codified environments: Store hardware requirements in a single configuration file alongside the rest of your ML pipeline code.

[^scalers]: AWS Auto Scaling Groups, Azure VM Scale Sets, GCP managed instance groups, and Kubernetes Jobs.

TPI is used to power CML, bringing cloud providers to existing GitHub, GitLab & Bitbucket CI/CD workflows (repository).

Usage

Requirements

Define a Task

In a project root directory, create a file named main.tf with the following contents:

terraform {
  required_providers { iterative = { source = "iterative/iterative" } }
}
provider "iterative" {}

resource "iterative_task" "example" {
  cloud      = "aws" # or any of: gcp, az, k8s
  machine    = "m"   # medium. Or any of: l, xl, m+k80, xl+v100, ...
  spot       = 0     # auto-price. Default -1 to disable, or >0 for hourly USD limit
  disk_size  = -1    # GB. Default -1 for automatic

  storage {
    workdir = "."       # default blank (don't upload)
    output  = "results" # default blank (don't download). Relative to workdir
  }
  script = <<-END
    #!/bin/bash

    # create output directory if needed
    mkdir -p results
    # read last result (in case of spot/preemptible instance recovery)
    if test -f results/epoch.txt; then EPOCH="$(cat results/epoch.txt)"; fi
    EPOCH=$${EPOCH:-1}  # start from 1 if last result not found

    echo "(re)starting training loop from $EPOCH up to 1337 epochs"
    for epoch in $(seq $EPOCH 1337); do
      sleep 1
      echo "$epoch" | tee results/epoch.txt
    done
  END
}

See the reference for the full list of options for main.tf -- including more information on machine types with and without GPUs.
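
Because the cloud argument accepts any of the four vendors listed in the comment above, one way to switch vendors without editing main.tf is a standard Terraform input variable. The following is a minimal sketch rather than anything TPI prescribes: the variable name cloud, its default, the resource name portable, and the placeholder script are all assumptions made for illustration.

```hcl
# Hypothetical input variable; the allowed values mirror the comment in main.tf.
variable "cloud" {
  type        = string
  default     = "aws"
  description = "Cloud vendor to run the task on"

  validation {
    condition     = contains(["aws", "gcp", "az", "k8s"], var.cloud)
    error_message = "The cloud value must be one of: aws, gcp, az, k8s."
  }
}

# Hypothetical task that reads the vendor from the variable above.
resource "iterative_task" "portable" {
  cloud   = var.cloud
  machine = "m"

  script = <<-END
    #!/bin/bash
    echo "running on the selected cloud"
  END
}
```

With this in place, running terraform apply -var="cloud=gcp" would select GCP for that run, while a plain terraform apply falls back to the default.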

Run this once (in the directory containing main.tf) to download the required_providers:

terraform init
export TF_LOG_PROVIDER=INFO

Run Task

terraform apply

This launches a machine in the cloud, uploads workdir, and runs the script. Upon completion (or error), the machine is terminated.

With spot/preemptible instances (spot >= 0), auto-recovery logic and persistent (disk_size) storage will be used to relaunch interrupted tasks.
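
As a sketch of the spot settings described above, the task below caps the spot price at an hourly limit and requests a fixed disk size instead of the automatic one. The resource name capped, the 0.50 USD limit, the 50 GB disk, and the placeholder script are illustrative assumptions; the argument names and the m+k80 machine type are taken from the example in Define a Task.

```hcl
# Hypothetical variant of the example task with a spot price cap.
resource "iterative_task" "capped" {
  cloud     = "aws"
  machine   = "m+k80" # medium machine with a K80 GPU
  spot      = 0.5     # assumed hourly USD limit; 0 means auto-price, -1 disables spot
  disk_size = 50      # GB (assumed value); -1 means automatic

  storage {
    workdir = "."       # uploaded before the script starts
    output  = "results" # downloaded on terraform destroy
  }

  script = <<-END
    #!/bin/bash
    mkdir -p results
    echo "done" > results/status.txt
  END
}
```

Because spot is 0 or greater here, an interrupted instance would be relaunched with the same persistent storage, so a real script should resume from its last saved result as the training loop in the main example does.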

Query Status

Results and logs are periodically synced to persistent cloud storage. To query this status and view logs:

terraform refresh
terraform show

End Task

terraform destroy

This terminates the machine (if still running), downloads output, and removes the persistent disk_size storage.

Example Projects

How it Works

The following diagram shows what TPI does under the hood:

flowchart LR
subgraph tpi [what TPI manages]
direction LR
    subgraph you [what you manage]
        direction LR
        A([Personal Computer])
    end
    B[("Cloud Storage (low cost)")]
    C{{"Cloud instance scaler (zero cost)"}}
    D[["Cloud (spot) Instance"]]
    A ---> |2. create cloud storage| B
    A --> |1. create cloud instance scaler| C
    A ==> |3. upload script & workdir| B
    A -.-> |"4. offline (lunch break)"| A
    C -.-> |"5. (re)provision instance"| D
    D ==> |7. run script| D
    B <-.-> |6. persistent workdir cache| D
    D ==> |8. script end,\nshutdown instance| B
    D -.-> |outage| C
    B ==> |9. download output| A
end
style you fill:#FFFFFF00,stroke:#13ADC7
style tpi fill:#FFFFFF00,stroke:#FFFFFF00,stroke-width:0px
style A fill:#13ADC7,stroke:#333333,color:#000000
style B fill:#945DD5,stroke:#333333,color:#000000
style D fill:#F46737,stroke:#333333,color:#000000
style C fill:#7B61FF,stroke:#333333,color:#000000

Future Plans

TPI is a CLI tool bringing the power of bare-metal cloud to a bare-metal local laptop. We're working on more featureful and visual interfaces. We'd also like to have more native support for distributed (multi-instance) training, more data sync optimisations & options, and tighter ecosystem integration with tools such as DVC. We also plan to add more examples for Data Scientists and Machine Learning Engineers, from Jupyter, VSCode, and Codespaces to an improved live logging, monitoring, and reporting experience.

Help

The getting started guide has some more information. In case of errors, extra debugging information is available using TF_LOG_PROVIDER=DEBUG instead of INFO.

Feature requests and bugs can be reported via GitHub issues, while general questions and feedback are very welcome on our active Discord server.

Contributing

To contribute, use a local copy of the repository instead of the latest stable release.

  1. Install Go 1.17+
  2. Clone the repository & build the provider
    git clone https://github.com/iterative/terraform-provider-iterative
    cd terraform-provider-iterative
    make install
  3. Use source = "github.com/iterative/iterative" in your main.tf to use the local repository (source = "iterative/iterative" will download the latest release instead), and run terraform init --upgrade
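
For step 3, the top of main.tf might look like the sketch below. It only swaps in the source value named above; whether Terraform resolves it to your local build depends on make install having placed the binary where Terraform looks for local plugins, which is assumed here.

```hcl
terraform {
  required_providers {
    # Assumed to resolve to the provider built locally by `make install`.
    # Use "iterative/iterative" instead to download the latest release.
    iterative = { source = "github.com/iterative/iterative" }
  }
}
provider "iterative" {}
```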

Copyright

This project and all contributions to it are distributed under the Apache-2.0 license.