google / xpk

Apache License 2.0

Overview

xpk (Accelerated Processing Kit, pronounced x-p-k) is a software tool that helps Cloud developers orchestrate training jobs on accelerators such as TPUs and GPUs on GKE. xpk handles the "multihost pods" of TPUs, GPUs (HGX H100) and CPUs (n2-standard-32) as first-class citizens.

xpk decouples provisioning capacity from running jobs. There are two structures: clusters (provisioned VMs) and workloads (training jobs). Clusters represent the physical resources you have available. Workloads represent training jobs -- at any time some of these will be completed, others will be running and some will be queued, waiting for cluster resources to become available.

The ideal workflow starts by provisioning clusters for all of the ML hardware you have reserved. Then, without re-provisioning, submit jobs as needed. By eliminating re-provisioning between jobs, and by using Docker containers with pre-installed dependencies and ahead-of-time compilation, these queued jobs run with minimal start times. Further, because workloads return their hardware to the shared pool when they complete, developers can make better use of finite hardware resources, and automated tests can run overnight while resources tend to be underutilized.

xpk supports the following TPU types:

and the following GPU types:

and the following CPU types:

Installation

To install xpk, run the following command:

pip install xpk

If you are running XPK from a clone of the GitHub repository, first run the following commands to begin using XPK commands:

git clone https://github.com/google/xpk.git
cd xpk
# Install dependencies such as cloud-accelerator-diagnostics
pip install .

If you see an error saying "This environment is externally managed", use a virtual environment.

Example:

  ## One time step of creating the venv
  VENV_DIR=~/venvp3
  python3 -m venv $VENV_DIR
  ## Enter your venv.
  source $VENV_DIR/bin/activate
  ## Clone the repository and install dependencies.
  git clone https://github.com/google/xpk.git
  cd xpk
  # Install dependencies such as cloud-accelerator-diagnostics
  pip install .

XPK for Large Scale (>1k VMs)

Follow the instructions in xpk-large-scale-guide.sh to use xpk with a GKE cluster of more than 1,000 VMs. These steps set up a GKE cluster with large-scale training and high-throughput support with XPK, and run jobs with XPK. We recommend you manually copy the commands step by step and verify the output of each step.

Example usages:

To get started, be sure to set your GCP Project and Zone as usual via gcloud config set.

Below are reference commands. A typical journey starts with a Cluster Create followed by many Workload Creates. To understand the state of the system you might want to use Cluster List or Workload List commands. Finally, you can cleanup with a Cluster Delete.

If you have failures with workloads not running, use xpk inspector to investigate further.

Cluster Create

First set the project and zone through gcloud config or xpk arguments.

PROJECT_ID=my-project-id
ZONE=us-east5-b
# gcloud config:
gcloud config set project $PROJECT_ID
gcloud config set compute/zone $ZONE
# xpk arguments
xpk .. --zone $ZONE --project $PROJECT_ID

The cluster created is a regional cluster to enable the GKE control plane across all zones.

Create Vertex AI Tensorboard

Note: This feature is available in XPK >= 0.4.0. Enable the Vertex AI API in your Google Cloud console to use this feature, and make sure your user account has the Vertex AI Administrator role.

Vertex AI Tensorboard is a fully managed version of open-source Tensorboard. To learn more about Vertex AI Tensorboard, visit this. Note that Vertex AI Tensorboard is only available in these regions.

You can create a Vertex AI Tensorboard for your cluster with the Cluster Create command. XPK creates a single Vertex AI Tensorboard instance per cluster.

python3 xpk.py cluster create \
--cluster xpk-test --num-slices=1 --tpu-type=v4-8 \
--create-vertex-tensorboard

will create a Vertex AI Tensorboard with the name xpk-test-tb-instance (the cluster name suffixed with -tb-instance) in us-central1 (the default region).

python3 xpk.py cluster create \
--cluster xpk-test --num-slices=1 --tpu-type=v4-8 \
--create-vertex-tensorboard --tensorboard-region=us-west1

will create a Vertex AI Tensorboard with the name xpk-test-tb-instance (-tb-instance) in us-west1.

python3 xpk.py cluster create \
--cluster xpk-test --num-slices=1 --tpu-type=v4-8 \
--create-vertex-tensorboard --tensorboard-name=tb-testing

will create a Vertex AI Tensorboard with the name tb-testing in us-central1.

python3 xpk.py cluster create \
--cluster xpk-test --num-slices=1 --tpu-type=v4-8 \
--create-vertex-tensorboard --tensorboard-region=us-west1 --tensorboard-name=tb-testing

will create a Vertex AI Tensorboard instance with the name tb-testing in us-west1.

python3 xpk.py cluster create \
--cluster xpk-test --num-slices=1 --tpu-type=v4-8 \
--create-vertex-tensorboard --tensorboard-region=us-central2

will fail the cluster creation process because Vertex AI Tensorboard is not supported in us-central2.
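
The default naming and region behavior shown above can be summarized with a small sketch. This is illustrative only, not xpk's implementation; the TENSORBOARD_NAME and TENSORBOARD_REGION variables stand in for the --tensorboard-name and --tensorboard-region flags:

```shell
# Illustrative sketch of the Tensorboard naming/region defaults described above.
CLUSTER=xpk-test
TB_NAME="${TENSORBOARD_NAME:-${CLUSTER}-tb-instance}"   # default: <cluster>-tb-instance
TB_REGION="${TENSORBOARD_REGION:-us-central1}"          # default region: us-central1
echo "$TB_NAME in $TB_REGION"   # xpk-test-tb-instance in us-central1
```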

Cluster Delete

Cluster Cacheimage

Workload Create

Set max-restarts for production jobs

Workload Priority and Preemption

Create Vertex AI Experiment to upload data to Vertex AI Tensorboard

Note: This feature is available in XPK >= 0.4.0. Enable the Vertex AI API in your Google Cloud console to use this feature, and make sure the Vertex AI Administrator role is assigned both to your user account and to the Compute Engine service account attached to the node pools in the cluster.

Vertex AI Experiment is a tool that helps you track and analyze experiment runs on Vertex AI Tensorboard. To learn more about Vertex AI Experiments, visit this.

XPK creates a Vertex AI Experiment in the workload create command and attaches the Vertex AI Tensorboard created for the cluster during cluster create. If a cluster was created before this feature was released, it has no Vertex AI Tensorboard, and workload create will fail. Re-run cluster create to create a Vertex AI Tensorboard, then run workload create again to schedule your workload.

python3 xpk.py workload create \
--cluster xpk-test --workload xpk-workload \
--use-vertex-tensorboard

will create a Vertex AI Experiment with the name xpk-test-xpk-workload (the cluster name and workload name joined with a hyphen).

python3 xpk.py workload create \
--cluster xpk-test --workload xpk-workload \
--use-vertex-tensorboard --experiment-name=test-experiment

will create a Vertex AI Experiment with the name test-experiment.

Check out MaxText example on how to update your workload to automatically upload logs collected in your Tensorboard directory to the Vertex AI Experiment created by workload create.
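
The default Experiment naming above can be sketched as follows. This is illustrative only; EXPERIMENT_NAME stands in for the --experiment-name flag:

```shell
# Illustrative sketch of the Vertex AI Experiment naming described above.
CLUSTER=xpk-test
WORKLOAD=xpk-workload
EXPERIMENT="${EXPERIMENT_NAME:-${CLUSTER}-${WORKLOAD}}"   # default: <cluster>-<workload>
echo "$EXPERIMENT"   # xpk-test-xpk-workload
```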

Workload Delete

Workload List

Inspector

GPU usage

To use XPK for GPUs, pass the --device-type flag.

CPU usage

To use XPK for CPUs, pass the --device-type flag.

Autoprovisioning with XPK

XPK can dynamically allocate cluster capacity using Node Auto Provisioning (NAP) support.

This allows several topology sizes to be supported from one XPK cluster, provisioned dynamically based on incoming workload requests, so XPK users do not need to re-provision the cluster manually.

Enabling autoprovisioning initially takes up to around 30 minutes to upgrade the cluster.

Create a cluster with autoprovisioning:

Autoprovisioning will be enabled on the cluster below, scaling between [0, 8] v4 TPU chips (up to 1x v4-16).

XPK doesn't currently support different generations of accelerators in the same cluster (like v4 and v5p TPUs).

CLUSTER_NAME=my_cluster
NUM_SLICES=2
DEVICE_TYPE=v4-8
RESERVATION=reservation_id
PROJECT=my_project
ZONE=us-east5-b

python3 xpk.py cluster create \
  --cluster $CLUSTER_NAME \
  --num-slices=$NUM_SLICES \
  --device-type=$DEVICE_TYPE \
  --zone=$ZONE \
  --project=$PROJECT \
  --reservation=$RESERVATION \
  --enable-autoprovisioning
  1. Define the starting accelerator configuration and capacity type.

    --device-type=$DEVICE_TYPE \
    --num-slices=$NUM_SLICES
  2. Optionally set custom minimum / maximum chips. NAP will rescale the cluster between the minimum and maximum number of chips. By default, the maximum is the current cluster configuration size and the minimum is 0, which allows NAP to rescale using all of the resources.

    --autoprovisioning-min-chips=$MIN_CHIPS \
    --autoprovisioning-max-chips=$MAX_CHIPS
  3. FEATURE TO COME SOON: Set the timeout period for when node pools will automatically be deleted if no incoming workloads are run. This is 10 minutes currently.

  4. FEATURE TO COME: Set the timeout period to infinity. This will keep the idle node pool configuration always running until updated by new workloads.
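
The chip counts used in these examples (the [0, 8] range for 2x v4-8, or a 16-chip maximum for 2x v4-16) follow from TPU v4 naming: the device-type suffix counts TensorCores, and each v4 chip has two TensorCores. A minimal sketch of that arithmetic, for illustration only:

```shell
# Sketch of the v4 chip arithmetic behind the autoprovisioning examples.
NUM_SLICES=2
DEVICE_TYPE=v4-8
CORES_PER_SLICE=${DEVICE_TYPE#v4-}              # v4-8 -> 8 TensorCores per slice
CHIPS=$(( NUM_SLICES * CORES_PER_SLICE / 2 ))   # 2 slices x 4 chips per slice
echo "$CHIPS"   # 8 -> matches the [0, 8] chip range above
```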

Update a cluster with autoprovisioning:

CLUSTER_NAME=my_cluster
NUM_SLICES=2
DEVICE_TYPE=v4-8
RESERVATION=reservation_id
PROJECT=my_project
ZONE=us-east5-b

python3 xpk.py cluster create \
  --cluster $CLUSTER_NAME \
  --num-slices=$NUM_SLICES \
  --device-type=$DEVICE_TYPE \
  --zone=$ZONE \
  --project=$PROJECT \
  --reservation=$RESERVATION \
  --enable-autoprovisioning

Update a previously autoprovisioned cluster with a different number of chips:

CLUSTER_NAME=my_cluster
NUM_SLICES=2
DEVICE_TYPE=v4-16
RESERVATION=reservation_id
PROJECT=my_project
ZONE=us-east5-b

# This will create 2x v4-16 node pools and set the max autoprovisioned chips to 16.
python3 xpk.py cluster create \
  --cluster $CLUSTER_NAME \
  --num-slices=$NUM_SLICES \
  --device-type=$DEVICE_TYPE \
  --zone=$ZONE \
  --project=$PROJECT \
  --reservation=$RESERVATION \
  --enable-autoprovisioning

This will clear the node pools if they exist in the cluster and set the max autoprovisioned chips to 16:

python3 xpk.py cluster create \
  --cluster $CLUSTER_NAME \
  --num-slices=$NUM_SLICES \
  --device-type=$DEVICE_TYPE \
  --zone=$ZONE \
  --project=$PROJECT \
  --reservation=$RESERVATION \
  --enable-autoprovisioning \
  --autoprovisioning-max-chips 16


Run workloads on the cluster with autoprovisioning:

Reconfigure the --device-type and --num-slices.
  CLUSTER_NAME=my_cluster
  NUM_SLICES=2
  DEVICE_TYPE=v4-8
  NEW_RESERVATION=new_reservation_id
  PROJECT=my_project
  ZONE=us-east5-b
  # Create a 2x v4-8 TPU workload.
  python3 xpk.py workload create \
    --cluster $CLUSTER_NAME \
    --workload ${USER}-nap-${NUM_SLICES}x${DEVICE_TYPE}_$(date +%H-%M-%S) \
    --command "echo hello world from $NUM_SLICES $DEVICE_TYPE" \
    --device-type=$DEVICE_TYPE \
    --num-slices=$NUM_SLICES \
    --zone=$ZONE \
    --project=$PROJECT

  NUM_SLICES=1
  DEVICE_TYPE=v4-16

  # Create a 1x v4-16 TPU workload.
  python3 xpk.py workload create \
    --cluster $CLUSTER_NAME \
    --workload ${USER}-nap-${NUM_SLICES}x${DEVICE_TYPE}_$(date +%H-%M-%S) \
    --command "echo hello world from $NUM_SLICES $DEVICE_TYPE" \
    --device-type=$DEVICE_TYPE \
    --num-slices=$NUM_SLICES \
    --zone=$ZONE \
    --project=$PROJECT

  # Use a different reservation from what the cluster was created with.
  python3 xpk.py workload create \
    --cluster $CLUSTER_NAME \
    --workload ${USER}-nap-${NUM_SLICES}x${DEVICE_TYPE}_$(date +%H-%M-%S) \
    --command "echo hello world from $NUM_SLICES $DEVICE_TYPE" \
    --device-type=$DEVICE_TYPE \
    --num-slices=$NUM_SLICES \
    --zone=$ZONE \
    --project=$PROJECT \
    --reservation=$NEW_RESERVATION
  1. (Optional) Define the capacity type. By default, the capacity type matches what the cluster was created with.

    --reservation=my-reservation-id | --on-demand | --spot
  2. Set the topology of your workload using --device-type.

    NUM_SLICES=1
    DEVICE_TYPE=v4-8
    --device-type=$DEVICE_TYPE \
    --num-slices=$NUM_SLICES \
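
The workload names in the examples above embed the user, slice count, device type, and a timestamp so that repeated submissions get unique names. A standalone sketch of that pattern (the user value here is a placeholder):

```shell
# Sketch of the workload-name pattern used in the examples above.
USER=alice        # placeholder user
NUM_SLICES=2
DEVICE_TYPE=v4-8
WORKLOAD="${USER}-nap-${NUM_SLICES}x${DEVICE_TYPE}_$(date +%H-%M-%S)"
echo "$WORKLOAD"  # e.g. alice-nap-2xv4-8_14-30-05
```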

How to add docker images to an xpk workload

By default, xpk workload create layers the local directory (--script-dir) into the base docker image (--base-docker-image) and runs the workload command. If you don't want this layering behavior, use --docker-image directly. Do not mix arguments from the two flows in the same command.

Recommended / Default Docker Flow: --base-docker-image and --script-dir

This flow pulls the --script-dir into the --base-docker-image and runs the new docker image.

Optional Direct Docker Image Configuration: --docker-image

If you want to set the docker image directly rather than layer in the current working directory, set --docker-image to the image to be used in the workload.
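
To contrast the two flows, the commands below are sketched as strings rather than run; the cluster, workload, image names, and training script are all placeholders, not xpk defaults:

```shell
# Illustrative only: contrasts the two docker flows described above.

# Default flow: layer the local --script-dir into --base-docker-image.
LAYERED="python3 xpk.py workload create --cluster xpk-test --workload my-job --base-docker-image my-base-image --script-dir . --command 'bash train.sh'"

# Direct flow: run a prebuilt image as-is; no layering of the local directory.
DIRECT="python3 xpk.py workload create --cluster xpk-test --workload my-job --docker-image gcr.io/my-project/my-image:latest --command 'bash train.sh'"

echo "$LAYERED"
echo "$DIRECT"
```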

More advanced facts:

Integration Test Workflows

The repository code is tested through GitHub Workflows and Actions. Currently three kinds of tests are performed:

More information is documented here

Troubleshooting

Invalid machine type for CPUs.

XPK will create a regional GKE cluster. If you see issues like

Invalid machine type e2-standard-32 in zone $ZONE_NAME

Please select a CPU type that exists in all zones in the region.

# Find CPU Types supported in zones.
gcloud compute machine-types list --zones=$ZONE_LIST
# Adjust default cpu machine type.
python3 xpk.py cluster create --default-pool-cpu-machine-type=CPU_TYPE ...

Permission Issues: requires one of ["permission_name"] permission(s).

1) Determine the role needed based on the permission error:

```shell
# For example: `requires one of ["container.*"] permission(s)`
# Add [Kubernetes Engine Admin](https://cloud.google.com/iam/docs/understanding-roles#kubernetes-engine-roles) to your user.
```

2) Add the role to the user in your project.

Go to [iam-admin](https://console.cloud.google.com/iam-admin/) or use gcloud cli:
```shell
PROJECT_ID=my-project-id
CURRENT_GKE_USER=$(gcloud config get account)
ROLE=roles/container.admin  # container.admin is the role needed for Kubernetes Engine Admin
gcloud projects add-iam-policy-binding $PROJECT_ID --member user:$CURRENT_GKE_USER --role=$ROLE
```

3) Check the permissions are correct for the users.

Go to [iam-admin](https://console.cloud.google.com/iam-admin/) or use gcloud cli:

```shell
PROJECT_ID=my-project-id
CURRENT_GKE_USER=$(gcloud config get account)
gcloud projects get-iam-policy $PROJECT_ID --filter="bindings.members:$CURRENT_GKE_USER" --flatten="bindings[].members"
```

4) Confirm you have logged in locally with the correct user.

```shell
gcloud auth login
```

Roles needed based on permission errors:

Reservation Troubleshooting:

How to determine your reservation and its size / utilization:

PROJECT_ID=my-project
ZONE=us-east5-b
RESERVATION=my-reservation-name
# Find the reservations in your project
gcloud beta compute reservations list --project=$PROJECT_ID
# Find the tpu machine type and current utilization of a reservation.
gcloud beta compute reservations describe $RESERVATION --project=$PROJECT_ID --zone=$ZONE

TPU Workload Debugging

Verbose Logging

If you are having trouble with your workload, try setting the --enable-debug-logs flag when you schedule it. This gives you more detailed logs to help pinpoint the issue. For example:

python3 xpk.py workload create \
--cluster xpk-test --workload xpk-test-workload \
--command="echo hello world" --enable-debug-logs

Please check libtpu logging and Tensorflow logging for more information about the flags that are enabled to get the logs.

Collect Stack Traces

The cloud-tpu-diagnostics PyPI package can be used to generate stack traces for workloads running in GKE. This package dumps the Python traces when a fault such as a segmentation fault, floating-point exception, or illegal operation exception occurs in the program. Additionally, it periodically collects stack traces to help you debug situations when the program is unresponsive. Make the following changes in the docker image running in the Kubernetes main container to enable periodic stack trace collection.

# main.py

from cloud_tpu_diagnostics import diagnostic
from cloud_tpu_diagnostics.configuration import debug_configuration
from cloud_tpu_diagnostics.configuration import diagnostic_configuration
from cloud_tpu_diagnostics.configuration import stack_trace_configuration

stack_trace_config = stack_trace_configuration.StackTraceConfig(
                      collect_stack_trace = True,
                      stack_trace_to_cloud = True)
debug_config = debug_configuration.DebugConfig(
                stack_trace_config = stack_trace_config)
diagnostic_config = diagnostic_configuration.DiagnosticConfig(
                      debug_config = debug_config)

with diagnostic.diagnose(diagnostic_config):
    main_method()  # this is the main method to run

This configuration will start collecting stack traces inside the /tmp/debugging directory on each Kubernetes Pod.

Explore Stack Traces

To explore the stack traces collected in a temporary directory in the Kubernetes Pod, run the following command to configure a sidecar container that reads the traces from the /tmp/debugging directory.

python3 xpk.py workload create \
  --workload xpk-test-workload --command "python3 main.py" \
  --cluster xpk-test --tpu-type=v5litepod-16 --deploy-stacktrace-sidecar

Other advanced usage

Use a Jupyter notebook to interact with a Cloud TPU cluster