ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] Represent TPUs as a Native Resource #38085

Open allenwang28 opened 1 year ago

allenwang28 commented 1 year ago

Description

Using accelerators.md as a reference point, we can graduate TPUs from custom resources (i.e. marked as --resources={"TPU": 1}) to native resources.

To teach Ray how to detect TPUs within TPU VMs, we can check whether the TPU drivers exist by listing /dev/accel*, e.g.:

$ ls /dev/accel*
/dev/accel0  /dev/accel1  /dev/accel2  /dev/accel3

or in Python:

import glob
num_tpus = len(glob.glob("/dev/accel*"))
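
Until that detection is wired into Ray itself, a node could use this count to register the custom resource when it brings up Ray. A minimal sketch of the existing custom-resource path (ray.init accepts a resources dict when starting a new local node):

import glob
import ray

# Count the TPU chips exposed by the driver and register them as a custom
# resource on this node; native detection would do the equivalent automatically.
num_tpus = len(glob.glob("/dev/accel*"))
if num_tpus:
  ray.init(resources={"TPU": num_tpus})
else:
  ray.init()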

We can also get the accelerator version like this:

import requests

def get_metadata(key):
  return requests.get(
      'http://metadata.google.internal/computeMetadata'
      '/v1/instance/attributes/{}'.format(key),
      headers={
          'Metadata-Flavor': 'Google'
      }).text

accelerator_type = get_metadata('accelerator-type')  # e.g. returns "v2-8"
accelerator_version = accelerator_type.split("-")[0]  # e.g. we know it's v2

One caveat: this approach relies on metadata attached to the TPU VM instance, which will not work on GKE/KubeRay.
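
One possible fallback, sketched below, is to read the accelerator type from an environment variable before querying the metadata server. The variable name TPU_ACCELERATOR_TYPE is an assumption here (whatever GKE/KubeRay ends up injecting into the pod spec), not something GKE or Ray is guaranteed to provide:

import os
import requests

GCE_METADATA_URL = ('http://metadata.google.internal/computeMetadata'
                    '/v1/instance/attributes/accelerator-type')

def get_accelerator_type():
  # Assumed env var: something the GKE/KubeRay pod spec would need to inject;
  # the name TPU_ACCELERATOR_TYPE is hypothetical.
  env_value = os.environ.get("TPU_ACCELERATOR_TYPE")
  if env_value:
    return env_value
  # Fall back to the GCE metadata server, which works on plain TPU VMs.
  try:
    resp = requests.get(GCE_METADATA_URL, headers={"Metadata-Flavor": "Google"})
    resp.raise_for_status()
    return resp.text  # e.g. "v2-8"
  except requests.RequestException:
    return None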

Some inconsistencies with GPUs that can be tricky:

Use case

Instead of

ray start --resources='{"TPU": 1}'

and

@ray.remote(resources={"TPU": 1})
def function():
  ...

we can instead do

ray start --num_tpus=4

and

@ray.remote(num_tpus=4)
def function():
  ...
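
In the meantime, the proposed num_tpus argument can be approximated with a thin wrapper that translates it into the custom-resource spelling above. This is only an illustration of the intended mapping (tpu_remote is hypothetical, not a Ray API):

import ray

def tpu_remote(num_tpus, **options):
  # Hypothetical convenience wrapper: map the proposed num_tpus argument onto
  # the existing custom resource, resources={"TPU": num_tpus}.
  return ray.remote(resources={"TPU": num_tpus}, **options)

@tpu_remote(num_tpus=4)
def function():
  ...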
allenwang28 commented 1 year ago

I'm happy to take this one, I mostly intended to open this as a way to track this request!

xieus commented 1 year ago

@allenwang28 Thank you for the contribution. Let us know how it goes.

cc: @richardliaw

cadedaniel commented 1 year ago

This fits into recent XLA work @scv119 has been reviewing.