ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[core] autoscaler - TPU is giving the special "TPU-{type}-head" resource to workers, not just the head. #47769

Open dlwh opened 4 hours ago

dlwh commented 4 hours ago

What happened + What you expected to happen

Somehow, all TPU slice workers are getting the special `TPU-{type}-head` resource, even though it should only go to the actual head worker of each slice. I'm very confused, since the code seems pretty clear, but nevertheless you can see that we have two TPU slices yet 8 TPU heads (one per worker):

[screenshot: cluster resource summary showing two v4-32 slices but 8 TPU-v4-32-head resources]
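
For anyone checking the same thing, a quick way to confirm the count is to query the aggregate cluster resources from a driver attached to the cluster. This is a sketch, not part of the original report, and it assumes the head resource is named TPU-v4-32-head based on the acceleratorType: v4-32 used in the config below.

# Sketch (not from the report): count the per-slice head resource.
import ray

ray.init(address="auto")  # attach to the running cluster

total = ray.cluster_resources().get("TPU-v4-32-head", 0)
print("TPU-v4-32-head resources:", total)  # expected 2 (one per slice), observed 8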

Versions / Dependencies

Tested against Ray 2.34 and 2.36

Reproduction script

This cluster YAML reproduces the issue:

cluster_name: my-cluster

# Configure GCP
provider:
  type: gcp
  region: XXXX
  availability_zone: XXXX
  project_id:  XXXX

# Maximum Workers (excluding Head Node)
max_workers: 1024
upscaling_speed: 4.0  # for bursty workloads

# List of Available Node Types
available_node_types:
  # Head Node =>> On-Demand, sets Min/Max Workers = 0 (Prevent Scheduling Tasks on Head Node)
  head_default:
    min_workers: 0
    max_workers: 0
    resources: {"CPU": 32}

    # GCP-Specific Configuration; by default, Ray will configure unspecified fields (e.g., subnets, ssh-keys)
    #   => Ref: https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
    node_config:
      machineType: n2-standard-8

      # Create a Persistent Disk w/ 200 GBs
      disks:
        - boot: true
          autoDelete: true
          type: PERSISTENT
          initializeParams:
            diskSizeGb: 200

            # Set Source Image =>> Ubuntu 22.04 Base VM
            sourceImage: projects/ubuntu-os-cloud/global/images/family/ubuntu-2204-lts
#            sourceImage: projects/deeplearning-platform-release/global/images/family/common-cpu

  # Worker Nodes =>> Preemptible TPU v4-32 Slices
  tpu_slice_v4_32:
    min_workers: 2
    max_workers: 1024
    resources: { "CPU": 120, "TPU": 4 }

    node_config:
      acceleratorType: v4-32
      runtimeVersion: tpu-ubuntu2204-base

      # [IMPORTANT] Configure all TPU Workers to be Preemptible!
      schedulingConfig:
        preemptible: true

docker:
    image: "ghcr.io/stanford-crfm/levanter-cluster:extra-resource"
    container_name: "ray_docker"
    pull_before_run: true
    worker_run_options:
        - --privileged
        - --ulimit memlock=-1:-1
        - --shm-size=32gb
        - -v "/tmp:/tmp"
        # this lets the worker run docker commands and have them run as sibling containers
        - -v "/var/run/docker.sock:/var/run/docker.sock"

initialization_commands:
  - yes | gcloud auth configure-docker us-central2-docker.pkg.dev
  - which docker || (curl -fsSL https://get.docker.com -o get-docker.sh; sudo sh get-docker.sh; sudo usermod -aG docker $USER; sudo systemctl restart docker -f)
  # always run this because ray doesn't run with sudo
  - sudo usermod -aG docker $USER
  # we want to launch docker containers from inside docker, which means we need to loosen the permissions on the docker
  # socket. This isn't the best security practice, but it's the easiest way to get this working.
  - sudo chmod 666 /var/run/docker.sock

head_setup_commands:
  - mkdir $HOME/.cache/huggingface -p
  - gcloud secrets versions access latest --secret=HF_TOKEN > $HOME/.cache/huggingface/token || true

worker_setup_commands:
  - mkdir $HOME/.cache/huggingface -p
  - gcloud secrets versions access latest --secret=HF_TOKEN > $HOME/.cache/huggingface/token || true

# Set Head Node == `head_default`
head_node_type: head_default
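
After bringing the cluster up with ray up against this file, the per-node view makes the problem easier to see than the aggregate count. The snippet below is a sketch (not part of the report) that lists every node advertising the head resource, again assuming the resource name TPU-v4-32-head.

# Sketch: list every alive node that advertises the per-slice head resource.
import ray

ray.init(address="auto")
for node in ray.nodes():  # one entry per node known to the GCS
    if node["Alive"] and "TPU-v4-32-head" in node["Resources"]:
        print(node["NodeManagerAddress"], node["Resources"]["TPU-v4-32-head"])

# With two v4-32 slices (four hosts each), only two lines should print;
# instead every TPU worker shows up, eight lines in total.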

Issue Severity

Medium: It is a significant difficulty but I can work around it.

dlwh commented 4 hours ago

cc @allenwang28

dlwh commented 4 hours ago

OK, this has something to do with Docker, but I don't know what. Without Docker this is fine. I'm guessing it's an environment variable, but I'm not sure.

dlwh commented 2 hours ago

OK, so the issue is that TPUVMDockerCommandRunner only overrides the ssh_command_runner to suppress the excess `TPU-{type}-head` resource, but that resource sneaks in via the docker run command's explicit `_with_environment_variables` call, which bypasses the env handling in the SSH runner.
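
To make the failure mode concrete, here is a deliberately simplified sketch of the pattern described above. The class, attribute, and env-var names are illustrative only and do not match the real Ray autoscaler source; the point is that a runner which filters environment variables only in the SSH path never sees the variables that the docker path bakes directly into the docker run command line.

# Illustrative sketch only -- names do not match the real Ray autoscaler code.

def _with_environment_variables(cmd, env):
    """Prefix a shell command with export statements (simplified)."""
    return "".join(f"export {k}={v}; " for k, v in env.items()) + cmd

class SSHCommandRunner:
    def run(self, cmd, environment_variables=None):
        if environment_variables:
            cmd = _with_environment_variables(cmd, environment_variables)
        print("[ssh]", cmd)

class FilteringSSHCommandRunner(SSHCommandRunner):
    """What the TPU docker runner swaps in: strip the extra head resource."""
    def run(self, cmd, environment_variables=None):
        env = dict(environment_variables or {})
        env.pop("EXTRA_HEAD_RESOURCE", None)  # hypothetical env-var name
        super().run(cmd, env)

class DockerCommandRunner:
    def __init__(self):
        self.ssh_command_runner = SSHCommandRunner()

    def run(self, cmd, environment_variables=None):
        # The docker path folds the env vars into the docker command string
        # itself, so the SSH runner receives nothing left to filter.
        docker_cmd = "docker run ray-image /bin/bash -c '{}'".format(
            _with_environment_variables(cmd, environment_variables or {}))
        self.ssh_command_runner.run(docker_cmd)

class TPUVMDockerCommandRunner(DockerCommandRunner):
    def __init__(self):
        super().__init__()
        # Only the SSH runner is replaced; the env injection above is
        # untouched, so the head resource still reaches every worker.
        self.ssh_command_runner = FilteringSSHCommandRunner()

TPUVMDockerCommandRunner().run("ray start", {"EXTRA_HEAD_RESOURCE": "1"})
# prints: [ssh] docker run ray-image /bin/bash -c 'export EXTRA_HEAD_RESOURCE=1; ray start'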