ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Autoscaler] The autoscaler could not find a node type to satisfy the request #27910

Closed peterghaddad closed 2 years ago

peterghaddad commented 2 years ago

What happened + What you expected to happen

I am running the PPO Trainer with num_workers set to 8. It seems when I first launch an experiment after creating a new Ray Cluster, then everything gets scheduled and there are no errors with the autoscaler. After the experiment completes and I re-submit the Job, then I get the following error.

The autoscaler could not find a node type to satisfy the request: [{"CPU": 0}, {"CPU": 0}, {"CPU": 0}, {"CPU": 0}, {"CPU": 0}, {"CPU": 0}, {"CPU": 0}, {"CPU": 0}, {"CPU": 0}, {"CPU": 0}, {"CPU": 0}]

My CPU worker has the following resources: a request of 8 CPUs and a limit of 16 CPUs, with sufficient memory.

I was able to bypass this error by creating a new worker node called cpu-worker-small that has 1 CPU and 10Gi of memory; however, the actors then fail unexpectedly (probably due to resource constraints).

I saw some similar issues such as https://github.com/ray-project/ray/issues/12441, but it seems that after the first experiment runs there are no resource demands in the autoscaler. I am using the default settings from KubeRay.

After running the first experiment I checked the following autoscaler state, which suggests that this issue https://github.com/ray-project/ray/issues/24259 made it into Ray 1.12.1:

{'18ba94e4efd31538df7740fec64192e7': {'bundles': {0: {'CPU': 1.0},
                                                  1: {'CPU': 8.0},
                                                  2: {'CPU': 8.0},
                                                  3: {'CPU': 8.0},
                                                  4: {'CPU': 8.0},
                                                  5: {'CPU': 8.0},
                                                  6: {'CPU': 8.0},
                                                  7: {'CPU': 8.0},
                                                  8: {'CPU': 8.0}},
                                      'name': '__tune_5ba4d6b7__981c2416',
                                      'placement_group_id': '18ba94e4efd31538df7740fec64192e7',
                                      'state': 'REMOVED',
                                      'stats': {'end_to_end_creation_latency_ms': 0.0,
                                                'highest_retry_delay_ms': 1000.0,
                                                'scheduling_attempt': 171,
                                                'scheduling_latency_ms': 0.0,
                                                'scheduling_state': 'REMOVED'},
                                      'strategy': 'PACK'},
 '18bb3aa16ae8d9e9edd5da23ead26838': {'bundles': {0: {'CPU': 1.0},
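
(A table like the one above can be dumped with ray.util.placement_group_table(); a minimal sketch, assuming it is run from a driver attached to the cluster:)

import pprint
import ray
from ray.util.placement_group import placement_group_table

ray.init(address="auto")  # attach to the running cluster, e.g. from the head pod
pprint.pprint(placement_group_table())  # one entry per placement group, including REMOVED ones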

Versions / Dependencies

Ray 2.0, KubeRay (latest version) on a Kubernetes cluster.

Reproduction script

PPO with the following configuration: {"num_gpus": 1, "num_workers": 2, "num_sgd_iter": 60, "train_batch_size": 12000}
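
A minimal sketch of how this configuration might be launched with Ray Tune; the environment and stopping criterion below are assumptions, since the snippet above does not specify them:

from ray import tune

analysis = tune.run(
    "PPO",
    stop={"training_iteration": 100},   # illustrative stopping criterion
    config={
        "env": "CartPole-v1",           # assumed environment, not given in the config above
        "num_gpus": 1,
        "num_workers": 2,
        "num_sgd_iter": 60,
        "train_batch_size": 12000,
    },
)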

Issue Severity

High: It blocks me from completing my task.

DmitriGekhtman commented 2 years ago

@krfricke do you think it could be connected to https://github.com/ray-project/ray/issues/24259?

peterghaddad commented 2 years ago

Hi @DmitriGekhtman, checking in from our Slack thread: I'm still running into this problem when running jobs. Any thoughts?

DmitriGekhtman commented 2 years ago

I know this is a bit of work, but could you post full reproduction steps?

That would include the exact RayCluster config used and the scripts executed (perhaps with sensitive details redacted). It's ideal to have a reproduction of the issue that is as simple as possible.

peterghaddad commented 2 years ago

Hi @DmitriGekhtman, below are the RayCluster config and a reproduction script.

# This is adapted from https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.complete.yaml
# It is a general RayCluster that has most fields in it for maximum flexibility in the Ray/Kuberay integration MVP.
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  labels:
    controller-tools.k8s.io: "1.0"
    # A unique identifier for the head node and workers of this cluster.
  name: raycluster-complete
spec:
  rayVersion: "2.0"
  # With enableInTreeAutoscaling: true, the operator will insert an autoscaler sidecar container into the Ray head pod.
  enableInTreeAutoscaling: true
  ######################headGroupSpecs#################################
  # head group template and specs, (perhaps 'group' is not needed in the name)
  headGroupSpec:
    # Kubernetes Service Type, valid values are 'ClusterIP', 'NodePort' and 'LoadBalancer'
    serviceType: ClusterIP
    # the pod replicas in this group typed head (assuming there could be more than 1 in the future)
    replicas: 1
    # logical group name, for this called head-group, also can be functional
    # pod type head or worker
    # rayNodeType: head # Not needed since it is under the headgroup
    # the following params are used to complete the ray start: ray start --head --block --port=6379 ...
    rayStartParams:
      # Flag "no-monitor" must be set when running the autoscaler in
      # a sidecar container.
      no-monitor: "true"
      port: "6379"
      object-manager-port: "9999"
      node-manager-port: "9998"
      object-store-memory: "100000000"
      dashboard-host: "0.0.0.0"
      node-ip-address: $MY_POD_IP # auto-completed as the head pod IP
      block: "true"
      # Use `resources` to optionally specify custom resource annotations for the Ray node.
      # The value of `resources` is a string-integer mapping.
      # Currently, `resources` must be provided in the unfortunate format demonstrated below.
      # Moreover, "CPU" and "GPU" should NOT be included in the `resources` arg.
      # (Use `num-cpus` and `num-gpus` rayStartParams instead.)
      # resources: '"{\"Custom1\": 1, \"Custom2\": 5}"'
    #pod template
    template:
      metadata:
        labels:
          # custom labels. NOTE: do not define custom labels start with `raycluster.`, they may be used in controller.
          # Refer to https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/
          rayCluster: raycluster # will be injected if missing
          rayNodeType: head # will be injected if missing, must be head or worker
          groupName: headgroup # will be injected if missing
      spec:
        containers:
          # The Ray head pod
          - name: ray-head
            # securityContext:
            #   runAsUser: 0
            # All Ray pods in the RayCluster should use the same version of Ray.
            image: ray:2.0.0-py39
            imagePullPolicy: Always
            # The KubeRay operator uses the ports specified on the ray-head container
            # to configure a service targeting the ports.
            # The name of the service is <ray cluster name>-head-svc.
            ports:
              - containerPort: 6379
                name: gcs
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
            env:
              - name: CPU_REQUEST
                valueFrom:
                  resourceFieldRef:
                    containerName: ray-head
                    resource: requests.cpu
              - name: CPU_LIMITS
                valueFrom:
                  resourceFieldRef:
                    containerName: ray-head
                    resource: limits.cpu
              - name: MEMORY_LIMITS
                valueFrom:
                  resourceFieldRef:
                    containerName: ray-head
                    resource: limits.memory
              - name: MEMORY_REQUESTS
                valueFrom:
                  resourceFieldRef:
                    containerName: ray-head
                    resource: requests.memory
              - name: MY_POD_IP
                valueFrom:
                  fieldRef:
                    fieldPath: status.podIP
            lifecycle:
              preStop:
                exec:
                  command: ["/bin/sh", "-c", "ray stop"]
            resources:
              limits:
                cpu: "20"
                memory: "20Gi"
              requests:
                cpu: "10"
                memory: "20Gi"
  workerGroupSpecs:
    # the pod replicas in this group typed worker
    - replicas: 1
      minReplicas: 1
      maxReplicas: 40
      groupName: cpu-group
      rayStartParams:
        node-ip-address: $MY_POD_IP
        block: "true"
      #pod template
      template:
        spec:
          initContainers:
            # the env var $RAY_IP is set by the operator if missing, with the value of the head service name
            - name: init-myservice
              image: busybox:1.28
              command:
                [
                  "sh",
                  "-c",
                  "until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done",
                ]
          containers:
            - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
              # All Ray pods in the RayCluster should use the same version of Ray.
              image: ray:2.0.0-py39
              env:
                - name: RAY_DISABLE_DOCKER_CPU_WARNING
                  value: "1"
                - name: TYPE
                  value: "worker"
                - name: CPU_REQUEST
                  valueFrom:
                    resourceFieldRef:
                      containerName: ray-worker
                      resource: requests.cpu
                - name: CPU_LIMITS
                  valueFrom:
                    resourceFieldRef:
                      containerName: ray-worker
                      resource: limits.cpu
                - name: MEMORY_LIMITS
                  valueFrom:
                    resourceFieldRef:
                      containerName: ray-worker
                      resource: limits.memory
                - name: MEMORY_REQUESTS
                  valueFrom:
                    resourceFieldRef:
                      containerName: ray-worker
                      resource: requests.memory
                - name: MY_POD_NAME
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.name
                - name: MY_POD_IP
                  valueFrom:
                    fieldRef:
                      fieldPath: status.podIP
              ports:
                - containerPort: 80
              lifecycle:
                preStop:
                  exec:
                    command: ["/bin/sh", "-c", "ray stop"]
              volumeMounts:
                - mountPath: /var/log
                  name: log-volume
              resources:
                limits:
                  cpu: "16"
                  memory: "28Gi"
                requests:
                  cpu: "1"
                  memory: "28Gi"
          volumes:
            - name: log-volume
              emptyDir: {}

And a simple script would be:

from ray import tune

analysis = tune.run(
    "PPO",
    stop={"episode_reward_mean": 100},
    config={
        "env": "CartPole-v1",
        "num_gpus": 0,
        "num_workers": 4,
        "lr": tune.grid_search([0.01, 0.001]),
    },
    local_dir="/storage/"
)
DmitriGekhtman commented 2 years ago

Just re-confirming -- Is the following correct:

  1. The issue does occur if you use Ray Job Submission to submit the script. (It may also happen with other submission methods.)
  2. The issue occurs only when you run the script the second time?
peterghaddad commented 2 years ago
  1. That is correct; I'm also seeing this when using the Ray Client.
  2. Only after a cluster is created. For example, if I launch a new cluster, it scales okay the first time -> the experiment runs and finishes. Then I re-submit the same experiment and it is unable to scale (after waiting for resources to scale down). This also happens for any number of subsequent submissions.
DmitriGekhtman commented 2 years ago

Got it, so the sequence of events is

  1. Create cluster
  2. Submit workload
  3. Let workload run to completion, wait for scale-down
  4. Submit workload again
  5. Autoscaler produces a quizzical error about a bundle of CPU:0 requests and refuses to scale up

I will try this out to see what's going on.
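
For concreteness, steps 2-4 can be exercised with Ray Job Submission along these lines (the head address and script name are placeholders):

from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://<head-svc>:8265")  # placeholder dashboard address

# Submit the Tune workload, let it run to completion and scale down, then submit it again.
job_id = client.submit_job(
    entrypoint="python tune_ppo.py",      # placeholder script name
    runtime_env={"working_dir": "."},
)
print(client.get_job_status(job_id))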

peterghaddad commented 2 years ago

Exactly. FYI, the workers become scheduled if I manually scale them up.

Here are the additional logs:

Demands:
 {'CPU': 1.0} * 17 (PACK): 6+ pending placement groups
2022-08-29 14:19:36,207 WARNING resource_demand_scheduler.py:839 -- The autoscaler could not find a node type to satisfy the request: [{'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 
peterghaddad commented 2 years ago

@DmitriGekhtman Looks like the issue is that when workers already exist in the cluster, the autoscaler gets confused. I triggered three jobs at the same time. The first one scheduled workers fine, but I received the error shown above after submitting the second and third jobs. I set max workers to 0 (to force-delete all workers), then increased it back up to 40. After increasing this number, it spawned the required resources for the first job but not the others.

For example:

Demands:
 {'CPU': 1.0}: 2+ pending tasks/actors (2+ using placement groups)
 {'CPU': 1.0} * 17 (PACK): 25+ pending placement groups
2022-08-29 14:49:02,751 WARNING resource_demand_scheduler.py:839 -- The autoscaler could not find a node type to satisfy the request: [{'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CP
DmitriGekhtman commented 2 years ago

Huh, well, the workers have 16 CPUs each, so I guess it might make sense that {'CPU': 1.0} * 17 (PACK) would have some trouble scheduling.

PACK would prioritize scheduling all 17 one-CPU requests on the same node.
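
For reference, the pending request corresponds roughly to a placement group like the following (a sketch, not the exact bundles Tune constructs internally):

import ray
from ray.util.placement_group import placement_group

ray.init(address="auto")

# 17 one-CPU bundles with the PACK strategy: Ray prefers to place all bundles on a
# single node and only spills to other nodes on a best-effort basis.
pg = placement_group([{"CPU": 1.0}] * 17, strategy="PACK")
ray.get(pg.ready())  # resolves once the scheduler (and autoscaler) can satisfy every bundle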

peterghaddad commented 2 years ago

@DmitriGekhtman That makes sense; however, I am seeing the exact same behavior when scaling each worker up.

I increased the CPU for each worker to 30 and the memory to 40Gi; if PACK schedules everything on the same machine, then there are definitely not enough resources on that specific node.

However, the docs mention it will schedule elsewhere if it is unable to schedule onto a single node? In addition to the Tune docs, here are some docs that are informative (also posting for others who read this thread):

All provided bundles are packed onto a single node on a best-effort basis. If strict packing is not feasible (i.e., some bundles do not fit on the node), bundles can be placed onto other nodes.

Also, below is when I scale up CPUs.

Demands:
 {'CPU': 1.0} * 9 (PACK): 10+ pending placement groups
2022-08-29 15:14:55,032 WARNING resource_demand_scheduler.py:839 -- The autoscaler could not find a node type to satisfy the request: [{'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0},
DmitriGekhtman commented 2 years ago

Yeah, it is true that PACK is a "soft" constraint... starting to poke around now.

DmitriGekhtman commented 2 years ago

Just for purposes of documenting my progress here: In the config above, I substituted the image rayproject/ray-ml:2.0.0-py38 and in the script, I removed the argument local_dir="/storage/".

DmitriGekhtman commented 2 years ago

Modified config:

# This is adapted from https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.complete.yaml
# It is a general RayCluster that has most fields in it for maximum flexibility in the Ray/Kuberay integration MVP.
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  labels:
    controller-tools.k8s.io: "1.0"
    # A unique identifier for the head node and workers of this cluster.
  name: raycluster-complete
spec:
  rayVersion: "2.0"
  # With enableInTreeAutoscaling: true, the operator will insert an autoscaler sidecar container into the Ray head pod.
  enableInTreeAutoscaling: true
  ######################headGroupSpecs#################################
  # head group template and specs, (perhaps 'group' is not needed in the name)
  headGroupSpec:
    # Kubernetes Service Type, valid values are 'ClusterIP', 'NodePort' and 'LoadBalancer'
    serviceType: ClusterIP
    # the pod replicas in this group typed head (assuming there could be more than 1 in the future)
    replicas: 1
    # logical group name, for this called head-group, also can be functional
    # pod type head or worker
    # rayNodeType: head # Not needed since it is under the headgroup
    # the following params are used to complete the ray start: ray start --head --block --port=6379 ...
    rayStartParams:
      # Flag "no-monitor" must be set when running the autoscaler in
      # a sidecar container.
      no-monitor: "true"
      port: "6379"
      object-manager-port: "9999"
      node-manager-port: "9998"
      object-store-memory: "100000000"
      dashboard-host: "0.0.0.0"
      node-ip-address: $MY_POD_IP # auto-completed as the head pod IP
      block: "true"
      # Use `resources` to optionally specify custom resource annotations for the Ray node.
      # The value of `resources` is a string-integer mapping.
      # Currently, `resources` must be provided in the unfortunate format demonstrated below.
      # Moreover, "CPU" and "GPU" should NOT be included in the `resources` arg.
      # (Use `num-cpus` and `num-gpus` rayStartParams instead.)
      # resources: '"{\"Custom1\": 1, \"Custom2\": 5}"'
    #pod template
    template:
      metadata:
        labels:
          # custom labels. NOTE: do not define custom labels start with `raycluster.`, they may be used in controller.
          # Refer to https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/
          rayCluster: raycluster # will be injected if missing
          rayNodeType: head # will be injected if missing, must be head or worker
          groupName: headgroup # will be injected if missing
      spec:
        containers:
          # The Ray head pod
          - name: ray-head
            # securityContext:
            #   runAsUser: 0
            # All Ray pods in the RayCluster should use the same version of Ray.
            image: rayproject/ray-ml:2.0.0-py38
            imagePullPolicy: Always
            # The KubeRay operator uses the ports specified on the ray-head container
            # to configure a service targeting the ports.
            # The name of the service is <ray cluster name>-head-svc.
            ports:
              - containerPort: 6379
                name: gcs
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
            env:
              - name: CPU_REQUEST
                valueFrom:
                  resourceFieldRef:
                    containerName: ray-head
                    resource: requests.cpu
              - name: CPU_LIMITS
                valueFrom:
                  resourceFieldRef:
                    containerName: ray-head
                    resource: limits.cpu
              - name: MEMORY_LIMITS
                valueFrom:
                  resourceFieldRef:
                    containerName: ray-head
                    resource: limits.memory
              - name: MEMORY_REQUESTS
                valueFrom:
                  resourceFieldRef:
                    containerName: ray-head
                    resource: requests.memory
              - name: MY_POD_IP
                valueFrom:
                  fieldRef:
                    fieldPath: status.podIP
            lifecycle:
              preStop:
                exec:
                  command: ["/bin/sh", "-c", "ray stop"]
            resources:
              limits:
                cpu: "20"
                memory: "20Gi"
              requests:
                cpu: "10"
                memory: "20Gi"
  workerGroupSpecs:
    # the pod replicas in this group typed worker
    - replicas: 1
      minReplicas: 1
      maxReplicas: 40
      groupName: cpu-group
      rayStartParams:
        node-ip-address: $MY_POD_IP
        block: "true"
      #pod template
      template:
        spec:
          initContainers:
            # the env var $RAY_IP is set by the operator if missing, with the value of the head service name
            - name: init-myservice
              image: busybox:1.28
              command:
                [
                  "sh",
                  "-c",
                  "until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done",
                ]
          containers:
            - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
              # All Ray pods in the RayCluster should use the same version of Ray.
              image: rayproject/ray-ml:2.0.0-py38
              env:
                - name: RAY_DISABLE_DOCKER_CPU_WARNING
                  value: "1"
                - name: TYPE
                  value: "worker"
                - name: CPU_REQUEST
                  valueFrom:
                    resourceFieldRef:
                      containerName: ray-worker
                      resource: requests.cpu
                - name: CPU_LIMITS
                  valueFrom:
                    resourceFieldRef:
                      containerName: ray-worker
                      resource: limits.cpu
                - name: MEMORY_LIMITS
                  valueFrom:
                    resourceFieldRef:
                      containerName: ray-worker
                      resource: limits.memory
                - name: MEMORY_REQUESTS
                  valueFrom:
                    resourceFieldRef:
                      containerName: ray-worker
                      resource: requests.memory
                - name: MY_POD_NAME
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.name
                - name: MY_POD_IP
                  valueFrom:
                    fieldRef:
                      fieldPath: status.podIP
              ports:
                - containerPort: 80
              lifecycle:
                preStop:
                  exec:
                    command: ["/bin/sh", "-c", "ray stop"]
              volumeMounts:
                - mountPath: /var/log
                  name: log-volume
              resources:
                limits:
                  cpu: "16"
                  memory: "28Gi"
                requests:
                  cpu: "1"
                  memory: "28Gi"
          volumes:
            - name: log-volume
              emptyDir: {}
DmitriGekhtman commented 2 years ago

Modified script:

from ray import tune

analysis = tune.run(
    "PPO",
    stop={"episode_reward_mean": 100},
    config={
        "env": "CartPole-v1",
        "num_gpus": 0,
        "num_workers": 4,
        "lr": tune.grid_search([0.01, 0.001]),
    },
)
DmitriGekhtman commented 2 years ago

Now, my issue is that there weren't enough tasks triggered to cause upscaling, and the job ran to completion successfully.

@peterghaddad could you suggest some parameters for the tune job that more closely resemble the ones for your workload?

Maybe more workers or more CPUs per worker? And/or more computationally intensive trials?

peterghaddad commented 2 years ago

Hi @DmitriGekhtman, sure thing!

import numpy as np
from ray import tune

analysis = tune.run(
    "PPO",
    stop={"episode_reward_mean": 200000},
    config={
        "env": "CartPole-v1",
        "num_gpus": 0,
        "num_workers": 8,
        "lr": tune.grid_search(list(np.arange(0.0, 0.999, 0.2)))
    },
)

I have submitted jobs back-to-back (2-3). Is there any other information which may be helpful?

peterghaddad commented 2 years ago

Hey @DmitriGekhtman, I triggered some additional experiments with minimal resources after submitting one that required larger resources (what I posted above).

Same error, but it's only trying to place 4 CPUs for the smaller experiment.

{'CPU': 1.0} * 4 (PACK): 1+ pending placement groups

Below is the code:

from ray.air.config import RunConfig, ScalingConfig
from ray.train.rl import RLTrainer

trainer = RLTrainer(
    run_config=RunConfig(stop={"training_iteration": 1}),
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
    algorithm="PPO",
    config={
        "env": "CartPole-v1",
        "framework": "tf",
        "evaluation_num_workers": 1,
        "evaluation_interval": 1,
        "evaluation_config": {"input": "sampler"},
    },
)

t = trainer.fit()

I also tried specifying placement_strategy="SPREAD" in the scaling_config, but it is registering as PACK, so I haven't been able to test whether SPREAD works.
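
For reference, the SPREAD variant looks roughly like this (a sketch, assuming ScalingConfig's placement_strategy parameter is passed through as documented):

from ray.air.config import RunConfig, ScalingConfig
from ray.train.rl import RLTrainer

trainer = RLTrainer(
    run_config=RunConfig(stop={"training_iteration": 1}),
    scaling_config=ScalingConfig(
        num_workers=2,
        use_gpu=False,
        placement_strategy="SPREAD",  # intended to spread bundles across nodes instead of packing them
    ),
    algorithm="PPO",
    config={"env": "CartPole-v1", "framework": "tf"},
)
t = trainer.fit()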

DmitriGekhtman commented 2 years ago

back-to-back

The submissions were within seconds of each other? Or back-to-back meaning you waited for one to complete before submitting another?

DmitriGekhtman commented 2 years ago

Also, could you share what your underlying K8s node infrastructure looks like?

DmitriGekhtman commented 2 years ago

I've just submitted 3 instances of the tuning job to a Ray cluster running on GKE autopilot. Will see what happens!

DmitriGekhtman commented 2 years ago

I'm seeing upscaling with {'CPU': 1.0} * 9 (PACK): 10+ pending placement groups -- I imagine those should be scheduled successfully, but we'll see.

DmitriGekhtman commented 2 years ago

Submitting three of these jobs in quick succession resulted in upscaling of 8 Ray worker pods. No errors yet -- will just let those jobs keep running for a while and then will try submitting again after they finish running and I see downscaling.

By the way, running ray status in the head pod could reveal useful autoscaling information. Autoscaler logs could also be useful to see -- these logs stream to the autoscaler container's stdout and to /tmp/ray/session_latest/logs/monitor.log in the container's file system.

DmitriGekhtman commented 2 years ago

I'm unfortunately not able to reproduce the issue yet -- if you could share the autoscaler's logs, that would be helpful.

peterghaddad commented 2 years ago

Hi @DmitriGekhtman, thank you for troubleshooting!

Back-to-back meaning I submit 2-3 jobs around the same time, about a minute apart. The first one scales, but then the second and third are unable to scale.

As an FYI, I gave my workers more CPU and memory, hence fewer nodes. Let me know if you would like me to revert and try again. It is now:

            resources:
              limits:
                cpu: '30'
                memory: 40Gi
              requests:
                cpu: '30'
                memory: 40Gi

I ran ray status on the head-node and this is what I am receiving:

======== Autoscaler status: 2022-08-30 20:38:48.731382 ========
Node status
---------------------------------------------------------------
Healthy:
 1 head-group
 3 cpu-group
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 91.0/110.0 CPU (91.0 used of 108.0 reserved in placement groups)
 0.0/32.0 GPU
 0.00/140.000 GiB memory
 0.00/36.004 GiB object_store_memory

Demands:
 {'CPU': 1.0} * 9 (PACK): 10+ pending placement groups. 2022-08-30 20:44:25,562 WARNING resource_demand_scheduler.py:839 -- The autoscaler could not find a node type to satisfy the request: [{'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}]. Please specify a node type with the necessary resources.

I'm kicking off another job which requires less CPUs now.

Below are the autoscaling logs for this job.

DmitriGekhtman commented 2 years ago

I wonder if it has to do with the presence of GPUs -- there is some logic in the autoscaler which tries to avoid scaling up GPU nodes for CPU tasks.

What if you set

rayStartParams:
    num-gpus: "0"

for the head and workers?

DmitriGekhtman commented 2 years ago

I have a very strong hunch that this is what happened:

After upscaling a worker the first time, the autoscaler detected from the running worker's load metrics that the worker has access to GPUs. After realizing that the worker node type has GPUs, the autoscaler tried to avoid scaling up the GPU nodes to schedule CPU tasks, which prevented all further upscaling.

In other words, it's likely an instance of this issue: https://github.com/ray-project/ray/issues/20476 whose priority I will now bump up.

I do expect that setting the override

rayStartParams:
    num-gpus: "0"

will resolve the issue in this case.
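
As a sanity check after applying the override, the workers should stop advertising GPUs to Ray; a minimal sketch, with the Ray Client address as a placeholder:

import ray

ray.init("ray://<head-svc>:10001")  # placeholder Ray Client address

# With num-gpus: "0" in rayStartParams for the head and workers, the cluster should report
# no GPU resources, so the autoscaler no longer treats the CPU workers as GPU nodes.
print(ray.cluster_resources().get("GPU", 0.0))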

peterghaddad commented 2 years ago

Would you look at that!

peterghaddad commented 2 years ago

Solved it! I appreciate all of the help.

DmitriGekhtman commented 2 years ago

Glad we got to the bottom of it!

Closing this as a duplicate of the underlying issue.