Closed peterghaddad closed 2 years ago
@krfricke do you think it could be connected to https://github.com/ray-project/ray/issues/24259?
Hi @DmitriGekhtman checking in from our Slack thread, still running into this problem when running jobs. Are there any thoughts?
I know this is a bit of work, but could you post full reproduction steps?
That would include the exact RayCluster config used and the scripts executed (perhaps with sensitive details redacted). It's ideal to have a reproduction of the issue that is as simple as possible.
Hi @DmitriGekhtman, below are the RayCluster config and a simple script.
# This is adapted from https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.complete.yaml
# It is a general RayCluster that has most fields in it for maximum flexibility in the Ray/Kuberay integration MVP.
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  labels:
    controller-tools.k8s.io: "1.0"
    # An unique identifier for the head node and workers of this cluster.
  name: raycluster-complete
spec:
  rayVersion: "2.0"
  # With enableInTreeAutoscaling: true, the operator will insert an autoscaler sidecar container into the Ray head pod.
  enableInTreeAutoscaling: true
  ######################headGroupSpecs#################################
  # head group template and specs, (perhaps 'group' is not needed in the name)
  headGroupSpec:
    # Kubernetes Service Type, valid values are 'ClusterIP', 'NodePort' and 'LoadBalancer'
    serviceType: ClusterIP
    # the pod replicas in this group typed head (assuming there could be more than 1 in the future)
    replicas: 1
    # logical group name, for this called head-group, also can be functional
    # pod type head or worker
    # rayNodeType: head # Not needed since it is under the headgroup
    # the following params are used to complete the ray start: ray start --head --block --port=6379 ...
    rayStartParams:
      # Flag "no-monitor" must be set when running the autoscaler in
      # a sidecar container.
      no-monitor: "true"
      port: "6379"
      object-manager-port: "9999"
      node-manager-port: "9998"
      object-store-memory: "100000000"
      dashboard-host: "0.0.0.0"
      node-ip-address: $MY_POD_IP # auto-completed as the head pod IP
      block: "true"
      # Use `resources` to optionally specify custom resource annotations for the Ray node.
      # The value of `resources` is a string-integer mapping.
      # Currently, `resources` must be provided in the unfortunate format demonstrated below.
      # Moreover, "CPU" and "GPU" should NOT be included in the `resources` arg.
      # (Use `num-cpus` and `num-gpus` rayStartParams instead.)
      # resources: '"{\"Custom1\": 1, \"Custom2\": 5}"'
    #pod template
    template:
      metadata:
        labels:
          # custom labels. NOTE: do not define custom labels start with `raycluster.`, they may be used in controller.
          # Refer to https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/
          rayCluster: raycluster # will be injected if missing
          rayNodeType: head # will be injected if missing, must be head or worker
          groupName: headgroup # will be injected if missing
      spec:
        containers:
          # The Ray head pod
          - name: ray-head
            # securityContext:
            #   runAsUser: 0
            # All Ray pods in the RayCluster should use the same version of Ray.
            image: ray:2.0.0-py39
            imagePullPolicy: Always
            # The KubeRay operator uses the ports specified on the ray-head container
            # to configure a service targeting the ports.
            # The name of the service is <ray cluster name>-head-svc.
            ports:
              - containerPort: 6379
                name: gcs
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
            env:
              - name: CPU_REQUEST
                valueFrom:
                  resourceFieldRef:
                    containerName: ray-head
                    resource: requests.cpu
              - name: CPU_LIMITS
                valueFrom:
                  resourceFieldRef:
                    containerName: ray-head
                    resource: limits.cpu
              - name: MEMORY_LIMITS
                valueFrom:
                  resourceFieldRef:
                    containerName: ray-head
                    resource: limits.memory
              - name: MEMORY_REQUESTS
                valueFrom:
                  resourceFieldRef:
                    containerName: ray-head
                    resource: requests.memory
              - name: MY_POD_IP
                valueFrom:
                  fieldRef:
                    fieldPath: status.podIP
            lifecycle:
              preStop:
                exec:
                  command: ["/bin/sh", "-c", "ray stop"]
            resources:
              limits:
                cpu: "20"
                memory: "20Gi"
              requests:
                cpu: "10"
                memory: "20Gi"
  workerGroupSpecs:
    # the pod replicas in this group typed worker
    - replicas: 1
      minReplicas: 1
      maxReplicas: 40
      groupName: cpu-group
      rayStartParams:
        node-ip-address: $MY_POD_IP
        block: "true"
      #pod template
      template:
        spec:
          initContainers:
            # the env var $RAY_IP is set by the operator if missing, with the value of the head service name
            - name: init-myservice
              image: busybox:1.28
              command:
                [
                  "sh",
                  "-c",
                  "until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done",
                ]
          containers:
            - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name', or '123-abc'
              # All Ray pods in the RayCluster should use the same version of Ray.
              image: ray:2.0.0-py39
              env:
                - name: RAY_DISABLE_DOCKER_CPU_WARNING
                  value: "1"
                - name: TYPE
                  value: "worker"
                - name: CPU_REQUEST
                  valueFrom:
                    resourceFieldRef:
                      containerName: ray-worker
                      resource: requests.cpu
                - name: CPU_LIMITS
                  valueFrom:
                    resourceFieldRef:
                      containerName: ray-worker
                      resource: limits.cpu
                - name: MEMORY_LIMITS
                  valueFrom:
                    resourceFieldRef:
                      containerName: ray-worker
                      resource: limits.memory
                - name: MEMORY_REQUESTS
                  valueFrom:
                    resourceFieldRef:
                      containerName: ray-worker
                      resource: requests.memory
                - name: MY_POD_NAME
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.name
                - name: MY_POD_IP
                  valueFrom:
                    fieldRef:
                      fieldPath: status.podIP
              ports:
                - containerPort: 80
              lifecycle:
                preStop:
                  exec:
                    command: ["/bin/sh", "-c", "ray stop"]
              volumeMounts:
                - mountPath: /var/log
                  name: log-volume
              resources:
                limits:
                  cpu: "16"
                  memory: "28Gi"
                requests:
                  cpu: "1"
                  memory: "28Gi"
          volumes:
            - name: log-volume
              emptyDir: {}
And a simple script would be:
from ray import tune

analysis = tune.run(
    "PPO",
    stop={"episode_reward_mean": 100},
    config={
        "env": "CartPole-v1",
        "num_gpus": 0,
        "num_workers": 4,
        "lr": tune.grid_search([0.01, 0.001]),
    },
    local_dir="/storage/",
)
Just re-confirming -- Is the following correct:
Got it, so the sequence of events is
I will try this out to see what's going on.
Exactly. FYI, the workers become scheduled if I manually scale them up.
Here are the additional logs:
Demands:
{'CPU': 1.0} * 17 (PACK): 6+ pending placement groups
2022-08-29 14:19:36,207 WARNING resource_demand_scheduler.py:839 -- The autoscaler could not find a node type to satisfy the request: [{'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU':
@DmitriGekhtman Looks like the issue is when "workers" exist in the Cluster it gets confused. I triggered three jobs at the same time. The first one scheduled workers fine, but then I received the error I showed above after submitting the second and third job. I set max workers to 0 (to force delete all workers), then increased it back up to 40. After increasing this number, it spawned the required resources for the first job but not the others.
For example:
Demands:
{'CPU': 1.0}: 2+ pending tasks/actors (2+ using placement groups)
{'CPU': 1.0} * 17 (PACK): 25+ pending placement groups
2022-08-29 14:49:02,751 WARNING resource_demand_scheduler.py:839 -- The autoscaler could not find a node type to satisfy the request: [{'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CP
Huh, well the workers have 16 CPUs each, so I guess it makes sense that {'CPU': 1.0} * 17 (PACK) would have some trouble scheduling.
PACK would prioritize scheduling all 17 1-CPU requests on the same node.
@DmitriGekhtman That makes sense; however, I see exactly the same behavior after scaling each worker up.
I increased the CPU for each worker to 30 and the memory to 40Gi; but if PACK is scheduling on the same machine, then there are definitely not enough resources on that specific node.
However, the docs mention that it will schedule elsewhere if it is unable to schedule onto a single node? In addition to Tune, here are some docs that are informative (also posting for others who read this thread).
All provided bundles are packed onto a single node on a best-effort basis. If strict packing is not feasible (i.e., some bundles do not fit on the node), bundles can be placed onto other nodes.
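To make the difference between soft and strict packing concrete, here is a minimal toy model in plain Python. It is only a sketch of the behavior described in the quoted docs, not Ray's scheduler: the function name `pack_bundles` and the greedy fill order are assumptions made for illustration.

```python
from typing import Dict, List


def pack_bundles(
    num_bundles: int,
    cpus_per_bundle: float,
    free_cpus_per_node: List[float],
    strict: bool = False,
) -> Dict[int, int]:
    """Toy model of PACK: fill the first node, then spill onto later nodes.

    Returns a mapping of node index -> number of bundles placed there.
    Hypothetical helper for illustration only; not part of Ray.
    """
    placement: Dict[int, int] = {}
    remaining = num_bundles
    for node_index, free in enumerate(free_cpus_per_node):
        fit = min(remaining, int(free // cpus_per_bundle))
        if fit > 0:
            placement[node_index] = fit
            remaining -= fit
        if remaining == 0:
            break
    if remaining > 0:
        # In a real cluster this is the point where upscaling is needed.
        raise RuntimeError(f"{remaining} bundle(s) do not fit on the existing nodes")
    if strict and len(placement) > 1:
        raise RuntimeError("strict packing requires all bundles on one node")
    return placement


# 17 one-CPU bundles on two 16-CPU workers: soft PACK spills one bundle over.
print(pack_bundles(17, 1.0, [16.0, 16.0]))
```

Under this model, 17 one-CPU bundles do not fit on a single 16-CPU worker, but soft PACK can still place them once a second worker exists. That is why the demand should trigger upscaling rather than fail outright.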
Also, below is when I scale up CPUs.
Demands:
{'CPU': 1.0} * 9 (PACK): 10+ pending placement groups
2022-08-29 15:14:55,032 WARNING resource_demand_scheduler.py:839 -- The autoscaler could not find a node type to satisfy the request: [{'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0},
Yeah, it is true that PACK is a "soft" constraint... starting to poke around now.
Just for purposes of documenting my progress here:
In the config above, I substituted the image rayproject/ray-ml:2.0.0-py38, and in the script, I removed the argument local_dir="/storage/".
Modified config:
# This is adapted from https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.complete.yaml
# It is a general RayCluster that has most fields in it for maximum flexibility in the Ray/Kuberay integration MVP.
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  labels:
    controller-tools.k8s.io: "1.0"
    # An unique identifier for the head node and workers of this cluster.
  name: raycluster-complete
spec:
  rayVersion: "2.0"
  # With enableInTreeAutoscaling: true, the operator will insert an autoscaler sidecar container into the Ray head pod.
  enableInTreeAutoscaling: true
  ######################headGroupSpecs#################################
  # head group template and specs, (perhaps 'group' is not needed in the name)
  headGroupSpec:
    # Kubernetes Service Type, valid values are 'ClusterIP', 'NodePort' and 'LoadBalancer'
    serviceType: ClusterIP
    # the pod replicas in this group typed head (assuming there could be more than 1 in the future)
    replicas: 1
    # logical group name, for this called head-group, also can be functional
    # pod type head or worker
    # rayNodeType: head # Not needed since it is under the headgroup
    # the following params are used to complete the ray start: ray start --head --block --port=6379 ...
    rayStartParams:
      # Flag "no-monitor" must be set when running the autoscaler in
      # a sidecar container.
      no-monitor: "true"
      port: "6379"
      object-manager-port: "9999"
      node-manager-port: "9998"
      object-store-memory: "100000000"
      dashboard-host: "0.0.0.0"
      node-ip-address: $MY_POD_IP # auto-completed as the head pod IP
      block: "true"
      # Use `resources` to optionally specify custom resource annotations for the Ray node.
      # The value of `resources` is a string-integer mapping.
      # Currently, `resources` must be provided in the unfortunate format demonstrated below.
      # Moreover, "CPU" and "GPU" should NOT be included in the `resources` arg.
      # (Use `num-cpus` and `num-gpus` rayStartParams instead.)
      # resources: '"{\"Custom1\": 1, \"Custom2\": 5}"'
    #pod template
    template:
      metadata:
        labels:
          # custom labels. NOTE: do not define custom labels start with `raycluster.`, they may be used in controller.
          # Refer to https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/
          rayCluster: raycluster # will be injected if missing
          rayNodeType: head # will be injected if missing, must be head or worker
          groupName: headgroup # will be injected if missing
      spec:
        containers:
          # The Ray head pod
          - name: ray-head
            # securityContext:
            #   runAsUser: 0
            # All Ray pods in the RayCluster should use the same version of Ray.
            image: rayproject/ray-ml:2.0.0-py38
            imagePullPolicy: Always
            # The KubeRay operator uses the ports specified on the ray-head container
            # to configure a service targeting the ports.
            # The name of the service is <ray cluster name>-head-svc.
            ports:
              - containerPort: 6379
                name: gcs
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
            env:
              - name: CPU_REQUEST
                valueFrom:
                  resourceFieldRef:
                    containerName: ray-head
                    resource: requests.cpu
              - name: CPU_LIMITS
                valueFrom:
                  resourceFieldRef:
                    containerName: ray-head
                    resource: limits.cpu
              - name: MEMORY_LIMITS
                valueFrom:
                  resourceFieldRef:
                    containerName: ray-head
                    resource: limits.memory
              - name: MEMORY_REQUESTS
                valueFrom:
                  resourceFieldRef:
                    containerName: ray-head
                    resource: requests.memory
              - name: MY_POD_IP
                valueFrom:
                  fieldRef:
                    fieldPath: status.podIP
            lifecycle:
              preStop:
                exec:
                  command: ["/bin/sh", "-c", "ray stop"]
            resources:
              limits:
                cpu: "20"
                memory: "20Gi"
              requests:
                cpu: "10"
                memory: "20Gi"
  workerGroupSpecs:
    # the pod replicas in this group typed worker
    - replicas: 1
      minReplicas: 1
      maxReplicas: 40
      groupName: cpu-group
      rayStartParams:
        node-ip-address: $MY_POD_IP
        block: "true"
      #pod template
      template:
        spec:
          initContainers:
            # the env var $RAY_IP is set by the operator if missing, with the value of the head service name
            - name: init-myservice
              image: busybox:1.28
              command:
                [
                  "sh",
                  "-c",
                  "until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done",
                ]
          containers:
            - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name', or '123-abc'
              # All Ray pods in the RayCluster should use the same version of Ray.
              image: rayproject/ray-ml:2.0.0-py38
              env:
                - name: RAY_DISABLE_DOCKER_CPU_WARNING
                  value: "1"
                - name: TYPE
                  value: "worker"
                - name: CPU_REQUEST
                  valueFrom:
                    resourceFieldRef:
                      containerName: ray-worker
                      resource: requests.cpu
                - name: CPU_LIMITS
                  valueFrom:
                    resourceFieldRef:
                      containerName: ray-worker
                      resource: limits.cpu
                - name: MEMORY_LIMITS
                  valueFrom:
                    resourceFieldRef:
                      containerName: ray-worker
                      resource: limits.memory
                - name: MEMORY_REQUESTS
                  valueFrom:
                    resourceFieldRef:
                      containerName: ray-worker
                      resource: requests.memory
                - name: MY_POD_NAME
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.name
                - name: MY_POD_IP
                  valueFrom:
                    fieldRef:
                      fieldPath: status.podIP
              ports:
                - containerPort: 80
              lifecycle:
                preStop:
                  exec:
                    command: ["/bin/sh", "-c", "ray stop"]
              volumeMounts:
                - mountPath: /var/log
                  name: log-volume
              resources:
                limits:
                  cpu: "16"
                  memory: "28Gi"
                requests:
                  cpu: "1"
                  memory: "28Gi"
          volumes:
            - name: log-volume
              emptyDir: {}
Modified script:
from ray import tune

analysis = tune.run(
    "PPO",
    stop={"episode_reward_mean": 100},
    config={
        "env": "CartPole-v1",
        "num_gpus": 0,
        "num_workers": 4,
        "lr": tune.grid_search([0.01, 0.001]),
    },
)
Now, my issue is that there weren't enough tasks triggered to cause upscaling, and the job ran to completion successfully.
@peterghaddad could you suggest some parameters for the tune job that more closely resemble the ones for your workload?
Maybe more workers or more CPUs per worker? And/or more computationally intensive trials?
Hi @DmitriGekhtman, sure thing!
import numpy as np
from ray import tune

analysis = tune.run(
    "PPO",
    stop={"episode_reward_mean": 200000},
    config={
        "env": "CartPole-v1",
        "num_gpus": 0,
        "num_workers": 8,
        "lr": tune.grid_search(list(np.arange(0.0, 0.999, 0.2))),
    },
)
I have submitted jobs back-to-back (2-3). Is there any other information which may be helpful?
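For readers following along, the CPU demand of the script above can be estimated with a few lines of plain Python. This is a rough sketch under stated assumptions: each PPO trial is assumed to request one 1-CPU bundle for the driver plus one 1-CPU bundle per rollout worker (which is consistent with the `{'CPU': 1.0} * 9 (PACK)` demand in the logs for `num_workers=8`), and the helper names are hypothetical.

```python
import math


def arange_length(start: float, stop: float, step: float) -> int:
    """Number of elements np.arange(start, stop, step) produces: ceil((stop - start) / step)."""
    return max(0, math.ceil((stop - start) / step))


def ppo_trial_bundles(num_workers: int, cpus_for_driver: float = 1.0, cpus_per_worker: float = 1.0):
    """Assumed placement-group bundles for one trial: driver first, then one per worker."""
    return [{"CPU": cpus_for_driver}] + [{"CPU": cpus_per_worker}] * num_workers


num_trials = arange_length(0.0, 0.999, 0.2)   # grid_search over the lr values
bundles = ppo_trial_bundles(num_workers=8)     # bundles per trial
cpus_per_job = num_trials * sum(b["CPU"] for b in bundles)

print(num_trials, len(bundles), cpus_per_job)
```

With these assumptions, one job asks for 5 placement groups of 9 CPUs each, 45 CPUs in total, so a single 16-CPU worker cannot hold it and submitting 2-3 jobs multiplies the demand accordingly.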
Hey @DmitriGekhtman I triggered some additional experiments with minimum resources after submitting one that required larger resources (what I posted above)
Same error, but it's only trying to place 4 CPUs for the smaller experiment.
{'CPU': 1.0} * 4 (PACK): 1+ pending placement groups
Below is the code:
from ray.air.config import RunConfig, ScalingConfig
from ray.train.rl import RLTrainer

trainer = RLTrainer(
    run_config=RunConfig(stop={"training_iteration": 1}),
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
    algorithm="PPO",
    config={
        "env": "CartPole-v1",
        "framework": "tf",
        "evaluation_num_workers": 1,
        "evaluation_interval": 1,
        "evaluation_config": {"input": "sampler"},
    },
)
t = trainer.fit()
I also tried specifying placement_strategy="SPREAD" for the scaling_config, but it still registers as PACK, so I haven't been able to test whether SPREAD works.
back-to-back
Were the submissions within seconds of each other? Or does back-to-back mean you waited for one to complete before submitting another?
Also, could you share what your underlying K8s node infrastructure looks like?
I've just submitted 3 instances of the tuning job to a Ray cluster running on GKE autopilot. Will see what happens!
I'm seeing upscaling with {'CPU': 1.0} * 9 (PACK): 10+ pending placement groups -- I imagine those should be scheduled successfully, but we'll see.
Submitting three of these jobs in quick succession resulted in upscaling of 8 Ray worker pods. No errors yet -- will just let those jobs keep running for a while and then will try submitting again after they finish running and I see downscaling.
By the way, running ray status in the head pod could reveal useful autoscaling information.
Autoscaler logs could also be useful to see -- these logs stream to the autoscaler container's stdout and to /tmp/ray/session_latest/logs/monitor.log in the container's file system.
I'm unfortunately not able to reproduce the issue yet -- if you could share the autoscaler's logs, that would be helpful.
Hi @DmitriGekhtman, thank you for troubleshooting!
By back-to-back I mean that I submit 2-3 jobs about a minute apart. The first one scales up, but the autoscaler is then unable to scale for the second and third.
As an FYI, I gave my workers more CPU and memory, hence fewer nodes. Let me know if you would like me to revert and try again. The worker resources are now:
resources:
  limits:
    cpu: '30'
    memory: 40Gi
  requests:
    cpu: '30'
    memory: 40Gi
I ran ray status on the head node and this is what I am receiving:
======== Autoscaler status: 2022-08-30 20:38:48.731382 ========
Node status
---------------------------------------------------------------
Healthy:
1 head-group
3 cpu-group
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
91.0/110.0 CPU (91.0 used of 108.0 reserved in placement groups)
0.0/32.0 GPU
0.00/140.000 GiB memory
0.00/36.004 GiB object_store_memory
Demands:
{'CPU': 1.0} * 9 (PACK): 10+ pending placement groups. 2022-08-30 20:44:25,562 WARNING resource_demand_scheduler.py:839 -- The autoscaler could not find a node type to satisfy the request: [{'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}]. Please specify a node type with the necessary resources.
I'm kicking off another job which requires fewer CPUs now.
Below are the autoscaling logs for this job.
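As a reading aid for the `ray status` output above, the following sketch pulls the relevant numbers out of the text and compares the pending demand with the CPUs that are not yet reserved by placement groups. The regular expressions and function names are assumptions written against the excerpt in this thread, not a supported Ray API, and the output format may differ between Ray versions.

```python
import re

# Excerpt of the `ray status` output posted in this thread.
STATUS_SNIPPET = """\
Usage:
 91.0/110.0 CPU (91.0 used of 108.0 reserved in placement groups)
 0.0/32.0 GPU
Demands:
 {'CPU': 1.0} * 9 (PACK): 10+ pending placement groups
"""


def parse_cpu_usage(status_text: str) -> dict:
    """Extract used, total, and placement-group-reserved CPUs (hypothetical helper)."""
    match = re.search(
        r"([\d.]+)/([\d.]+) CPU \(([\d.]+) used of ([\d.]+) reserved", status_text
    )
    if match is None:
        raise ValueError("CPU usage line not found")
    used, total, _, reserved = (float(g) for g in match.groups())
    return {"used": used, "total": total, "reserved": reserved, "unreserved": total - reserved}


def parse_total_gpus(status_text: str) -> float:
    """Extract the total GPU count the cluster reports (hypothetical helper)."""
    match = re.search(r"[\d.]+/([\d.]+) GPU", status_text)
    return float(match.group(1)) if match else 0.0


def parse_pack_demand_cpus(status_text: str) -> float:
    """CPUs needed by one pending PACK placement group (hypothetical helper)."""
    match = re.search(r"\{'CPU': ([\d.]+)\} \* (\d+) \(PACK\)", status_text)
    if match is None:
        raise ValueError("PACK demand line not found")
    return float(match.group(1)) * int(match.group(2))


usage = parse_cpu_usage(STATUS_SNIPPET)
demand = parse_pack_demand_cpus(STATUS_SNIPPET)
print(usage["unreserved"], demand, parse_total_gpus(STATUS_SNIPPET))
```

On this excerpt, only 2 of the 110 CPUs are outside existing placement groups while each pending group needs 9, so the pending groups can only be satisfied by adding nodes. The output also shows that the cluster reports 32 GPUs even though the jobs request none.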
I wonder if it has to do with the presence of GPUs -- there is some logic in the autoscaler which tries to avoid scaling up GPU nodes for CPU tasks.
What if you set
rayStartParams:
  num-gpus: "0"
for the head and workers?
I have a very strong hunch that this is what happened:
After upscaling a worker the first time, the autoscaler detected from the running worker's load metrics that the worker has access to GPUs. After realizing that the worker node type has GPUs, the autoscaler tried to avoid scaling up the GPU nodes to schedule CPU tasks, which prevented all further upscaling.
In other words, it's likely an instance of this issue: https://github.com/ray-project/ray/issues/20476 whose priority I will now bump up.
I do expect that setting the override
rayStartParams:
  num-gpus: "0"
will resolve the issue in this case.
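The hunch above can be illustrated with a small toy model. This is a simplified sketch, not the autoscaler's actual code: the function `pick_node_type`, the `conserve_gpu_nodes` flag, and the per-worker GPU count of 8 are all assumptions chosen for illustration.

```python
from typing import Dict, List, Optional

NodeTypes = Dict[str, Dict[str, float]]


def pick_node_type(
    node_types: NodeTypes,
    demand: List[Dict[str, float]],
    conserve_gpu_nodes: bool = True,
) -> Optional[str]:
    """Return a node type that can host the demand, or None if none qualifies.

    Toy rule: GPU node types are skipped when no bundle in the demand asks for a GPU.
    """
    any_gpu_demand = any(bundle.get("GPU", 0) > 0 for bundle in demand)
    for name, resources in node_types.items():
        is_gpu_node = resources.get("GPU", 0) > 0
        if conserve_gpu_nodes and is_gpu_node and not any_gpu_demand:
            continue  # avoid launching GPU nodes for CPU-only work
        fits_every_bundle = all(
            resources.get(key, 0) >= amount
            for bundle in demand
            for key, amount in bundle.items()
        )
        if fits_every_bundle:
            return name
    return None


cpu_only_demand = [{"CPU": 1.0}] * 9

# After the first worker starts, its detected GPUs mark the only worker type as a GPU type.
detected = {"cpu-group": {"CPU": 30.0, "GPU": 8.0}}   # GPU count is illustrative
# With rayStartParams num-gpus: "0", the worker type advertises no GPUs.
overridden = {"cpu-group": {"CPU": 30.0, "GPU": 0.0}}

print(pick_node_type(detected, cpu_only_demand))    # None
print(pick_node_type(overridden, cpu_only_demand))  # cpu-group
```

In this model the only worker type is excluded once it is seen as a GPU type, which leaves no candidate for CPU-only demand and matches the "could not find a node type to satisfy the request" warning. Advertising zero GPUs makes the same type eligible again.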
Would you look at that!
Solved it! I appreciate all of the help.
Glad we got to the bottom of it!
Closing this as a duplicate of the underlying issue.
What happened + What you expected to happen
I am running the PPO Trainer with num_workers set to 8. When I first launch an experiment after creating a new Ray cluster, everything gets scheduled and the autoscaler reports no errors. After the experiment completes and I resubmit the job, I get the following error.
My CPU Worker has the following resources: Request 8 CPUs and Limit 16 CPUs with sufficient memory.
I was able to bypass this error by creating a new worker node called cpu-worker-small that has 1 CPU and 10Gi of memory; however, the actors then fail unexpectedly (probably due to resource constraints).
I saw some similar issues, such as https://github.com/ray-project/ray/issues/12441, but it seems that after the first experiment runs there are no resource demands in the autoscaler. I am using default settings from KubeRay.
After running the first experiment I check the following autoscaler state: This means that this issue https://github.com/ray-project/ray/issues/24259 made it into Ray 1.12.1
Versions / Dependencies
Ray 2.0 Kuberay on a Kubernetes Cluster (latest version).
Reproduction script
PPO with the following Configuration: { "num_gpus" : 1, "num_workers" : 2, "num_sgd_iter" : 60, "train_batch_size" : 12000, }
Issue Severity
High: It blocks me from completing my task.