ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Integrate with autoscaler to improve error message: The actor or task with ID [] is pending and cannot currently be scheduled #8326

Closed J1810Z closed 3 years ago

J1810Z commented 4 years ago

After setting up a new conda environment with ray, I am running into the issue that ray complains about insufficient resources. While the actors sometimes still start after about 30s, most of the time my Python program gets stuck at this point.

I am initializing ray from within my Python script, which runs on a node scheduled by Slurm. Access to CPUs and GPUs is limited via cgroups. psutil.Process().cpu_affinity() gives me the correct number of available cores, which is higher than the resources ray needs. Interestingly, I didn't run into this issue in my previous conda environment.
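For reference, this is roughly how the detected core count could be passed to ray explicitly instead of relying on auto-detection (a sketch; it assumes psutil is installed and that the cgroup limits are reflected in the CPU affinity mask):

import psutil
import ray

# Cores actually granted by Slurm (via cgroups / CPU affinity).
available_cores = len(psutil.Process().cpu_affinity())

# Tell ray explicitly instead of letting it auto-detect the host's cores.
ray.init(num_cpus=available_cores)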

The error message does not help much: 2020-05-05 17:17:32,657 WARNING worker.py:1072 -- The actor or task with ID ffffffffffffffff45b95b1c0100 is pending and cannot currently be scheduled. It requires {object_store_memory: 0.048828 GiB} for execution and {CPU: 1.000000}, {object_store_memory: 0.048828 GiB} for placement, but this node only has remaining {node:192.168.7.50: 1.000000}, {CPU: 28.000000}, {memory: 30.029297 GiB}, {GPU: 1.000000}, {object_store_memory: 10.351562 GiB}. In total there are 0 pending tasks and 3 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.

rkooo567 commented 4 years ago

@edoakes @wuisawesome I also see this warning when there are plenty of resources. Could this be a scheduler issue?

rkooo567 commented 4 years ago

It turns out this error occurs as a warning of a resource deadlock. Here is a related issue: https://github.com/ray-project/ray/issues/5790.

GoingMyWay commented 4 years ago

@J1810Z @rkooo567 Same issue in Ray 0.8.4: the task got stuck and stayed pending even though there are enough CPUs and memory. There are 2 rollout workers, each sampling a sample_batch of 512.

Ray version: 0.8.4
OS: Linux
num_workers: 2
batch_sample_size/rollout_fragment_length: 512
CPUs in machine: 41

WARNING worker.py:1072 -- The actor or task with ID ffffffffffffffff45b95b1c0100 is pending and cannot currently be scheduled. It requires {CPU: 20.000000} for execution and {CPU: 20.000000} for placement, but this node only has remaining {node:100.102.34.3: 1.000000}, {CPU: 56.000000}, {memory: 148.095703 GiB}, {GPU: 8.000000}, {object_store_memory: 46.533203 GiB}. In total there are 0 pending tasks and 2 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.

I do not know why it requires {CPU: 20.000000} for execution and {CPU: 20.000000} for placement; it looks confusing.

ericl commented 4 years ago

Is it possible to reproduce this with a dummy program that can be shared?

GoingMyWay commented 4 years ago

Is it possible to reproduce this with a dummy program that can be shared?

Hi, I will send you the code later. I run my code in Docker, specifying CPU=41 and Memory=120GB, on a machine with 128 or more CPUs and 256GB or more memory. But it returned

The actor or task with ID ffffffffffffffff45b95b1c0100 is pending and cannot currently be scheduled. It requires {CPU: 20.000000} for execution and {CPU: 20.000000} for placement, but this node only has remaining {node:9.146.140.151: 1.000000}, {CPU: 180.000000}, {memory: 271.191406 GiB}, {object_store_memory: 82.958984 GiB}. In total there are 0 pending tasks and 4 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.

As you can see, Ray detected 180 CPUs and 271 GB of memory, but I only assigned 41 CPUs and 120 GB of memory to my task. So, do you think Ray detects the whole machine's resources even when running in Docker with specified resource limits?
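One workaround I am considering is telling Ray the limits explicitly instead of relying on auto-detection. A minimal sketch (the numbers are just the ones from my setup, and the ray.init arguments are assumed to be available in this Ray version):

import ray

# Cap Ray at the resources assigned to the container rather than
# letting it auto-detect the whole host machine.
ray.init(
    num_cpus=41,
    memory=120 * 1024 ** 3,              # 120 GB heap limit
    object_store_memory=20 * 1024 ** 3,  # illustrative object store cap
)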

pitoupitou commented 4 years ago

Hi, I have got the same error and it's difficult to understand since I have lots of resources.

I am also running my code with Docker, on a machine that should have enough capacity (16 CPUs, 4 Tesla V100 GPUs, 244GB memory). I tried different versions of Ray, but they all trigger the same error.

Interestingly, there is no error when I define the number of Ray workers as being 1. But when it is any number >1, the error keeps appearing.

Notes:

How can I send you my code?

ericl commented 4 years ago

Can you trim down your code to a minimal script that can be pasted here?

GoingMyWay commented 4 years ago

@pitoupitou Hi, what task are you working on?

ericl commented 4 years ago

Moving to p0 since this has come up anecdotally several times now. Unfortunately it seems difficult to reproduce :/

GoingMyWay commented 4 years ago

I think the main cause of this issue is running a task with Docker on a machine that has many CPUs and much memory but only a few left for the current task: Ray detects the whole machine's resource setup, but much of it is not actually accessible, and as a consequence the task stays pending (in offline training, Ray can still read the data until OOM). Another thing: for my task, only 40GB of memory is needed, but if you assign a memory limit to the Docker container (using Ray's default memory-related settings), say 45GB, OOM occurred. So I assigned an 80GB memory limit to the container; only 45% was used, OOM did not occur, and the task finished successfully. My concern is that memory management is not very efficient in the current version of Ray (0.8.4).

rkooo567 commented 4 years ago

@ericl What's our plan to resolve this issue? We will have a new release in about 3 weeks. Will anyone handle this issue?

ericl commented 4 years ago

I'm not sure we can do a lot without a reproduction case that can be run locally or in the public cloud. The reports above are not very clear.

GoingMyWay commented 4 years ago

I'm not sure we can do a lot without a reproduction case that can be run locally or in the public cloud. The reports above are not very clear.

Hi, I sent you the code before. I think you can use my code to reproduce this issue; it occurs very often in the cloud.

ericl commented 4 years ago

Can you reproduce it on a single node (a laptop) without Docker? The hardware configuration above is not easy to reproduce.

rkooo567 commented 4 years ago

The problem seems to be that Ray doesn't detect cgroup resources. I found a related issue: https://github.com/benoitc/gunicorn/issues/2028

(We detect the CPU count from multiprocessing.cpu_count() and memory by reading cgroup memory info.) https://github.com/ray-project/ray/blob/a73c488c74b1e01da3961db2eb538c43c29753f5/python/ray/resource_spec.py#L138
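To see whether the two disagree inside a container, here is a quick sketch comparing what Python reports against the cgroup CPU quota (cgroup v1 paths assumed; they may differ on your system):

import multiprocessing

# What CPU auto-detection would see on this node.
print("cpu_count:", multiprocessing.cpu_count())

# cgroup v1 CPU quota; -1 means no quota is set.
with open("/sys/fs/cgroup/cpu/cpu.cfs_quota_us") as f:
    quota = int(f.read())
with open("/sys/fs/cgroup/cpu/cpu.cfs_period_us") as f:
    period = int(f.read())
print("cgroup cpu limit:", quota / period if quota > 0 else "unlimited")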

To verify it, we need to be able to reproduce the issue. As Eric said, it'll be very helpful if you can provide a small reproducible script + env.

ericl commented 4 years ago

I don't think that's related. The problem is that the scheduler is reporting resources as available while also saying it can't schedule any more tasks. Detecting the wrong resources might cause unexpected crashes, but not scheduler hangs.

J1810Z commented 4 years ago

Since the problem occurs in my case while running ray on a slurm cluster, I am not able to reproduce it on a single machine. My minimal setup would still require at least two VMs (one control and one compute node).

Interestingly, the issue only occurs if ray is installed in a user-specific conda environment. If I install ray in the conda environment of the root user, I don't experience any issues.

J1810Z commented 4 years ago

After further experimentation, I think that I figured out what is happening.

On initialization, ray starts a worker process for each identified core. If an actor is instantiated immediately after initialization, these worker processes are not yet ready and ray attempts to create a new worker process. As a result, the number of worker processes exceeds the number of cores. If the worker to be started has specific CPU requirements (num_cpus=1), this results in the resource error.

Adding a time delay between the initialization of ray and the actor instantiation resolves the issue. In that case, existing idle worker processes are used for the new actor.
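As a rough sketch, the workaround looks like this (the delay length is just what happens to work in my setup):

import time
import ray

ray.init()
# Give the pre-started worker processes time to register before
# instantiating actors, so no extra workers need to be spawned.
time.sleep(40)

# ... instantiate actors as usual ...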

Since this problem seems to be unrelated to slurm or docker, I am trying to build a minimal example that can be shared.

ericl commented 4 years ago

@J1810Z any hints at how to create a reproduction? I've tried a couple simple things with starting actors quickly while workers are tied up executing tasks, but no luck so far.

J1810Z commented 4 years ago

Sorry, it took me a little bit longer to reproduce this issue. It seems to be more specific to my setup than I thought: I am running ray within a conda environment on an NFS share.

If I set lookupcache to none, the described issue occurs: ray creates additional worker processes instead of using the ones started upon initialization. The issue does not occur if lookupcache is set to the default value.

Interestingly, I have not yet been able to reliably reproduce the error message from above. On some systems, ray just creates the additional processes and starts without any issues. On other systems, the warning message occurs.

It would be interesting to know whether these additional processes are also created in the docker case or if both issues are completely unrelated.

With my NFS setup, the problem can be reproduced with the following minimal example:

import ray
import time

ray.init()

@ray.remote(num_cpus=1)
class TestActor():
    def __init__(self):
        time.sleep(60)

actor_list = []
for i in range(11):  # I am running this example on a 12 core system
    actor_list.append(TestActor.remote())

time.sleep(180)

ericl commented 4 years ago

Interesting, that suggests it's a race condition triggered by the speed of worker startup, though I don't know how NFS comes into the picture exactly.

Thanks! This is probably enough of a hint to reproduce it by injecting artificial delays into the raylet worker pool for testing.

J1810Z commented 4 years ago

This makes sense. By deactivating the lookupcache, I am increasing the latency for accessing ray files and probably delaying the worker startup.

I hope it can be reproduced with artificial delays.

GoingMyWay commented 4 years ago

@ericl

I think I have fixed the memory leak issue when using TF 2.x with eager tracing on offline data. The problem is in tf.function: in SyncSamplesOptimizer, the length of the sampled data can exceed train_batch_size, so the length of the input data varies every time, which can cause OOM in TF 2.x.

A simple workaround is to call .slice so that the length of the sampled data equals self.train_batch_size, as shown below. The cost is that the last trajectory in the sampled data may be truncated by slicing, but I think it will not harm the performance of RL algorithms, since the truncated data is small, only zero or one trajectory per batch.

    def step(self):
        with self.update_weights_timer:
            if self.workers.remote_workers():
                weights = ray.put(self.workers.local_worker().get_weights())
                for e in self.workers.remote_workers():
                    e.set_weights.remote(weights)

        with self.sample_timer:
            samples = []
            while sum(s.count for s in samples) < self.train_batch_size:
                if self.workers.remote_workers():
                    samples.extend(
                        ray_get_and_free([
                            e.sample.remote()
                            for e in self.workers.remote_workers()
                        ]))
                else:
                    samples.append(self.workers.local_worker().sample())
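            # Workaround: truncate the concatenated batch to exactly train_batch_size
            # so the tf.function input length stays fixed (avoids the OOM described above).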
            samples = SampleBatch.concat_samples(samples).slice(start=0, end=self.train_batch_size)

            self.sample_timer.push_units_processed(samples.count)

        with self.grad_timer:
            fetches = do_minibatch_sgd(samples, self.policies,
                                       self.workers.local_worker(),
                                       self.num_sgd_iter,
                                       self.sgd_minibatch_size,
                                       self.standardize_fields)
        self.grad_timer.push_units_processed(samples.count)

        if len(fetches) == 1 and DEFAULT_POLICY_ID in fetches:
            self.learner_stats = fetches[DEFAULT_POLICY_ID]
        else:
            self.learner_stats = fetches
        self.num_steps_sampled += samples.count
        self.num_steps_trained += samples.count
        return self.learner_stats

GoingMyWay commented 4 years ago

This makes sense. By deactivating the lookupcache, I am increasing the latency for accessing ray files and probably delaying the worker startup.

I hope it can be reproduced with artificial delays.

Hi, do you have a working solution? Say, adding time.sleep(60) after ray.init(): does it work all the time? @ericl Is it possible to avoid this issue by adding time.sleep(60) after ray.init()?

J1810Z commented 4 years ago

Yes, adding time.sleep() works for me (as a workaround) if the sleep time is long enough. In my setup 40s works pretty well.

GoingMyWay commented 4 years ago

Yes, adding time.sleep() works for me (as a workaround) if the sleep time is long enough. In my setup 40s works pretty well.

Looks like a simple and practical workaround.

PidgeyBE commented 4 years ago

FWIW: Some of the people in my company are also facing this. We have CI/CD tests that check whether ray resources are released after cleaning up actors, and on 0.8.3 these were running fine. However, when upgrading to 0.8.5 we had to add a backoff mechanism (with time.sleep()s) to make the tests succeed. It seems the resource bookkeeping got slower for some reason, leaving more room for race conditions...

GoingMyWay commented 4 years ago

@PidgeyBE Hi, have you managed to solve it?

PidgeyBE commented 4 years ago

@GoingMyWay Hi! It turned out my colleague's issue was elsewhere, but still, when upgrading to 0.8.5 I had to add some time.sleep()s to wait for ray.available_resources() to be updated, which was not necessary before...

ivallesp commented 4 years ago

Coming from: https://github.com/ray-project/ray/issues/4498#issuecomment-638725420

As @GoingMyWay was suggesting, there is some problem related to TensorFlow. For those reproducing this problem in Ray 0.8.5, try switching to PyTorch to see if the issue remains. In my case, it disappeared entirely.

J1810Z commented 4 years ago

I do not think that the problem is related to tensorflow. I imagine that the tensorflow issue might affect timing somehow and thereby affect this issue. In my minimal example from above, I was able to reproduce the issue, without using tensorflow or pytorch at all.

GoingMyWay commented 4 years ago

I do not think that the problem is related to tensorflow. I imagine that the tensorflow issue might affect timing somehow and thereby affect this issue. In my minimal example from above, I was able to reproduce the issue, without using tensorflow or pytorch at all.

Hi, in my case, the memory leak issue is related to tf.function when the input batch size varies.

ivallesp commented 4 years ago

On my side, TensorFlow (2.2.0) is not working well at all. Please find attached a snapshot of my TensorBoard showing the behaviour I found. The grey line is the agent being trained with PyTorch; the other lines, with TensorFlow. Notice that the X axis is relative time. As you can see, with TensorFlow the training slows down a lot as time passes. Memory usage also keeps going up, and I think this is what affects the computing time. With PyTorch it is completely linear.

Notice that I am training a PPO model with the default settings (also default architecture); so there is no experience replay and hence no reason to have increasing memory.

[Screenshot 2020-06-04 at 16 06 03]

ericl commented 4 years ago

Can you file the tensorflow issue as a separate bug with a reproduction script? This issue is about the scheduler hang. Thanks!

ericl commented 4 years ago

@J1810Z in the example you had, does the warning have any adverse effect? I'm able to reproduce the warning by injecting a sleep(10) into the init method of worker.py::Worker, but the following script still eventually runs and finishes.

import ray
import time
ray.init(num_cpus=12)

@ray.remote(num_cpus=1)
class TestActor():
    def __init__(self):
        time.sleep(60)
    def f(self):
        pass

actor_list = []
for i in range(11):  # I am running this example on a 12 core system
    actor_list.append(TestActor.remote())

ray.get([a.f.remote() for a in actor_list])
print("OK")

This PR should make it so we warn less frequently in case of slow worker start: https://github.com/ray-project/ray/pull/8810/files

J1810Z commented 4 years ago

With the warning, startup is significantly slowed down. However, the behavior is quite inconsistent. Sometimes it takes around 30s until the script starts; sometimes I just canceled the run since nothing had happened for several minutes. With the sleep-time workaround, overall startup is much faster.

ericl commented 4 years ago

Hmm, I wonder if there is some other underlying issue with starting workers that is just manifesting this warning as a symptom here. @J1810Z can you reproduce, and when you encounter a "real hang", grab the output of /tmp/ray/session_latest/logs/debug_state.txt? It should include a bunch of stats including the worker pool size:

WorkerPool:
- num PYTHON workers: 8
- num PYTHON drivers: 1

This will help narrow down whether it's an issue with starting workers or in the scheduler. It would also be great to have an entire zip of the session logs directory if possible.
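For example, something along these lines should capture both (a sketch; paths as in the comment above):

import shutil

# Print the debug state, which includes the WorkerPool stats quoted above.
with open("/tmp/ray/session_latest/logs/debug_state.txt") as f:
    print(f.read())

# Archive the whole session logs directory so it can be attached to the issue.
shutil.make_archive("ray_session_logs", "zip", "/tmp/ray/session_latest/logs")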

J1810Z commented 4 years ago

Sorry for the late reply! Yes, I am trying to do that over this weekend.

Schweini-PEK commented 4 years ago

Reporting the same issue from a Berkeley researcher. The code used to run well on Savio until June, without updating any packages. The Ray version was 0.8.2, and 0.8.5 has also been tried. It still works on a MacBook, though. Adding time.sleep() works for me so far.

ericl commented 4 years ago

@Schweini-PEK could you try grabbing the log data mentioned two comments up too?

annaluo676 commented 4 years ago

Got the same error with Amazon SageMaker. I was able to reproduce the issue consistently with the homogeneous scaling part in this notebook example.

Docker image: custom docker built on top of ray 0.8.2 (Dockerfile)
Instance type: ml.p2.8xlarge
Instance count: 2

With num_gpus=15, this is what I got:

Resources requested: 61/64 CPUs, 15/16 GPUs, 0.0/899.46 GiB heap, 0.0/25.68 GiB objects
...
2020-06-25 03:19:40,072#011WARNING worker.py:1058 -- The actor or task with ID ffffffffffffffff45b95b1c0100 is infeasible and cannot currently be scheduled. 
It requires {GPU: 15.000000}, {CPU: 1.000000} for execution and {GPU: 15.000000}, {CPU: 1.000000} for placement, however there are no nodes in the cluster that can provide the requested resources. 
To resolve this issue, consider reducing the resource requests of this task or add nodes that can fit the task.

The experiments worked fine with num_gpus <= 8.

Adding a time.sleep() doesn't solve the issue.

GoingMyWay commented 4 years ago

Got the same error with Amazon SageMaker. I was able to reproduce the issue consistently with the homogeneous scaling part in this notebook example.

Docker image: custom docker built on top of ray 0.8.2 (Dockerfile) instance type: ml.p2.8xlarge instance count: 2

With num_gpus=15, this is what I got:

Resources requested: 61/64 CPUs, 15/16 GPUs, 0.0/899.46 GiB heap, 0.0/25.68 GiB objects
...
2020-06-25 03:19:40,072#011WARNING worker.py:1058 -- The actor or task with ID ffffffffffffffff45b95b1c0100 is infeasible and cannot currently be scheduled. 
It requires {GPU: 15.000000}, {CPU: 1.000000} for execution and {GPU: 15.000000}, {CPU: 1.000000} for placement, however there are no nodes in the cluster that can provide the requested resources. 
To resolve this issue, consider reducing the resource requests of this task or add nodes that can fit the task.

The experiments worked fine with num_gpus <= 8.

Adding a time.sleep() doesn't solve the issue.

Hi, Ms Luo, have you tried ray 0.9 dev?

annaluo676 commented 4 years ago

Got the same error with Amazon SageMaker. I was able to reproduce the issue consistently with the homogeneous scaling part in this notebook example. Docker image: custom docker built on top of ray 0.8.2 (Dockerfile) instance type: ml.p2.8xlarge instance count: 2 With num_gpus=15, this is what I got:

Resources requested: 61/64 CPUs, 15/16 GPUs, 0.0/899.46 GiB heap, 0.0/25.68 GiB objects
...
2020-06-25 03:19:40,072#011WARNING worker.py:1058 -- The actor or task with ID ffffffffffffffff45b95b1c0100 is infeasible and cannot currently be scheduled. 
It requires {GPU: 15.000000}, {CPU: 1.000000} for execution and {GPU: 15.000000}, {CPU: 1.000000} for placement, however there are no nodes in the cluster that can provide the requested resources. 
To resolve this issue, consider reducing the resource requests of this task or add nodes that can fit the task.

The experiments worked fine with num_gpus <= 8. Adding a time.sleep() doesn't solve the issue.

Hi, Ms Luo, have you tried ray 0.9 dev?

Unfortunately I was not able to; I have to reproduce it on SageMaker with the public sagemaker-ray image.

P.S. I cannot reproduce this with CPU resources. For example, with multiple CPU instances, setting num_workers = total_cpus - 1 works fine, with 1 CPU left for the scheduler. In all experiments there is only 1 trial.

ericl commented 4 years ago

That seems expected, since each node only has 8 GPUs, right? So the warning seems to be correct in that case.

The real problem is when it hangs even though the resources are available.

annaluo676 commented 4 years ago

There are two p2.8xlarge instances (instance count: 2), leading to 16 GPUs in total.

ericl commented 4 years ago

If you set num_gpus: 16, that tries to create a single actor with 16 GPUs, which cannot fit on either of those nodes. You would need a p2.16xl in this case.
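As an illustrative sketch (a hypothetical actor, not the SageMaker code): the per-actor request has to fit on a single node, so on p2.8xlarge nodes it can ask for at most 8 GPUs:

import ray

ray.init(address="auto")

# Each p2.8xlarge node has 8 GPUs, so a single actor can request at most 8.
@ray.remote(num_gpus=8)
class Learner:
    def ready(self):
        return True

learner = Learner.remote()
ray.get(learner.ready.remote())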

ericl commented 4 years ago

Closing this issue as it has become confused with other problems. Please open a new bug if a reproduction is possible.

ericl commented 3 years ago

Duplicates https://github.com/ray-project/ray/issues/8326

wuisawesome commented 3 years ago

Wait @ericl did you intentionally close this as a duplicate of itself?

ericl commented 3 years ago

Sorry, it duplicates https://github.com/ray-project/ray/issues/15933