Closed luxunxiansheng closed 1 year ago
Most likely, the raylet is being starved by the CPU-intensive workloads.
Here is another thread on the same issue that may be worth a look: https://discuss.ray.io/t/system-will-be-halted-when-tasks-number-is-large/9754
@luxunxiansheng unfortunately I can't reproduce this issue on my laptop. Based on the discussion in https://discuss.ray.io/t/system-will-be-halted-when-tasks-number-is-large/9754, it is most likely due to some resource limitation specific to the server you are using. Let's continue our conversation in that thread.
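If the raylet is indeed being starved, one thing to try is advertising fewer CPUs to Ray than the machine has, so that a few cores stay free for the raylet and other system processes. The following is only a minimal sketch under that assumption: `worker_cpu_budget`, `init_ray_with_headroom`, and the default of 4 reserved cores are hypothetical names and values chosen for illustration, not a Ray recommendation. The only Ray call used is `ray.init(num_cpus=...)`.

```python
import os


def worker_cpu_budget(total_cpus, reserved=4):
    """Number of CPUs to advertise to Ray, leaving `reserved` cores free.

    `reserved=4` is an arbitrary illustrative value. Always returns at least 1.
    """
    return max(1, total_cpus - reserved)


def init_ray_with_headroom(reserved=4):
    """Start a local Ray instance that schedules on fewer CPUs than exist."""
    import ray  # imported lazily so the helper above works without Ray installed

    total = os.cpu_count() or 1
    # num_cpus caps the CPU resource that Ray schedules tasks against.
    ray.init(num_cpus=worker_cpu_budget(total, reserved))
```

On a 128-CPU machine, `worker_cpu_budget(128)` gives 124, so Ray would schedule at most 124 single-CPU tasks at a time. Whether this is enough headroom for the raylet would have to be checked on the affected server.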
What happened + What you expected to happen
I am learning Ray by following the tutorial "Example 3: How to use Ray distributed tasks for image transformation and computation".
The given example will simulate a compute-intensive task by transforming and computing some operations on large high-resolution images. The tasks will perform the following compute-intensive transformations:
- Use PIL APIs to blur the image with a filter intensity
- Use Torchvision random trivial wide augmentation
- Convert images into numpy arrays and tensors and do numpy and torch tensor operations such as transpose and element-wise multiplication with random integers
- Do more exponential tensor power and multiplication with tensors
The problem is that when I set the batch number to, say, 100, the system halts: only a small part of the tasks finish and most of them fail.
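One way to test whether the halt depends on how many tasks are pending at once is to cap the number of in-flight tasks with `ray.wait`. This is a sketch only, and it assumes the problem is related to the number of outstanding tasks, which this report does not confirm. `run_with_backpressure` and `max_in_flight=16` are hypothetical; `remote_fn` stands for any function decorated with `@ray.remote`, such as the tutorial's image transformation task.

```python
def run_with_backpressure(remote_fn, inputs, max_in_flight=16):
    """Call `remote_fn.remote(x)` for each input while keeping at most
    `max_in_flight` tasks pending at any time.

    Results are returned in completion order, not input order.
    """
    import ray  # imported lazily; requires Ray to be installed and initialized

    pending, results = [], []
    for x in inputs:
        if len(pending) >= max_in_flight:
            # Block until one pending task finishes before submitting another.
            done, pending = ray.wait(pending, num_returns=1)
            results.extend(ray.get(done))
        pending.append(remote_fn.remote(x))
    # Collect whatever is still running after the last submission.
    results.extend(ray.get(pending))
    return results
```

If the run completes with a small `max_in_flight` but halts with a large one, that would point toward resource pressure from concurrent tasks rather than a problem in the transformation code itself.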
raylet.err.log raylet.out.log
Versions / Dependencies
Ray: 2.3.0
Python: 3.9
OS: CentOS 7.9
CPUs: 128
Mem: 131451244 kB
Reproduction script
https://github.com/ray-project/ray-educational-materials/blob/main/Ray_Core/Ray_Core_1_Remote_Functions.ipynb
Issue Severity
High: It blocks me from completing my task.