ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] System Halts when processing many high resolution images in a single node. #33244

Closed luxunxiansheng closed 1 year ago

luxunxiansheng commented 1 year ago

What happened + What you expected to happen

I am learning Ray by following the tutorial "Example 3: How to use Ray distributed tasks for image transformation and computation".

The given example will simulate a compute-intensive task by transforming and computing some operations on large high-resolution images. The tasks will perform the following compute-intensive transformations:

- Use PIL APIs to blur the image with a filter intensity
- Use Torchvision random trivial wide augmentation
- Convert images into NumPy arrays and tensors, and do NumPy and torch tensor operations such as transpose and element-wise multiplication with random integers
- Do more exponential tensor power and multiplication with tensors

The problem is that when I set the batch number to, say, 100, the system halts. Only a small part of the tasks finish and most of them fail.
raylet.err.log raylet.out.log

Versions / Dependencies

- Ray: 2.3.0
- Python: 3.9
- OS: CentOS 7.9
- CPUs: 128
- Mem: 131451244 kB

Reproduction script

https://github.com/ray-project/ray-educational-materials/blob/main/Ray_Core/Ray_Core_1_Remote_Functions.ipynb

Issue Severity

High: It blocks me from completing my task.

scv119 commented 1 year ago

It looks like the most likely cause is that the raylet is being starved by the CPU-intensive workloads.
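If raylet starvation is the cause, one thing that might be worth trying is to advertise fewer CPUs to Ray than the machine has, so that some cores stay free for the raylet and other system processes. The following is a minimal sketch, not a confirmed fix: the helper names `worker_cpu_budget` and `init_ray_with_headroom` and the default of 4 reserved cores are assumptions made for illustration. Only `ray.init(num_cpus=...)` is an actual Ray call.

```python
import os


def worker_cpu_budget(total_cpus: int, reserved: int = 4) -> int:
    """Number of CPUs to advertise to Ray, leaving `reserved` cores unscheduled.

    Hypothetical helper; the default of 4 reserved cores is an arbitrary guess.
    Always returns at least 1 so that Ray can still schedule tasks.
    """
    return max(1, total_cpus - reserved)


def init_ray_with_headroom(reserved: int = 4):
    """Start Ray with fewer logical CPUs than the machine has (hypothetical helper)."""
    import ray  # imported lazily so the helper above is usable without Ray installed

    total = os.cpu_count() or 1
    # num_cpus is a logical scheduling resource: it caps how many 1-CPU tasks
    # run concurrently, but it does not pin processes to specific cores.
    return ray.init(num_cpus=worker_cpu_budget(total, reserved))
```

On the 128-CPU server from the report, `worker_cpu_budget(128)` gives 124, so at most 124 single-CPU tasks would run at once. Because `num_cpus` is only a logical limit, tasks that spawn their own threads (for example through NumPy or torch) can still use more cores than that.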

luxunxiansheng commented 1 year ago

Here is another thread on the same issue. It may be worth taking a look: https://discuss.ray.io/t/system-will-be-halted-when-tasks-number-is-large/9754

scv119 commented 1 year ago

@luxunxiansheng Unfortunately, I can't reproduce this issue on my laptop. It is most likely due to some resource limitation specific to the server you are using. Since the discussion is already happening in https://discuss.ray.io/t/system-will-be-halted-when-tasks-number-is-large/9754, let's continue our conversation in that thread.
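To help narrow down a server-specific limit, a quick check of the per-process limits on that machine could look like the sketch below. It uses only Python's standard `resource` module (Unix only). Which limits are relevant here is an assumption, since the thread does not identify one; the `LIMITS` selection and the `report_limits` helper are hypothetical.

```python
import resource

# Limits that could plausibly matter when many worker processes start at once.
# This selection is an assumption, not something confirmed in this issue.
LIMITS = {
    "open files (RLIMIT_NOFILE)": resource.RLIMIT_NOFILE,
    "processes/threads (RLIMIT_NPROC)": resource.RLIMIT_NPROC,
    "address space (RLIMIT_AS)": resource.RLIMIT_AS,
}


def fmt(value: int) -> str:
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def report_limits() -> dict:
    """Return {label: (soft, hard)} for each limit in LIMITS."""
    report = {}
    for label, limit_id in LIMITS.items():
        soft, hard = resource.getrlimit(limit_id)
        report[label] = (fmt(soft), fmt(hard))
    return report


if __name__ == "__main__":
    for label, (soft, hard) in report_limits().items():
        print(f"{label}: soft={soft} hard={hard}")
```

Running this as the same user that starts Ray shows the limits the Ray processes inherit. Comparing the output between the server and a machine where the example works may point to a difference.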