Ray component: Core: PoolActor processes hanging

What happened + What you expected to happen

I'm attempting to use a hybrid solution of ray.util.multiprocessing.Pool and multiprocessing.Pool. It works exactly as I'd hope for about 19 of every 20 jobs, but on the 1 job where it doesn't our Python script hangs at the Ray pool.map step. When I look at running processes I can see that, along with my hung Python script, there is what appears to be one or more hung Ray PoolActors.

It's worth mentioning that my script runs thousands of processes which each complete within less than a second. I split those processes into batches (one per node) and then use Ray to send those batches to the nodes. On the nodes I use multiprocessing.Pool to actually run the processes. I found that using ray.util.multiprocessing.Pool to manage all of the processes across the nodes was slower than just using multiprocessing.Pool on a single node.

Versions / Dependencies

Ray Version: 1.12.0 Python Version: 3.6.8 OS: Oracle Linux 7.6

Reproduction script

To run correctly, it requires two files. The logging helps me identify hung jobs (shorter log file).

File 1

import logging
import datetime
import ray
from ray.util.multiprocessing import Pool
from import_file import batch_function
ray.init(address="auto")

logging.basicConfig(level=logging.INFO)

def setup_rootlogger(dir, name):
    root_logger = logging.getLogger()
    root_logger.setLevel(logging.INFO)
    fh = logging.FileHandler('{}/{}.{}.log'.format(dir, name, datetime.datetime.now().strftime('%Y-%m-%d_%H%M')))
    fh.setLevel(logging.INFO)
    fh.setFormatter(logging.Formatter('%(asctime)s %(levelname)s %(name)s [%(filename)s:%(lineno)d] %(message)s'))
    root_logger.addHandler(fh)

def batches(l, n):
    return_l = []
    for i in range(0, n):
        return_l.append(l[i::n])
    return return_l

setup_rootlogger('/home/ray_experiment/logs', 'monitor')
processes = range(6074)   
cpus = int(ray.cluster_resources()['CPU'])  
processes = batches(processes, cpus)
logging.info('###### Multiprocessing start ######')

with Pool(cpus) as pool:
    results = pool.map(batch_function, [p for p in processes])

logging.info('###### Multiprocessing complete ######')

File 2 (Referred to as import_file in File 1)

from time import sleep
from multiprocessing import Pool

def lower_function(args):
    sleep(.04)
    sleep(.42)
    return f'Result: {args}'

def batch_function(batch):
    with Pool(40) as pool:
        results = pool.map(lower_function, [p for p in batch])
    return results

Issue Severity

High: It blocks me from completing my task.

I tried to reproduce and got the following logs,

.
.
.
(PoolActor pid=10696)     return _bootstrap._gcd_import(name[level:], package, level)
(PoolActor pid=10696)   File "C:\ProgramData\Anaconda3\envs\ray_dev\lib\site-packages\opentelemetry\trace\__init__.py", line 87, in <module>
(PoolActor pid=10696)     from opentelemetry.trace.propagation import (
(PoolActor pid=10696)   File "C:\ProgramData\Anaconda3\envs\ray_dev\lib\site-packages\opentelemetry\trace\propagation\__init__.py", line 18, in <module>
(PoolActor pid=10696)     from opentelemetry.trace.span import INVALID_SPAN, Span
(PoolActor pid=10696)   File "<frozen importlib._bootstrap>", line 991, in _find_and_load
(PoolActor pid=10696)   File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
(PoolActor pid=10696)   File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
(PoolActor pid=10696)   File "<frozen importlib._bootstrap_external>", line 839, in exec_module
(PoolActor pid=10696)   File "<frozen importlib._bootstrap_external>", line 934, in get_code
(PoolActor pid=10696)   File "<frozen importlib._bootstrap_external>", line 1033, in get_data
(PoolActor pid=10696) MemoryError
(PoolActor pid=10696) Exception ignored in: <module 'collections.abc' from 'C:\\ProgramData\\Anaconda3\\envs\\ray_dev\\lib\\collections\\abc.py'>
(PoolActor pid=10696) Traceback (most recent call last):
(PoolActor pid=10696)   File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
(PoolActor pid=10696) MemoryError:
. # hangs forever

On attempting to close the Anaconda Command Prompt I get the following as well,

2022-05-18 11:05:27,395 WARNING worker.py:1416 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffa7de780e66e673df12b4ca1401000000 Worker ID: 825e3d0546161fc54d79cbb8e195624a1e91cc1841f9ead3434e2e78 Node ID: d6fa6e09a90e7376079f95fcb3aece6e4cbe2ed84148c29a80a0bfe1 Worker IP address: 127.0.0.1 Worker port: 54967 Worker PID: 8444
2022-05-18 11:05:27,426 WARNING worker.py:1416 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff7fe4019892c4d2ea88d5b68f01000000 Worker ID: 1f07346af3c3555aaca9cfe52cebd4141ca0d752f622d76dbff57fad Node ID: d6fa6e09a90e7376079f95fcb3aece6e4cbe2ed84148c29a80a0bfe1 Worker IP address: 127.0.0.1 Worker port: 54979 Worker PID: 8900
2022-05-18 11:05:27,473 WARNING worker.py:1416 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff7627135d6bf7cfb8416a2e6f01000000 Worker ID: 8a9a446d735c4b9652cc324e2c5c6dd1c5465db9d4dcc79f34b4f1da Node ID: d6fa6e09a90e7376079f95fcb3aece6e4cbe2ed84148c29a80a0bfe1 Worker IP address: 127.0.0.1 Worker port: 55024 Worker PID: 10696
2022-05-18 11:05:27,552 WARNING worker.py:1416 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff770350ea455dc7a4ccca51c501000000 Worker ID: 02e14ab00fa30381ee64d0e3c46d74b6a0b8611a144d7d44a9743f74 Node ID: d6fa6e09a90e7376079f95fcb3aece6e4cbe2ed84148c29a80a0bfe1 Worker IP address: 127.0.0.1 Worker port: 54991 Worker PID: 988
2022-05-18 11:05:27,598 WARNING worker.py:1416 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffdf6cb6786f416f953083735f01000000 Worker ID: 5347cf46d5fabb99232cd55af0129372cfe1fed2941ed4a0e22f3c00 Node ID: d6fa6e09a90e7376079f95fcb3aece6e4cbe2ed84148c29a80a0bfe1 Worker IP address: 127.0.0.1 Worker port: 55018 Worker PID: 8048
2022-05-18 11:05:27,661 WARNING worker.py:1416 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffa913a054a2794b26826a5c0e01000000 Worker ID: 0ddcd06b5607e087770ecc74147dc8743023c1393f20a9a7b002c0d7 Node ID: d6fa6e09a90e7376079f95fcb3aece6e4cbe2ed84148c29a80a0bfe1 Worker IP address: 127.0.0.1 Worker port: 54995 Worker PID: 10516
2022-05-18 11:05:27,723 WARNING worker.py:1416 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff758f1524d260d86889ddcc1701000000 Worker ID: 3c13cdca59b3832c718d74770a4c982a6ebeb29d6772fe82bf3c88a3 Node ID: d6fa6e09a90e7376079f95fcb3aece6e4cbe2ed84148c29a80a0bfe1 Worker IP address: 127.0.0.1 Worker port: 54981 Worker PID: 6376
2022-05-18 11:05:27,899 WARNING worker.py:1416 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff39960da0662b140eb253556a01000000 Worker ID: baf42f6d89e7b13afe3f4cc83f250c327da637d35d003c68fb2d74b4 Node ID: d6fa6e09a90e7376079f95fcb3aece6e4cbe2ed84148c29a80a0bfe1 Worker IP address: 127.0.0.1 Worker port: 54992 Worker PID: 7400
Traceback (most recent call last):
  File "hang.py", line 32, in <module>
    results = pool.map(batch_function, [p for p in processes])
  File "c:\users\gagan\ray_project\ray\python\ray\util\multiprocessing\pool.py", line 844, in map
    return self._map_async(
  File "c:\users\gagan\ray_project\ray\python\ray\util\multiprocessing\pool.py", line 351, in get
    raise result
  File "c:\users\gagan\ray_project\ray\python\ray\util\multiprocessing\pool.py", line 258, in run
    batch = ray.get(ready_id)
  File "c:\users\gagan\ray_project\ray\python\ray\_private\client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "c:\users\gagan\ray_project\ray\python\ray\worker.py", line 1845, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
        class_name: PoolActor
        actor_id: 39960da0662b140eb253556a01000000
        pid: 7400
        namespace: cc5cea44-3e21-4d21-a868-156c767063b9
        ip: 127.0.0.1
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR_EXIT

I changed with Pool(40) as pool: to with Pool(8) as pool: and added a debug print print(processes) after processes = batches(processes, cpus) and I got the following output,

2022-05-19 07:54:57,623 INFO services.py:1478 -- View the Ray dashboard at http://127.0.0.1:8265
processes:  [range(0, 6074, 8), range(1, 6074, 8), range(2, 6074, 8), range(3, 6074, 8), range(4, 6074, 8), range(5, 6074, 8), range(6, 6074, 8), range(7, 6074, 8)] 8
INFO:root:###### Multiprocessing start ######
INFO:root:###### Multiprocessing complete ######

The code didn't hang.

I did some calculations. So, you are launching a pool first with 8 processes in the beginning. I am referring to with Pool(cpus) as pool:. Now in the import_file, you launch another pool in each of these 8 pools. Kind of nested pooling. So, each of these nested pool now start 40 processes. I am referring to with Pool(40) as pool:. So, overall total number of processes started is 40 * 8 = 320. If that's the case then I am not surprised if the script hangs, especially on Windows. In addition, when 320 processes are there then average load process is 6074/320 ~ 19. If we replace with Pool(40) as pool: with with Pool(8) as pool: then total number of processes comes down to 64 and average load per process increases to 6074/64 ~ 95. So, may be try to increase the average load per processes so that we need lesser number of processes to be launched to get the job down.

P.S. Please feel free to let me know if I made a mistake in analysing your problem.

Thanks, I appreciate your insight. I'm using three Linux nodes, not Windows, and I restrict Ray to be able to use one CPU per node.

ray start -v --head --port=6379 --num-cpus=1 --include-dashboard=false ray start -v --address=$RAY_HEAD_IP:6379 --num-cpus=1

Being so, the math for my tests don't line up with what you've shown, but I will play with those settings and see if I can eliminate the hung processes.

Something to consider. I ran my script for about five days with a cronjob kicking it off once every five minutes. I only saw two memory errors in the logs even though the job failed many times. In addition, my script runs successfully more than 90% of the time. It's just the occasional script that hangs.

Hi, I'm a bot from the Ray team :)

To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the 14 days, the issue will be closed!

If you'd like to keep the issue open, just leave any comment, and the stale label will be removed!
If you'd like to get more attention to the issue, please tag one of Ray's contributors.

You can always ask for help on our discussion forum or Ray's public slack channel.

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!

ray-project / ray