ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] SpillWorker and RestoreWorker have high memory usage #33165

Closed: leixm closed this issue 1 year ago

leixm commented 1 year ago

What happened + What you expected to happen

I deployed a 5*4C20G Ray cluster on Kubernetes and used the Ray Data API to do some map and groupby operations. When my job ends and the client closes the session, I find that the memory usage of the SpillWorker and RestoreWorker processes is still high: a single SpillWorker process uses more than 5 GB, and the SpillWorker/RestoreWorker processes are never killed. Is this normal? Does Ray support killing IO worker processes that have been IDLE for a long time?

[Screenshot: memory usage]

Versions / Dependencies

Ray Version: 2.2.0, 2.3.0
Python: 3.7.16
OS: Ubuntu 20.04.5 LTS

Reproduction script

import ray
from ray.data.aggregate import Max

def generate_data():
    # Roughly 9 MB of string data per row, to force object spilling.
    return 1000000 * "test_data"

ray.init()
ds = ray.data.range(10000).map(lambda x: {"key": x % 100, "value": x, "extra_col": generate_data()})
ds.groupby("key").aggregate(Max("value")).show(1)
ray.shutdown()

Issue Severity

High: It blocks me from completing my task.

leixm commented 1 year ago

When I look at /dev/shm, the usage does not drop:

Filesystem      Size  Used Avail Use% Mounted on
overlay         745G  568G  177G  77% /
tmpfs            64M     0   64M   0% /dev
tmpfs            63G     0   63G   0% /sys/fs/cgroup
tmpfs            63G   17G   47G  27% /dev/shm

When I run the pmap command on a SpillWorker process:

pmap -x 5037 | grep plasma
00007f0dfcca0000 17485184 12296128 12296128 rw-s- plasmawdebze (deleted)
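
For reference (not part of the original report): the rw-s "plasma" mapping shown by pmap is the shared-memory object store file that every Ray worker maps, so its resident pages appear in each worker's RSS even though there is only one copy backed by /dev/shm. A minimal sketch for totaling just those plasma mappings from /proc/<pid>/smaps, assuming a Linux node and that the mapping name contains "plasma" as in the output above:

import sys

def plasma_rss_kb(pid):
    """Sum the resident size of plasma shared-memory mappings for one process (Linux only)."""
    total_kb = 0
    in_plasma_mapping = False
    with open(f"/proc/{pid}/smaps") as f:
        for line in f:
            if "plasma" in line:
                # Mapping header line, e.g. "... rw-s ... /plasmawdebze (deleted)".
                in_plasma_mapping = True
            elif in_plasma_mapping and line.startswith("Rss:"):
                total_kb += int(line.split()[1])  # smaps reports sizes in kB
                in_plasma_mapping = False
    return total_kb

if __name__ == "__main__":
    pid = int(sys.argv[1])
    print(f"plasma mappings resident in PID {pid}: {plasma_rss_kb(pid) / 1024 / 1024:.2f} GB")
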
leixm commented 1 year ago

cc @ericl Can you take a look? Thank you so much.

ericl commented 1 year ago

One thing to watch out for is shared vs total memory. The shared memory object store's memory is reported as memory usage for all worker processes, so you have to subtract this out (e.g., RSS - SHR) to get the actual memory overhead per worker.

If you do this subtraction, does the spill/restore worker memory still seem high?
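
Not from the original thread: a minimal sketch of that RSS - SHR calculation using /proc/<pid>/statm, which reports resident and shared page counts (Linux only). Run it against the SpillWorker/RestoreWorker PIDs from ps or the Ray dashboard:

import os
import sys

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")  # bytes per page, typically 4096

def private_rss_gb(pid):
    """Approximate per-process overhead as resident minus shared pages (RSS - SHR)."""
    with open(f"/proc/{pid}/statm") as f:
        # statm fields: size resident shared text lib data dt (all in pages)
        _, resident, shared = (int(v) for v in f.read().split()[:3])
    return (resident - shared) * PAGE_SIZE / 1024 ** 3

if __name__ == "__main__":
    for pid in sys.argv[1:]:
        print(f"PID {pid}: {private_rss_gb(int(pid)):.2f} GB private (RSS - SHR)")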

leixm commented 1 year ago

cc @rickyyx When running large data sets it seems very easy to hit OOM; do you have a better suggestion? The following is a task that uses about 2 GB of memory but cannot run successfully on a worker node with 80 GB of memory.

OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: xx.xx.xx.xx, ID: 9a332a2af973dc171e677d3a480c9fa07bbad59a95791185de9f22d4) where the task (task ID: 56530841d51ccb005252a8242ec172e3002da16002000000, name=map, pid=30391, memory used=1.92GB) was running was 77.51GB / 83.82GB (0.924687), which exceeds the memory usage threshold of 0.9. Ray killed this worker (ID: a957fc481f2478c25078bedeebf7fc40d6a0bd5c25efc55b6f5d8598) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip xx.xx.xx.xx`. To see the logs of the worker, use `ray logs worker-a957fc481f2478c25078bedeebf7fc40d6a0bd5c25efc55b6f5d8598*out -ip xx.xx.xx.xx. Top 10 memory users:
PID   MEM(GB) COMMAND
4286  9.09    ray::IDLE_SpillWorker
488   5.48    /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=...
11009 4.46    ray::IDLE_SpillWorker
10771 2.92    ray::IDLE_SpillWorker
30391 1.92    ray::map
9661  1.21    ray::IDLE_RestoreWorker
9660  1.21    ray::IDLE_RestoreWorker
4172  0.98    ray::IDLE_SpillWorker
59    0.21    /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/s...

I found that a job on the same data set usually succeeds the first time, but easily fails on the second run, or many tasks get OOM-killed.
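
Not part of the original thread: the 0.9 in the error above is Ray's memory monitor threshold, which together with the monitor's refresh interval can be adjusted through environment variables read when the raylet starts (variable names per Ray's OOM-prevention docs for these versions). A minimal single-node sketch; on a Kubernetes cluster the variables would go into the pod environment so that `ray start` inherits them:

import os
import ray

# The memory monitor settings are read when the raylet starts, so set them
# before ray.init() here, or in the pod / `ray start` environment on a real cluster.
os.environ["RAY_memory_usage_threshold"] = "0.95"    # kill threshold as a fraction of node memory (default 0.9)
os.environ["RAY_memory_monitor_refresh_ms"] = "250"  # sampling interval; "0" disables the memory monitor entirely

ray.init()

Raising the threshold only buys headroom; it does not change how much memory the workload itself uses.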

leixm commented 1 year ago

One thing to watch out for is shared vs total memory. The shared memory object store's memory is reported as memory usage for all worker processes, so you have to subtract this out (e.g., RSS - SHR) to get the actual memory overhead per worker.

If you do this subtraction, does the spill/restore worker memory still seem high?

Thanks for your reply. I checked the spill worker and restore worker; when the job ends, RSS - SHR is not a large value.

rickyyx commented 1 year ago

cc @rickyyx When running large data sets it seems very easy to hit OOM; do you have a better suggestion? The following is a task that uses about 2 GB of memory but cannot run successfully on a worker node with 80 GB of memory.

OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: xx.xx.xx.xx, ID: 9a332a2af973dc171e677d3a480c9fa07bbad59a95791185de9f22d4) where the task (task ID: 56530841d51ccb005252a8242ec172e3002da16002000000, name=map, pid=30391, memory used=1.92GB) was running was 77.51GB / 83.82GB (0.924687), which exceeds the memory usage threshold of 0.9. Ray killed this worker (ID: a957fc481f2478c25078bedeebf7fc40d6a0bd5c25efc55b6f5d8598) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip xx.xx.xx.xx`. To see the logs of the worker, use `ray logs worker-a957fc481f2478c25078bedeebf7fc40d6a0bd5c25efc55b6f5d8598*out -ip xx.xx.xx.xx. Top 10 memory users:
PID   MEM(GB) COMMAND
4286  9.09    ray::IDLE_SpillWorker
488   5.48    /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=...
11009 4.46    ray::IDLE_SpillWorker
10771 2.92    ray::IDLE_SpillWorker
30391 1.92    ray::map
9661  1.21    ray::IDLE_RestoreWorker
9660  1.21    ray::IDLE_RestoreWorker
4172  0.98    ray::IDLE_SpillWorker
59    0.21    /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/s...

leixm commented 1 year ago

I looked carefully at the monitoring and the worker process status on the worker node, and found that the high usage is mainly caused by cache memory; Ray's OOM killer counts cache memory as well (when cgroups are enabled). Raylet, SpillWorker, and RestoreWorker also occupy some memory, but the proportion is not large and is within the normal range. I don't think this is a bug, so this issue can be closed. Thank you for your patience and kindness @ericl @rickyyx.
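
Not part of the original comment: a minimal sketch for checking how much of the cgroup's charged memory is cache versus RSS, assuming cgroup v1 mounted at /sys/fs/cgroup/memory inside the pod (paths and file names differ under cgroup v2):

# cgroup v1 memory controller mount; adjust the path for your pod or for cgroup v2.
CGROUP = "/sys/fs/cgroup/memory"

with open(f"{CGROUP}/memory.usage_in_bytes") as f:
    usage = int(f.read())

stats = {}
with open(f"{CGROUP}/memory.stat") as f:
    for line in f:
        key, value = line.split()
        stats[key] = int(value)

gb = 1024 ** 3
print(f"charged usage: {usage / gb:.2f} GB")
print(f"cache:         {stats['cache'] / gb:.2f} GB")  # page cache, including tmpfs/shmem pages
print(f"rss:           {stats['rss'] / gb:.2f} GB")    # anonymous memory held by processes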

rickyyx commented 1 year ago

Ray's OOM killer counts cache memory as well (when cgroups are enabled).

Is this memory counting confusing for you?

When running large data sets it seems very easy to hit OOM; do you have a better suggestion? The following is a task that uses about 2 GB of memory but cannot run successfully on a worker node with 80 GB of memory.

So did the original OOM problem ^ happen because of the task's excessive memory usage?