ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[P0][Bug] Memory leak in ray head #21016

Closed · scv119 closed this issue 2 years ago

scv119 commented 2 years ago

Search before asking

Ray Component

Ray Core

What happened + What you expected to happen

https://discuss.ray.io/t/memory-leak-in-ray-head/4381

Versions / Dependencies

N/A

Reproduction script

N/A

Anything else

No response

Are you willing to submit a PR?

rkooo567 commented 2 years ago

This is the original issue https://discuss.ray.io/t/memory-leak-in-ray-head/4381/3

I'd like to ask you some questions regarding the issue.

(1) As @scv119 asked in the original thread, is it possible for you to track per-process CPU/memory usage over time to see which processes are the offenders? (e.g., one possible way to do this is https://github.com/ray-project/ray/blob/f04ee71dc7ba1e4a84b8bda41d55d9bd6cebc7d7/python/ray/_private/test_utils.py#L809). Alternatively, you can run a job that continuously collects per-process CPU/memory usage on the head node.
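For example, a rough per-process monitoring sketch, assuming psutil is installed; the interval and output format are illustrative and not the helper linked above:

```python
# Rough per-process monitor, assuming psutil is installed. Logs the top
# memory consumers on the node at a fixed interval so offenders stand out.
import time
import psutil

def log_process_usage(interval_s=600):
    while True:
        snapshot = []
        for proc in psutil.process_iter(["pid", "name"]):
            try:
                cpu = proc.cpu_percent(interval=None)  # relative to last call
                rss_mb = proc.memory_info().rss / (1024 ** 2)
                snapshot.append((rss_mb, cpu, proc.info["pid"], proc.info["name"]))
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                continue
        for rss_mb, cpu, pid, name in sorted(snapshot, reverse=True)[:10]:
            print(f"{time.strftime('%F %T')} pid={pid} {name} "
                  f"rss={rss_mb:.1f}MB cpu={cpu:.1f}%")
        time.sleep(interval_s)

if __name__ == "__main__":
    log_process_usage()
```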

(2) Can you tell us your Ray version, Python version, machine spec (CPU & memory resources), and operating system? Is it happening on all nodes or only the head node?

(3) Lastly, the ideal way to fix this is for us to reproduce the issue on our end; that will help us fix it as quickly as possible. Can you:

(4) Do you use the dashboard? If so, does starting Ray without the dashboard (ray start --include-dashboard=False) alleviate the issue? (I am aware of one issue from the dashboard, and I'd like to see if it is the root cause of your issue.)
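For completeness, a minimal sketch of the programmatic equivalent, if you start Ray from Python rather than via the CLI:

```python
# Minimal sketch: start Ray with the dashboard disabled (programmatic
# equivalent of passing --include-dashboard=False to `ray start`).
import ray

ray.init(include_dashboard=False)
print(ray.cluster_resources())  # quick sanity check that the node is up
ray.shutdown()
```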

(if you prefer to answer in person, that also works)

sandratatarevicova commented 2 years ago

(1) As @scv119 asked in the original thread, is it possible for you to track per-process CPU/memory usage over time to see which processes are the offenders?

I have created a script that logs the CPU and memory usage of the head node every 10 minutes. The head node has now been running for one day, and the log shows that the Redis processes consume the most CPU and memory, and that their usage is still increasing. I will send you the complete log on Slack.

(2) Can you tell us your Ray version, Python version, machine spec (CPU & memory resources), and operating system? Is it happening on all nodes or only the head node?

  • Ray version: 1.8.0 (we had the same problem on Ray 1.6.0)
  • Python version: 3.9.2
  • Machine spec: the head pod has a memory limit of 8 GB and a CPU limit of 2 CPUs
  • Operating system: all Ray nodes run in Docker containers in GKE; the Docker image is based on debian:bullseye-slim

We use autoscaling and the only persistent nodes are the head node and the operator node. The operator node does not seem to have this issue. Worker pods usually don’t run for more than 1 hour as they are removed after 5 minutes of inactivity, so we don’t know if they have the same issue, but they probably don't, because they don't have Redis.

(3) Lastly, the ideal way to fix this is for us to reproduce the issue on our end; that will help us fix it as quickly as possible. Can you:

  • ideally provide the "repro script" that doesn't have external dependencies?

We have created a repro script: https://gist.github.com/jakub-valenta/6bc918e147c08e64f1b4fd1076f5f272

If you run this script for some time, you will see a growing number of records in the Redis database, which will eventually consume all the memory assigned to the head pod. From our investigation, it seems that those records are never deleted.

Running the clean.py script releases all the leaked memory, but it breaks the cluster because some of those records are still needed.
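A rough way to watch the Redis key count grow on the head node; the host, port, and password below are assumptions (Ray 1.x defaults shown) and must be adjusted to match your deployment:

```python
# Rough sketch for watching Redis key growth on the head node.
# Assumes the redis-py client is installed; connection details are assumed
# defaults and need to match your cluster's GCS Redis instance.
import time
import redis

r = redis.Redis(host="127.0.0.1", port=6379, password="5241590000000000")

while True:
    # A key count that grows monotonically over days suggests records
    # (e.g. function table entries) are never cleaned up.
    print(time.strftime("%F %T"), "dbsize =", r.dbsize())
    time.sleep(600)  # sample every 10 minutes
```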

  • is it possible to precisely describe your workload? For example, how many nodes you use and what library / actor & task patterns you have.

We are using Ray to process weather forecast data. The data is typically distributed in GRIB format, with sizes up to 1 GB per step, and we need to convert it to our internal format. Forecasts are computed in 1-3 hour steps up to 10 days ahead (about 140 steps per forecast run). Each forecast step is processed as one Ray job, and the job is just a simple function. Both the source data and the computation results are stored in a shared filestore or downloaded from a remote source, so we pass only a small amount of data through RPC.

Also, we process data from multiple sources (about 20). Depending on the forecast resolution, we use differently sized workers, and custom resources are used for job scheduling. All data sources are processed in parallel; it is the same codebase with different configurations, and there are about 20 "clients" scheduling jobs to the same cluster.
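A rough illustration of the pattern described above; the resource name, function body, and source name are placeholders, not the actual production code:

```python
# Rough illustration only: one forecast step processed as a simple Ray task,
# with a custom resource ("large_worker" is a made-up name) used to steer
# the task onto an appropriately sized worker pod.
import ray

ray.init(address="auto")  # connect to the existing cluster

@ray.remote(resources={"large_worker": 1})
def process_step(source: str, step: int) -> str:
    # The real workload converts one GRIB step to an internal format; inputs
    # and outputs live on shared storage, so only a small status string
    # travels back over RPC.
    return f"{source}-step-{step}-done"

# About 140 steps per forecast run, per the description above.
refs = [process_step.remote("some-source", step) for step in range(140)]
print(ray.get(refs)[:3])
```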

  • tell us what your cluster is doing? Is memory usage increasing while you are running a workload, or has the cluster been idle?

Weather forecasts are typically updated 4 times a day, so there are 4 spikes when the cluster is under heavy load, but for the rest of the day it is almost idle.

  • Lastly, can you tell us what you've observed (how much of a memory usage increase you've seen) and what the expected behavior is?

The memory usage grows to 8 GB (which is our memory limit on the Kubernetes pod). The main problem is not the memory usage itself, since we can assign more than 8 GB to the head node. The problem is that the cluster becomes unstable (jobs time out, some jobs never run, etc.) and we have to restart the head node every ~2 days.

(4) Do you use the dashboard? If so, does starting Ray without the dashboard (ray start --include-dashboard=False) alleviate the issue? (I am aware of one issue from the dashboard, and I'd like to see if it is the root cause of your issue.)

We initially suspected the dashboard, because its CPU consumption was also pretty high, so we have already disabled it and are not using it now. Unfortunately, it was not the root cause, as the issue still persists. An interesting detail is that the Redis database contains some dashboard records even when the dashboard is disabled.

rkooo567 commented 2 years ago

If you run this script for some time, you will see a growing number of records in the Redis database, which will eventually consume all the memory assigned to the head pod. From our investigation, it seems that those records are never deleted.

This is actually a known issue (the function table is not GC'ed). We were planning to take a look at this next quarter to fix it; we can consider it an even higher priority since it seems to cause memory issues in the cluster. cc @scv119

Besides, I wonder why you have so many Ray client server processes. It seems like each of them uses 0.5% of memory, and 3 of them already exceed the memory usage of Redis (which is 1.4%).

rkooo567 commented 2 years ago

Also, is there a special reason why the head node only uses 2 CPUs?

The head node already runs 4~5 Ray processes, each of which probably uses more than a couple of threads, so 2 CPUs is probably too few (even if you don't schedule any Ray tasks on the head node). In our CI, we use at least 4~8 CPUs on a head node, IIRC.

scv119 commented 2 years ago

Yeah, sounds like we should prioritize the fix. Let's sync on this topic.

ericl commented 2 years ago

It sounds like we should close this and prioritize the known job cleanup issue; does that seem right?

rkooo567 commented 2 years ago

Yeah. I've been syncing with him through Slack (he's running the job and tracking memory usage now), and it seems like it's the Redis memory issue with the function table (i.e., the function table is not GC'ed).

I am asking him again now whether he's observing the same thing (it's been 3 days since he started his cluster).

rkooo567 commented 2 years ago

I think we can close this and focus on the function table work.

rkooo567 commented 2 years ago

Let me temporarily re-open this as P1 until the customer confirms the function table is the root cause (they said it was 2 days ago, but they have been running the cluster for 2 additional days). Then we can close it as a duplicate.

rkooo567 commented 2 years ago

Okay, confirmed: it is a duplicate of https://github.com/ray-project/ray/issues/8822. cc @ericl, is it possible to prioritize this problem next quarter?

Also, they are seeing the DASHBOARD_AGENT_PORT_PREFIX key count increasing, but it is a relatively small leak (about 50 KB over 3 days), so I am not sure it needs to be handled urgently.

ericl commented 2 years ago

Yes, we should have enough time reserved to tackle these P1s.


John-Almardeny commented 11 months ago

I am using the latest Ray==2.6, and there is a massive increase in memory on the head and worker nodes that accumulates over time. Even if we leave the system to cool down for hours without sending it any job requests, the memory does not go down.

rkooo567 commented 11 months ago

@John-Almardeny can you try Ray 2.7? We fixed one major memory leak bug in Ray 2.7 that was caused by gRPC regressions. Also cc @jjyao to follow up.

John-Almardeny commented 11 months ago

@John-Almardeny can you try Ray 2.7? We fixed one major memory leak bug in Ray 2.7 that was caused by gRPC regressions. Also cc @jjyao to follow up.

@rkooo567 I switched to the latest Ray==2.7.2. The memory accumulation rate seems to be lower than before, but the memory still does not go down on the head node (or the other worker nodes) if we leave the system to cool down for hours without sending it any job requests.

rkooo567 commented 11 months ago

@John-Almardeny what's the rate of the leak, and if it is severe, could you create a new issue with a repro script? Also note that Ray stores lots of data in memory, so some degree of memory growth is not unexpected (e.g., whenever you schedule an actor, the head node stores its metadata; once there are more than 10K metadata entries, we delete the oldest dead-actor metadata).
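A hedged way to check how much actor metadata the GCS is still tracking, assuming a recent Ray release where the ray.util.state API is available:

```python
# Hedged sketch (assumes Ray's ray.util.state API is available): count how
# many actor metadata entries, including dead ones, the GCS still tracks.
import ray
from ray.util.state import list_actors

ray.init(address="auto")
actors = list_actors(limit=20000)  # raise the default limit to see everything
dead = [a for a in actors if a.state == "DEAD"]
print(f"tracked actors: {len(actors)}, dead: {len(dead)}")
```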

John-Almardeny commented 11 months ago

@rkooo567

The scenario unfolds as follows:

  1. We created 300 Actors.
  2. Data is transmitted to each Actor at regular intervals.
  3. The Actor conducts calculations and subsequently returns the results.
  4. There is no additional creation or removal of Actors.

Initially, the Head Node starts with a memory consumption of a few hundred units. After creating the Actors, this consumption increases slightly, as you previously explained (which is not problematic). As we send data to and receive results from the Actors (Steps 2 and 3), the memory consumption of the Head Node increases (which is also acceptable).

The issue is that the Head Node does not release the memory when there is no communication with the Ray system. It should cool down and release the held memory. In other words, even when we stop sending data to the system, the memory utilization of the Head Node remains high after a few hours and fails to decrease.

The scripts on the worker nodes should be irrelevant, as the memory leak is on the Head Node, especially since no scripts run on the Head Node.
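A minimal sketch of the reported pattern under the assumptions above (300 long-lived actors, periodic payloads, no actor churn); the payload and computation are placeholders, not the original application code:

```python
# Minimal sketch of the reported pattern: 300 long-lived actors receive
# payloads at intervals, compute, and return results; head-node memory is
# expected to level off once traffic stops.
import ray

ray.init(address="auto")

@ray.remote
class Worker:
    def compute(self, payload):
        # Placeholder calculation; the real workload is application-specific.
        return sum(payload)

actors = [Worker.remote() for _ in range(300)]

for _ in range(100):  # data sent at regular intervals, no actor churn
    results = ray.get([a.compute.remote(list(range(1000))) for a in actors])

# After this point no further requests are sent; the expectation is that
# head-node memory usage flattens and eventually comes back down.
```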