ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] Possible memory leak in gcs_server #45338

Open ScottShingler opened 6 months ago

ScottShingler commented 6 months ago

What happened + What you expected to happen

Running Ray Core using a driver that repeatedly invokes a method on an actor results in a continuous overall increase in the peak RSS memory usage of the ray/core/src/ray/gcs/gcs_server process:

[Plot: peak RSS of the gcs_server process over the duration of the run]

Over the course of ~112 hours, the peak values increased from ~595 MB to ~662 MB, which works out to roughly 598 KB per hour.

This plot and the data used to create it can be downloaded here.
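
For readers who don't want to clone the reproduction repository (linked below), the workload is essentially of the following shape. This is a simplified sketch, not the actual script; the actor and method names here are made up:

import ray

ray.init()

@ray.remote
class Worker:
    def ping(self) -> int:
        return 1

worker = Worker.remote()

# Call the actor method in a tight loop; the growth shows up in the peak RSS
# of the gcs_server process, not in the driver or the actor worker.
while True:
    ray.get(worker.ping.remote())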

Versions / Dependencies

ray 2.20.0

Reproduction script

I have created a repo that contains the reproduction script, as well as scripts to collect and plot the data: https://github.com/Prolucid/ray-memleak-repro.

Issue Severity

Medium: It is a significant difficulty but I can work around it.

lmsh7 commented 5 months ago

In my own scenario of uvicorn + FastAPI + single-node Ray, a similar situation occurs: after high-QPS load testing, the gcs_server's memory did not fall back the way uvicorn's did.
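
For context, that setup is roughly of the following shape (a hypothetical sketch; no code was posted in the thread, and the handle task and endpoint are made up). Each request submits a Ray task, so high-QPS load generates a large volume of task events for the GCS to track:

import ray
from fastapi import FastAPI

ray.init()
app = FastAPI()

@ray.remote
def handle(payload: dict) -> dict:
    return {"ok": True, **payload}

@app.get("/work")
async def work():
    # Ray ObjectRefs are awaitable, so the task result can be awaited directly.
    return await handle.remote({"source": "fastapi"})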

lmsh7 commented 5 months ago

However, I found a possible solution in a related issue: export RAY_task_events_max_num_task_in_gcs=100 can significantly reduce the memory usage of the GCS.

lmsh7 commented 5 months ago

This also looks similar to https://github.com/ray-project/ray/issues/43253.

rynewang commented 5 months ago

Per https://github.com/ray-project/ray/issues/43253 can you retry with export RAY_task_events_max_num_task_in_gcs=100?
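
For anyone retrying this with a cluster launched from the driver itself, one way to apply the variable is shown below. This is an assumption about the launch method; if the cluster is started with ray start --head, the variable needs to be exported in that shell instead:

import os

# Must be visible to the process that starts gcs_server, i.e. set it before
# ray.init() launches the local cluster. The name is case sensitive.
os.environ["RAY_task_events_max_num_task_in_gcs"] = "100"

import ray
ray.init()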

ScottShingler commented 5 months ago

I'll set up a test to run over the weekend and report back.

azevedo-f commented 5 months ago

I also had a similar problem and tried the #43253 solution, but I wasn't really successful. Can you tell me how you solved this issue?

azevedo-f commented 5 months ago

I tried again using the exact same variable RAY_task_events_max_num_task_in_gcs=100 as described in #43253, and it really worked for me. I had been writing the variable in all capital letters in my environment before; I didn't know it was case sensitive on my first try.

ScottShingler commented 5 months ago

Here are the results with export RAY_task_events_max_num_task_in_gcs=100:

[Plot: peak RSS of the gcs_server process with RAY_task_events_max_num_task_in_gcs=100]

It looks like setting that environment variable does curtail the memory usage of the gcs_server process.

Is this environment variable documented anywhere outside of the code? So far I've found the following:

src/ray/gcs/gcs_server/gcs_task_manager.h:

/// When the maximal number of task events tracked specified by
/// `RAY_task_events_max_num_task_in_gcs` is exceeded, older events (approximately by
/// insertion order) will be dropped.

What are the implications of dropping an older event?

src/ray/common/ray_config_def.h:

/// The number of tasks tracked in GCS for task state events. Any additional events
/// from new tasks will evict events of tasks reported earlier.
/// Setting the value to -1 allows for unlimited task events stored in GCS.
RAY_CONFIG(int64_t, task_events_max_num_task_in_gcs, 100000)

Apparently the default value is 100,000. What are the implications for Ray's performance when setting this value orders of magnitude lower, to 100?
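
Not an authoritative answer, but from the comments quoted above these are task state events, which (as far as I understand) back the dashboard and the ray.util.state API rather than scheduling. If that reading is right, the main cost of a low limit is observability: older finished tasks stop being queryable. A rough way to see the eviction, assuming the variable was applied before the cluster started:

import ray
from ray.util.state import list_tasks

ray.init()

@ray.remote
def noop(i: int) -> int:
    return i

# Run more tasks than the configured limit.
ray.get([noop.remote(i) for i in range(1_000)])

# With RAY_task_events_max_num_task_in_gcs=100, only roughly the most recent
# 100 task events should still be returned here; older ones are evicted.
print(len(list_tasks(limit=10_000)))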