Open ScottShingler opened 6 months ago
In my own scenario (uvicorn + FastAPI + single-node Ray), a similar situation occurs. After high-QPS testing, the memory usage of the GCS server did not fall back the way uvicorn's did.
However, I found a solution in this issue: setting export RAY_task_events_max_num_task_in_gcs=100 can significantly reduce the memory usage of GCS.
It is also similar to this issue https://github.com/ray-project/ray/issues/43253
Per https://github.com/ray-project/ray/issues/43253, can you retry with export RAY_task_events_max_num_task_in_gcs=100?
I'll set up a test to run over the weekend and report back.
I also had a similar problem and tried the #43253 solution, but I wasn't really successful. Can you tell me how you solved this issue?
I tried again using the exact variable RAY_task_events_max_num_task_in_gcs=100 as described in #43253, and it really worked for me. On my first try I had set the variable in all capital letters in my environment; I didn't know the name was case sensitive.
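The casing pitfall described above can be checked mechanically. The following is a minimal sketch, not part of Ray: the helper names `env_with_task_event_limit` and `find_miscased_names` are hypothetical, and the only thing taken from this thread is the variable name itself. It builds an environment mapping with the exact name and flags any look-alike that differs only in case.

```python
import os

# Exact, case-sensitive name reported to work in this thread.
TASK_EVENT_LIMIT_VAR = "RAY_task_events_max_num_task_in_gcs"


def env_with_task_event_limit(limit, base_env=None):
    """Return a copy of the environment with the task-event limit set.

    Hypothetical helper for illustration; it does not start or configure Ray.
    """
    env = dict(os.environ if base_env is None else base_env)
    env[TASK_EVENT_LIMIT_VAR] = str(limit)
    return env


def find_miscased_names(env):
    """List variables that match the name only when case is ignored."""
    wanted = TASK_EVENT_LIMIT_VAR.lower()
    return [
        name for name in env
        if name.lower() == wanted and name != TASK_EVENT_LIMIT_VAR
    ]


# Simulate the first attempt from the comment above: an all-caps variable.
env = env_with_task_event_limit(
    100, base_env={"RAY_TASK_EVENTS_MAX_NUM_TASK_IN_GCS": "100"}
)
print(env[TASK_EVENT_LIMIT_VAR])   # 100
print(find_miscased_names(env))    # ['RAY_TASK_EVENTS_MAX_NUM_TASK_IN_GCS']
```

The resulting mapping could be passed as `env=` to whatever process launches Ray. The assumption here is that the variable must be visible to the process that starts gcs_server, which matches the `export` usage shown in this thread.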
Here are the results with export RAY_task_events_max_num_task_in_gcs=100:
It looks like setting that environment variable does curtail the memory usage of the gcs_server process.
Is this environment variable documented anywhere outside of the code? So far I've found the following:
src/ray/gcs/gcs_server/gcs_task_manager.h:
/// When the maximal number of task events tracked specified by
/// `RAY_task_events_max_num_task_in_gcs` is exceeded, older events (approximately by
/// insertion order) will be dropped.
What are the implications of dropping an older event?
src/ray/common/ray_config_def.h:
/// The number of tasks tracked in GCS for task state events. Any additional events
/// from new tasks will evict events of tasks reported earlier.
/// Setting the value to -1 allows for unlimited task events stored in GCS.
RAY_CONFIG(int64_t, task_events_max_num_task_in_gcs, 100000)
Apparently the default value is 100,000. What are the implications for Ray's performance of setting this value orders of magnitude lower, to 100?
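To make the eviction behaviour described in the two header comments concrete, here is a toy model. It is only a sketch of "drop the oldest entry by insertion order once a cap is exceeded" and is not Ray's actual GcsTaskManager; the class name `BoundedTaskEvents` and its fields are hypothetical.

```python
from collections import OrderedDict


class BoundedTaskEvents:
    """Toy insertion-ordered store with a cap; NOT Ray's implementation."""

    def __init__(self, max_tasks):
        # -1 means "unlimited", mirroring the comment in ray_config_def.h.
        self.max_tasks = max_tasks
        self.events = OrderedDict()
        self.dropped = 0

    def record(self, task_id, state):
        # Updating an existing key keeps its original insertion position.
        self.events[task_id] = state
        if self.max_tasks >= 0 and len(self.events) > self.max_tasks:
            # Evict the entry that was inserted first.
            self.events.popitem(last=False)
            self.dropped += 1


store = BoundedTaskEvents(max_tasks=3)
for i in range(5):
    store.record(f"task-{i}", "FINISHED")

print(list(store.events))  # ['task-2', 'task-3', 'task-4']
print(store.dropped)       # 2
```

In this model the cost of a low cap is that state for the earliest tasks is no longer retrievable, while memory stays bounded. Whether Ray's real eviction has further side effects is exactly the open question asked above.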
What happened + What you expected to happen
Running Ray Core using a driver that repeatedly invokes a method on an actor results in a continuous overall increase in the peak RSS memory usage of the ray/core/src/ray/gcs/gcs_server process.
Over the course of ~112 hours, the values of the peaks increased from an initial value of ~595 MB to ~662 MB. That works out to roughly 598 KB per hour.
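The growth rate quoted above can be reproduced from the three reported figures. This short sketch assumes 1 MB = 1000 KB, which is the convention that yields the 598 KB/hour figure; the variable names are illustrative.

```python
# Figures reported in the issue (approximate peak RSS values).
initial_peak_mb = 595.0
final_peak_mb = 662.0
elapsed_hours = 112.0

growth_mb = final_peak_mb - initial_peak_mb           # 67 MB total
rate_kb_per_hour = growth_mb * 1000 / elapsed_hours   # assumes 1 MB = 1000 KB
rate_mb_per_day = growth_mb / elapsed_hours * 24

print(f"{rate_kb_per_hour:.0f} KB/hour")  # 598 KB/hour
print(f"{rate_mb_per_day:.1f} MB/day")    # 14.4 MB/day
```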
This plot and the data used to create it can be downloaded here.
Versions / Dependencies
ray 2.20.0
Reproduction script
I have created a repo that contains the reproduction script, as well as scripts to collect and plot the data: https://github.com/Prolucid/ray-memleak-repro.
Issue Severity
Medium: It is a significant difficulty but I can work around it.