ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Ray Java] Memory leak in object store when Java application invokes a remote Python function #45675

Closed yucai closed 2 months ago

yucai commented 4 months ago

What happened + What you expected to happen

Reproduction steps:

  1. Compile and run the provided JavaCallPython class.
    java -classpath /mount/data/yucai/ray-demo-1.0-SNAPSHOT-jar-with-dependencies.jar -Dray.address=127.0.0.1:6379 -Dray.job.code-search-path=/mount/data/yucai/ JavaCallPython
  2. Observe the memory usage in the object store.

Expected Behavior: The memory usage should remain stable over time, without significant increases that lead to disk spilling.

Actual Behavior: Object store memory usage increases continuously each time the remote Python function do_test is called, and objects are eventually spilled to disk.

Versions / Dependencies

Ray: 2.9.0

Reproduction script

import io.ray.api.ObjectRef;
import io.ray.api.Ray;
import io.ray.serve.api.Serve;
import io.ray.serve.deployment.Application;
import io.ray.serve.generated.DeploymentLanguage;
import io.ray.serve.handle.DeploymentHandle;

/*
app@nsfw-ray-yucai-head-lf78j:/mount/data/yucai$ cat test_python_deployment.py
from ray import serve

@serve.deployment
class Counter(object):
    def __init__(self, value):
        self.value = int(value)

    def do_test(self, data):
        return str(self.value)
 */
public class JavaCallPython {

    public static void main(String[] args) throws Exception {
        Serve.start(null);

        Application deployment = Serve.deployment()
                .setLanguage(DeploymentLanguage.PYTHON)
                .setName("JavaCallPython")
                .setDeploymentDef("test_python_deployment.Counter")
                .setNumReplicas(1)
                .bind(28);
        DeploymentHandle handle = Serve.run(deployment).get();

        for (int i = 0; i < 1000; i++) {
            StringBuilder sb = new StringBuilder(20000000);
            for (int n = 0; n < 20000000; n++) {
                sb.append('a');
            }
            String data = sb.toString();

            System.out.println(handle.method("do_test").remote(data).result());
            data = null;

            System.out.println("iter: " + i);
            Thread.sleep(500);
        }
    }
}
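
For a sense of scale, the loop above sends a 20,000,000-character string in each of 1000 calls. The sketch below estimates how much data would pile up if no payload were ever released. It assumes each 'a' takes one byte (its UTF-8 size); how Ray actually serializes the string across languages is not stated in this issue, so treat the result as a rough lower bound. The class and method names (LeakEstimate, utf8BytesPerChar, totalBytes) are hypothetical helpers, not part of any Ray API.

```java
import java.nio.charset.StandardCharsets;

public class LeakEstimate {

    // Number of bytes needed to encode a single character in UTF-8
    // (1 for ASCII characters such as 'a').
    static int utf8BytesPerChar(char c) {
        return String.valueOf(c).getBytes(StandardCharsets.UTF_8).length;
    }

    // Total bytes accumulated if every call's payload stays in the store.
    static long totalBytes(long bytesPerCall, int calls) {
        return bytesPerCall * calls;
    }

    public static void main(String[] args) {
        int charsPerCall = 20_000_000; // payload length used in the reproduction script
        int calls = 1000;              // loop iterations in the reproduction script

        // Assumption: one UTF-8 byte per character, no serialization overhead.
        long bytesPerCall = (long) charsPerCall * utf8BytesPerChar('a');
        long total = totalBytes(bytesPerCall, calls);

        System.out.println("per call: " + bytesPerCall + " bytes");
        System.out.println("all calls, if never released: " + total + " bytes");
    }
}
```

Under these assumptions, each unreleased payload adds about 20 MB, so an object store of any typical size would fill and start spilling well before the loop finishes.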

Issue Severity

High: It blocks me from completing my task.

yucai commented 4 months ago

[image attachment, not preserved]

jjyao commented 4 months ago

@edoakes is java calling python something we still maintain?

edoakes commented 4 months ago

> @edoakes is java calling python something we still maintain?

It's not in active development, but we should fix critical bugs such as memory leaks. I would classify this as P1.

yucai commented 4 months ago

Dear @edoakes and @jjyao, we are currently undertaking a POC with Ray that is of critical importance to our company's AI platform strategy. We have encountered this significant blocking issue that is impacting our production application which utilizes Ray.

To address this, we've submitted a pull request with a proposed workaround: https://github.com/ray-project/ray/pull/45729.

We would greatly appreciate it if you could review the PR at your earliest convenience. Your expertise would be invaluable in helping us resolve this issue.

CC: @anyscalesam, @kevin85421

jjyao commented 4 months ago

@yucai thanks for the contribution. The PR is merged. I think the memory leak still exists, but it no longer blocks you, correct?

anyscalesam commented 3 months ago

@yucai can you confirm the above? cc @jjyao

EDIT: Jul '24 > should we downgrade priority for this since it's no longer blocking?

anyscalesam commented 2 months ago

Closing. Please reopen if this is still relevant, @yucai.