ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Ray Core Hangs After Seconds of Parallel Execution #46396

Closed: muazhari closed this issue 3 months ago

muazhari commented 4 months ago

What happened + What you expected to happen

Ray Core hangs after a few seconds of execution. On plain Windows it hangs. On Windows with Docker under WSL 2, it raises an error.

It simply stops making progress and CPU utilization drops. [screenshot]

Expected behavior: the run completes without errors and produces correct results.

Versions / Dependencies

ray==2.31.0
pymoo==0.6.1.1

Windows 11
WSL 2.2.4.0
Docker Desktop 4.31.1 (153621)

Reproduction script

The real code is more complex and private; it is approximated by the following:

from pymoo.core.variable import Binary, Choice, Integer, Real
from pymoo.core.problem import LoopedElementwiseEvaluation, RayParallelization
from pymoo.algorithms.moo.nsga2 import RankAndCrowding
from pymoo.core.mixed import MixedVariableGA
from pymoo.optimize import minimize

from pymoo.core.problem import ElementwiseProblem

import ray

ray.shutdown()
ray.init(dashboard_host="0.0.0.0")
res = ray.available_resources()
print(res)

class MultiObjectiveMixedVariableProblem(ElementwiseProblem):

    def __init__(self, **kwargs):
        vars = {
            "b": Binary(),
            "x": Choice(options=["nothing", "multiply"]),
            "y": Integer(bounds=(-2 * 10 ** 5, 2 * 10 ** 5)),
            "z": Real(bounds=(-5 * 10 ** 3, 5 * 10 ** 3)),
        }
        super().__init__(vars=vars, n_obj=6, n_ieq_constr=0, **kwargs)

    def _evaluate(self, X, out, *args, **kwargs):
        b, x, z, y = X["b"], X["x"], X["z"], X["y"]
        f1 = z ** 2 + y ** 2
        f2 = (z + 2) ** 2 + (y - 1) ** 2
        f3 = (z ** 2) / 2 + (y + 1)
        f4 = -z ** 2
        f5 = z ** 2
        f6 = z / 2 - y - y / z

        if b:
            f2 = 100 * f2

        if x == "multiply":
            f2 = 10 * f2

        out["F"] = [f1, f2, f3, f4, f5, f6]

runner = RayParallelization(
    job_resources={
        "num_gpus": int(res["GPU"]),
        "num_cpus": int(res["CPU"]),
    }
)

problem = MultiObjectiveMixedVariableProblem(elementwise_runner=runner)

algorithm = MixedVariableGA(
    survival=RankAndCrowding()
)

res = minimize(
    problem,
    algorithm,
    verbose=True,
    seed=1
)
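
A pymoo-free variant of the same submission pattern may also be useful to tell whether the hang is in Ray Core itself or in the pymoo runner; a minimal sketch (the task body is illustrative only):

import ray

ray.shutdown()
ray.init(dashboard_host="0.0.0.0")

@ray.remote
def evaluate(x):
    # stand-in for one elementwise objective evaluation
    return [x ** 2, (x + 2) ** 2]

# one small task per candidate solution, mirroring a generation of evaluations
futures = [evaluate.remote(i) for i in range(1000)]
print(len(ray.get(futures)), "results")
ray.shutdown()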

Issue Severity

High: It blocks me from completing my task.

mattip commented 3 months ago

Could you copy-paste the text output from the console? I am curious what print(res) is showing. What machine is this running on?

Maybe this is connected to #37373, which we never did completely track down. If indeed this machine has many cores, could you try limiting the resources with ray.init(num_cpus=N), starting at N=4, and slowly increasing it until you see the problem?
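
Something along these lines might work for that bisection (the specific N values are just an illustration):

import ray

# Start with a small logical CPU budget and raise it step by step until the
# hang (if any) reappears; the rest of the reproduction script stays the same.
for n in (4, 8, 16, 24, 32):
    ray.shutdown()
    ray.init(num_cpus=n, dashboard_host="0.0.0.0")
    print("testing with num_cpus =", n, ray.available_resources())
    # ... run the minimize() call from the reproduction script here ...
    ray.shutdown()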

muazhari commented 3 months ago

Currently, the code runs without errors on all cores on Ubuntu WSL, Docker WSL, and Windows. However, on Windows and Docker WSL, Task Manager does not report the real resource usage. Ubuntu WSL also differs from the rest: Docker WSL and Windows appear to use all cores and complete execution in 48 seconds, while Ubuntu WSL is stuck at around 50% core utilization, too slow for me to wait for the execution to complete. Keep in mind that monitoring is a problem too, because it is not reliable.

muazhari commented 3 months ago

Update for Ubuntu WSL:

If I change the code so the Ray resource options are not supplied directly, as in this version:

from pymoo.core.variable import Binary, Choice, Integer, Real
from pymoo.core.problem import RayParallelization
from pymoo.algorithms.moo.nsga2 import RankAndCrowding
from pymoo.core.mixed import MixedVariableGA
from pymoo.optimize import minimize

from pymoo.core.problem import ElementwiseProblem

import ray

ray.shutdown()
ray.init(dashboard_host="0.0.0.0")
res = ray.available_resources()
print(res)

class MultiObjectiveMixedVariableProblem(ElementwiseProblem):

    def __init__(self, **kwargs):
        vars = {
            "b": Binary(),
            "x": Choice(options=["nothing", "multiply"]),
            "y": Integer(bounds=(-2 * 10 ** 5, 2 * 10 ** 5)),
            "z": Real(bounds=(-5 * 10 ** 3, 5 * 10 ** 3)),
        }
        super().__init__(vars=vars, n_obj=6, n_ieq_constr=0, **kwargs)

    def _evaluate(self, X, out, *args, **kwargs):
        b, x, z, y = X["b"], X["x"], X["z"], X["y"]
        f1 = z ** 2 + y ** 2
        f2 = (z + 2) ** 2 + (y - 1) ** 2
        f3 = (z ** 2) / 2 + (y + 1)
        f4 = -z ** 2
        f5 = z ** 2
        f6 = z / 2 - y - y / z

        if b:
            f2 = 100 * f2

        if x == "multiply":
            f2 = 10 * f2

        out["F"] = [f1, f2, f3, f4, f5, f6]

class OptimizationProblemRunner:
    def __init__(self):
        pass

    def __call__(self, f, X):
        # Take the underlying function of the bound __call__ so it can be
        # wrapped as a Ray task, then pass `f` back in explicitly as the first
        # argument; note that no resource options are requested here.
        runnable = ray.remote(f.__call__.__func__)
        futures = [runnable.remote(f, x) for x in X]
        return ray.get(futures)

    def __getstate__(self):
        state = self.__dict__.copy()
        return state

# runner = RayParallelization(
#     job_resources={
#         "num_gpus": int(res["GPU"]),
#         "num_cpus": int(res["CPU"]),
#     }
# )

runner = OptimizationProblemRunner()

problem = MultiObjectiveMixedVariableProblem(elementwise_runner=runner)

algorithm = MixedVariableGA(
    survival=RankAndCrowding()
)

res = minimize(
    problem,
    algorithm,
    verbose=True,
    seed=1
)

With this runner it behaves the same as the rest, with a similar execution time. However, the monitoring is still invalid. [screenshot]
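
For reference, the num_cpus / num_gpus values passed to Ray per task are logical scheduling reservations, not measured limits: a task that reserves every CPU can only run one at a time, while one-CPU tasks can run in parallel. A minimal sketch of the two behaviours (the cluster size and task bodies are illustrative):

import ray

ray.init(num_cpus=8)  # illustrative local cluster with 8 logical CPUs

@ray.remote(num_cpus=8)   # reserves the whole cluster, so tasks serialize
def whole_machine_task(i):
    return i

@ray.remote(num_cpus=1)   # default-sized reservation, up to 8 run at once
def small_task(i):
    return i

ray.get([whole_machine_task.remote(i) for i in range(8)])  # runs one by one
ray.get([small_task.remote(i) for i in range(8)])          # runs in parallel
ray.shutdown()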

mattip commented 3 months ago

I'm not sure I understand. The problem being described here is that the process eventually runs to completion, but along the way the monitoring suggests the process is not executing anything, even though it is? Do I understand the problem correctly?

This might be due to an imbalance in the resource allocation: there are 24 physical cores (32 logical processors) but only 32 GB of memory. The processes need to share that limited memory, and on Windows Ray must spawn new processes and copy memory to them, since Windows has no fork (which can share memory between processes).
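
As a side note on sharing memory between Ray worker processes: large read-only inputs can be put into the object store once and passed to tasks by reference, so each spawned worker does not need its own private copy. A minimal sketch (the array and task are illustrative):

import numpy as np
import ray

ray.init()

big_array = np.zeros((10_000, 1_000))  # illustrative large read-only input
big_ref = ray.put(big_array)           # stored once in the shared object store

@ray.remote
def column_sum(arr, i):
    # the ObjectRef argument is resolved to the stored array inside the task
    return float(arr[:, i].sum())

print(ray.get([column_sum.remote(big_ref, i) for i in range(4)]))
ray.shutdown()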

muazhari commented 3 months ago

The code can be executed on the Ubuntu WSL, Docker WSL, and Windows instances, but it can't be monitored correctly in Task Manager. So we can't plainly deduce anything from Task Manager alone.

If there were a resource deficit, the execution should crash and no instance would complete. However, on the Docker WSL and Windows instances the code executes to completion. The Ubuntu WSL instance can still execute it, but abnormally slowly, and I have not yet verified that it completes. Oddly, this abnormally slow execution is sometimes (not always) resolved by not supplying the resource option arguments directly to ray.remote.

mattip commented 3 months ago

I think everything is working as it should. Task Manager is very limited in its reporting. Can you try other tools? Ray comes with a dashboard, or you can use Resource Monitor to get a more fine-grained view into what is going on.
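
For example, comparing Ray's cluster totals with what is currently available shows how many logical CPUs Ray itself thinks are reserved, independent of Task Manager (a rough sketch; the polling loop is arbitrary):

import time
import ray

ray.init(dashboard_host="0.0.0.0")

# cluster_resources() is the total; available_resources() is what is free now.
# These are Ray's logical reservations, not physical utilization percentages.
for _ in range(5):
    total = ray.cluster_resources().get("CPU", 0)
    free = ray.available_resources().get("CPU", 0)
    print(f"logical CPUs in use: {total - free:.0f} / {total:.0f}")
    time.sleep(2)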

I made this suggestion above, did you try it?

If indeed this machine has many cores, could you try limiting the resources with ray.init(num_cpus=N), starting at N=4, and slowly increasing it until you see the problem?

muazhari commented 3 months ago

I made this suggestion above, did you try it?

Umm, even without reducing the number of cores, the code executes without problems (except for the monitoring and resource option argument cases). So I can't reach an upper bound at which the problem appears.

muazhari commented 3 months ago

Can you try other tools? Ray comes with a dashboard, or you can use Resource Monitor to get a more fine-grained view into what is going on.

On the Ubuntu WSL instance (the code version without the direct resource option arguments), the Ray dashboard shows CPU usage surging to around 64% at the start but staying at around 12% for most of the run until the end. The code still executes to completion in around 34 seconds.

[screenshot]

muazhari commented 3 months ago

I think everything is working as it should

Yes, for now, except for the monitoring and the resource option argument. I think some dependencies were changed or fixed, which is related to it no longer crashing.

muazhari commented 3 months ago

Is this statement true?

The CPU utilization above is not always near 100% because of a RAM deficit.

It is a bit weird if the RAM deficit alone explains it, because at other times, with a similar initial state, the CPU utilization stays near 100% until execution completes.

mattip commented 3 months ago

Maybe reducing the number of cores will allow the true CPU utilization to appear in the monitoring tools, since it should no longer be masked by memory bandwidth problems (if indeed that is what is going on).

muazhari commented 3 months ago

Maybe reducing the number of cores will allow the true CPU utilization to appear in the monitoring tools, since it should no longer be masked by memory bandwidth problems (if indeed that is what is going on).

Okay then. For now we can conclude that the hangs/crashes can't be reproduced and that the utilization problem is due to my RAM deficit.

mattip commented 3 months ago

I will close this for now; feel free to reopen or open a new issue if you have more info.