ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] Large ray.put fails/crashes #38713

Open jrrpanix opened 1 year ago

jrrpanix commented 1 year ago

What happened + What you expected to happen

The following two lines of code crash on a cluster. If the size is < 1GB, it works.

crashes

df = pd.DataFrame(np.random.randn(8_000_000, 70))
dref = ray.put(df)
Segmentation fault (core dumped)

ok

df = pd.DataFrame(np.random.randn(1_000_000, 70))
dref = ray.put(df)
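
For scale, the raw data in the two frames can be estimated with simple arithmetic. This is a rough sketch: it assumes 8 bytes per value (np.random.randn returns float64) and ignores the pandas index and any serialization overhead. The helper name `raw_size_bytes` is made up for illustration.

```python
# Back-of-the-envelope size of the raw float64 data in each DataFrame.
# Assumption: 8 bytes per value; index and serialization overhead are ignored.
BYTES_PER_FLOAT64 = 8


def raw_size_bytes(rows: int, cols: int) -> int:
    """Return the number of bytes needed for rows x cols float64 values."""
    return rows * cols * BYTES_PER_FLOAT64


crashing = raw_size_bytes(8_000_000, 70)  # the frame that segfaults
working = raw_size_bytes(1_000_000, 70)   # the frame that works

print(f"crashing frame: {crashing / 1e9:.2f} GB")  # 4.48 GB
print(f"working frame:  {working / 1e9:.2f} GB")   # 0.56 GB
```

Under these assumptions the working frame is about 0.56 GB, consistent with the "< 1GB" observation, and the crashing frame is about 4.48 GB.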

Versions / Dependencies

ClientContext(dashboard_url='100.72.104.139:8265', python_version='3.9.7', ray_version='2.4.0', ray_commit='4479f66d4db967d3c9dd0af2572061276ba926ba', protocol_version='2022-12-06', _num_clients=2, _context_to_restore=<ray.util.client._ClientContext object at 0x7f71e488e2b0>)
{'node:100.72.104.139': 1.0, 'CPU': 28.0, 'object_store_memory': 77168374578.0, 'memory': 257698037760.0, 'node:100.72.110.238': 1.0}

uname -v

46~20.04.1-Ubuntu SMP Wed Jul 19 15:40:00 UTC 2023

Reproduction script

It only crashes on a remote cluster; it works locally.

import ray
import pandas as pd
import numpy as np

context = ray.init(address="ray://10.247.3.57:10001")
print(context)
df = pd.DataFrame(np.random.randn(8_000_000, 70))
dref = ray.put(df)  # <- crashes here: segmentation fault

Issue Severity

Medium: It is a significant difficulty but I can work around it.

jjyao commented 1 year ago

It seems you are using Ray Client, which has to transfer the large object to the remote cluster for ray.put.

Can you avoid using Ray Client (it's not recommended) and use Ray Jobs (https://docs.ray.io/en/releases-2.6.1/cluster/running-applications/job-submission/index.html) instead?
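
A minimal sketch of what the Ray Jobs route could look like, assuming the cluster's dashboard is reachable over HTTP and that numpy and pandas are installed on the cluster nodes. The script name `put_large.py` and the helper `submit_put_job` are hypothetical; `JobSubmissionClient` and `submit_job` come from the Ray Jobs Python SDK. The driver runs inside the cluster, so the DataFrame is created there and is not sent through Ray Client.

```python
import textwrap
from pathlib import Path

# Driver code that runs on the cluster as a Ray job. Inside a job,
# ray.init() attaches to the already running cluster.
DRIVER_SCRIPT = textwrap.dedent(
    """\
    import numpy as np
    import pandas as pd
    import ray

    ray.init()
    df = pd.DataFrame(np.random.randn(8_000_000, 70))
    dref = ray.put(df)  # executed on a cluster node, not via Ray Client
    print(dref)
    """
)


def submit_put_job(dashboard_address: str, working_dir: str) -> str:
    """Write the driver into working_dir and submit it as a Ray job.

    Returns the job ID reported by the cluster.
    """
    # Imported lazily so this module can be loaded without Ray installed.
    from ray.job_submission import JobSubmissionClient

    # Hypothetical file name; working_dir is uploaded to the cluster.
    Path(working_dir, "put_large.py").write_text(DRIVER_SCRIPT)

    client = JobSubmissionClient(dashboard_address)
    return client.submit_job(
        entrypoint="python put_large.py",
        runtime_env={"working_dir": working_dir},
    )


# Example call (address taken from the dashboard_url in the report above):
# job_id = submit_put_job("http://100.72.104.139:8265", "./job_dir")
```

Using a small dedicated directory for `working_dir` keeps the upload to the cluster limited to the driver script.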

YarShev commented 11 months ago

@jjyao, can you elaborate on how ray.put works in client mode? Does it put the data into the object store of the local node, or does it send the data to a node of the connected cluster?

YarShev commented 9 months ago

@jjyao, @rkooo567, just a friendly reminder.

rkooo567 commented 9 months ago

In Ray Client, there is a server called the proxy server. When you call ray.put on your local machine, the client sends a gRPC request to the proxy server, and the proxy server stores the object by calling ray.put within the cluster. The crash usually happens because the local -> remote gRPC request cannot handle a large object, IIUC.
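
The flow described above can be illustrated with a toy model. This is not Ray's implementation: the classes `ToyObjectStore` and `ToyProxyServer` and the function `client_put` are invented for illustration, pickle stands in for Ray's serialization, and a plain method call stands in for the gRPC request. The point it shows is that the whole serialized object has to travel from the local machine to the server before it is stored.

```python
import pickle


class ToyObjectStore:
    """Stand-in for the cluster's object store (hypothetical)."""

    def __init__(self):
        self._objects = {}
        self._next_id = 0

    def put(self, value):
        ref = self._next_id
        self._next_id += 1
        self._objects[ref] = value
        return ref

    def get(self, ref):
        return self._objects[ref]


class ToyProxyServer:
    """Stand-in for the proxy server that lives inside the cluster."""

    def __init__(self, store):
        self._store = store
        self.bytes_received = 0

    def handle_put(self, payload: bytes):
        # The server receives the full serialized object in the request,
        # deserializes it, and stores it using the cluster-side put.
        self.bytes_received += len(payload)
        value = pickle.loads(payload)
        return self._store.put(value)


def client_put(value, server: ToyProxyServer):
    """Local side: serialize the object and send it in a single request."""
    payload = pickle.dumps(value)
    return server.handle_put(payload)  # stands in for the gRPC call


store = ToyObjectStore()
server = ToyProxyServer(store)
ref = client_put([1.0, 2.0, 3.0], server)
print(store.get(ref), server.bytes_received)
```

In this model the request size grows with the object, which is why a multi-gigabyte DataFrame stresses the local -> remote hop in a way a cluster-side ray.put does not.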

rkooo567 commented 9 months ago

also cc @rynewang for more details.

rynewang commented 9 months ago

@rkooo567 is this a gRPC bug? It should not segfault even for large requests, no?