Open jrrpanix opened 1 year ago
It seems you are using Ray Client, which needs to transfer the large object to the remote cluster for ray.put.
Can you avoid using Ray Client (it is not recommended) and use Ray Jobs (https://docs.ray.io/en/releases-2.6.1/cluster/running-applications/job-submission/index.html) instead?
@jjyao, can you elaborate on how ray.put works in client mode? Does it put data into the object store of the local node, or does it send the data to a node of the connected cluster?
@jjyao, @rkooo567, just a friendly reminder.
In client mode, there is a server called the proxy server. When you call ray.put on your local machine, it sends a gRPC request to the proxy server, and the proxy server stores the object using ray.put (within the cluster). The crash usually happens because the local -> remote gRPC request cannot handle a large object, iiuc.
also cc @rynewang for more details.
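If the limit really is on the size of a single local -> remote request, one workaround to try is splitting the DataFrame into smaller pieces before calling ray.put on each. The sketch below only plans the row ranges. The helper name `plan_row_chunks` and the 500 MB per-chunk budget are assumptions for illustration (picked because the report says sizes under 1GB work), not anything from Ray itself, and this has not been verified against the crash.

```python
from typing import Iterator, Tuple


def plan_row_chunks(
    n_rows: int, bytes_per_row: int, max_chunk_bytes: int
) -> Iterator[Tuple[int, int]]:
    """Yield (start, stop) row ranges whose payload stays within max_chunk_bytes.

    Hypothetical helper: it only does the arithmetic and does not call Ray.
    """
    if bytes_per_row <= 0:
        raise ValueError("bytes_per_row must be positive")
    if max_chunk_bytes < bytes_per_row:
        raise ValueError("max_chunk_bytes must hold at least one row")
    rows_per_chunk = max_chunk_bytes // bytes_per_row
    for start in range(0, n_rows, rows_per_chunk):
        yield start, min(start + rows_per_chunk, n_rows)


# 8_000_000 rows x 70 float64 columns, as in the report; 8 bytes per value.
chunks = list(plan_row_chunks(8_000_000, 70 * 8, 500_000_000))
print(len(chunks), chunks[0], chunks[-1])
```

Each `(start, stop)` pair could then be used as `ray.put(df.iloc[start:stop])`, giving a list of object refs instead of one. Whether this actually avoids the segfault in client mode is an assumption that would need to be tested on the affected cluster.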
@rkooo567 is this a gRPC bug? It should not segfault despite large requests, no?
What happened + What you expected to happen
The following 2 lines of code crash on a cluster. If the size is < 1GB, it works.
crashes
    df = pd.DataFrame(np.random.randn(8_000_000, 70))
    dref = ray.put(df)
    Segmentation fault (core dumped)
ok
    df = pd.DataFrame(np.random.randn(1_000_000, 70))
    dref = ray.put(df)
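For reference, the raw sizes of the two frames can be estimated with simple arithmetic. This is a rough sketch that counts only the float64 values (np.random.randn returns float64, 8 bytes each) and ignores the pandas index and any serialization overhead; `frame_bytes` is a name made up for this example.

```python
BYTES_PER_FLOAT64 = 8  # np.random.randn produces float64 values


def frame_bytes(rows: int, cols: int) -> int:
    """Approximate payload size of a rows x cols float64 frame, in bytes."""
    return rows * cols * BYTES_PER_FLOAT64


crashing = frame_bytes(8_000_000, 70)
working = frame_bytes(1_000_000, 70)
print(f"crashing case: {crashing / 1e9:.2f} GB")
print(f"working case:  {working / 1e9:.2f} GB")
```

So the crashing case is about 4.48 GB of data and the working case about 0.56 GB, which is consistent with the "< 1GB works" observation above.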
Versions / Dependencies
ClientContext(dashboard_url='100.72.104.139:8265', python_version='3.9.7', ray_version='2.4.0', ray_commit='4479f66d4db967d3c9dd0af2572061276ba926ba', protocol_version='2022-12-06', _num_clients=2, _context_to_restore=<ray.util.client._ClientContext object at 0x7f71e488e2b0>)
{'node:100.72.104.139': 1.0, 'CPU': 28.0, 'object_store_memory': 77168374578.0, 'memory': 257698037760.0, 'node:100.72.110.238': 1.0}
uname -v
46~20.04.1-Ubuntu SMP Wed Jul 19 15:40:00 UTC 2023
Reproduction script
It only crashes on a remote cluster; it works locally.
    import ray
    import pandas as pd
    import numpy as np

    context = ray.init(address="ray://10.247.3.57:10001")
    print(context)
    df = pd.DataFrame(np.random.randn(8_000_000, 70))
    dref = ray.put(df)  # <- crashes here segmentation fault
Issue Severity
Medium: It is a significant difficulty but I can work around it.