ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
33.8k stars 5.75k forks source link

[Core] DecodeError when `ray.put` a large (2GB) object #35976

Open messense opened 1 year ago

messense commented 1 year ago

What happened + What you expected to happen

When calling ray.put on a large object (size >= 2GB) in client mode, python process segfaults in protobuf 4.x library. Although it works fine with protobuf 3.20.

I have a fix for the segfault in https://github.com/protocolbuffers/upb/pull/1338, but even with that patch ray.put raise DecodeError from protobuf library so it doesn't make it work on large objects.

To me it seems that ray should implement chunked put in https://github.com/ray-project/ray/blob/609b8e6151c190d1b5f18b2bfb0d2495b63e994e/python/ray/util/client/worker.py#L498-L514

Versions / Dependencies

$ ray --version
ray, version 2.4.0

$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.2 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.2 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

$ python -c 'from google import protobuf; print(protobuf.__version__)'
4.23.2

Reproduction script

from ray.core.generated.ray_client_pb2 import PutRequest, DataRequest

# data size of 2**31 bytes (2GB)
# doesn't error when data size <= 2147483646
req = PutRequest(data=b"\0" * 2147483648)
datareq = DataRequest(put=req)

Code minimized from https://github.com/ray-project/ray/blob/609b8e6151c190d1b5f18b2bfb0d2495b63e994e/python/ray/util/client/worker.py#L480-L514 so it doesn't need to include a call to ray.put.

Issue Severity

Medium: It is a significant difficulty but I can work around it.

rkooo567 commented 1 year ago

cc @ckw017