ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
31.97k stars, 5.45k forks

[Data] Ray Data doesn't account for object store memory from object dtypes #44577

Open bveeramani opened 2 months ago

bveeramani commented 2 months ago

What happened + What you expected to happen

I ran a batch inference workload where one of my UDFs returns rows with PIL Images. I observed spilling, and then my node failed.

The reason is that we use pd.DataFrame.memory_usage to compute the size of pandas blocks: https://github.com/ray-project/ray/blob/842bbcf4236e41f58d25058b0482cd05bfe9e4da/python/ray/data/_internal/pandas_block.py#L271-L273

But by default, memory_usage() doesn't include the memory held by values in object-dtype columns, such as PIL images. From the pandas docs:

The + symbol indicates that the true memory usage could be higher, because pandas does not count the memory used by values in columns with dtype=object.

https://pandas.pydata.org/docs/user_guide/gotchas.html#df-memory-usage
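The undercount can be seen with pandas alone. The snippet below is a minimal sketch, not Ray code: `Payload` is a hypothetical stand-in for a PIL image, and the sizes are chosen only for illustration. It compares the default `memory_usage()` result, the `deep=True` result, and the number of bytes the payloads actually hold. To my understanding, `deep=True` relies on `sys.getsizeof` for each element, which does not follow references into the wrapped array, so it is expected to undercount here too.

```python
import numpy as np
import pandas as pd


class Payload:
    """Hypothetical stand-in for a PIL image: a small wrapper around a large buffer."""

    def __init__(self, nbytes):
        self.data = np.zeros(nbytes, dtype=np.uint8)


n_rows = 4
payload_bytes = 1024 * 1024  # 1 MiB per row, chosen only for illustration

df = pd.DataFrame({"data": [Payload(payload_bytes) for _ in range(n_rows)]})

# Default (shallow) accounting: counts only the pointers stored in the column.
shallow = int(df.memory_usage(index=False).sum())

# Deep accounting: adds the per-object size, but not the buffers the objects reference.
deep = int(df.memory_usage(index=False, deep=True).sum())

# Bytes actually held by the payload buffers.
actual = sum(obj.data.nbytes for obj in df["data"])

print(f"shallow={shallow} B, deep={deep} B, actual={actual} B")
```

Both reported figures should come out far below `actual`, which is consistent with the gap between the estimated and observed object store usage described above.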

Versions / Dependencies

842bbcf4236e41f58d25058b0482cd05bfe9e4da

Reproduction script

You'll see estimated object store memory is less than 32 KiB, but you'll observe substantial object spilling.

import numpy as np

import ray

class Object:
    def __init__(self):
        # Each `Object` occupies >1 GiB of object store memory.
        self.data = np.zeros((1024 * 1024 * 1024), dtype=np.uint8)

def generate_data(row):
    return {"data": Object()}

ds = ray.data.range(100).map(generate_data)
for _ in ds.iter_batches(batch_size=None, batch_format="pandas"):
    pass
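For comparison, one way to get an estimate that does follow references is to serialize a sample of the values in each object column and extrapolate. The following is only a sketch under stated assumptions, not how Ray Data computes block sizes: `estimate_dataframe_bytes` and its `sample_size` parameter are hypothetical names, plain `pickle` is used as a proxy for serialized size, and the rows are dicts of NumPy arrays rather than the `Object` class above.

```python
import pickle

import numpy as np
import pandas as pd


def estimate_dataframe_bytes(df, sample_size=10):
    """Hypothetical helper: estimate a DataFrame's size, following object columns.

    Non-object columns use pandas' own accounting. Object columns are
    estimated by pickling an evenly spaced sample and scaling the mean
    serialized size up to the full column length.
    """
    total = 0
    for name in df.columns:
        col = df[name]
        if not pd.api.types.is_object_dtype(col.dtype):
            total += int(col.memory_usage(index=False))
            continue
        if len(col) == 0:
            continue
        # Take roughly `sample_size` evenly spaced values from the column.
        step = max(1, len(col) // sample_size)
        sample = col.iloc[::step]
        sampled_bytes = sum(
            len(pickle.dumps(value, protocol=pickle.HIGHEST_PROTOCOL))
            for value in sample
        )
        total += int(sampled_bytes / len(sample) * len(col))
    return total


n_rows = 8
pixels_per_row = 64 * 1024  # 64 KiB per row, chosen only for illustration

frame = pd.DataFrame(
    {
        "id": np.arange(n_rows),
        "image": [
            {"pixels": np.zeros(pixels_per_row, dtype=np.uint8)}
            for _ in range(n_rows)
        ],
    }
)

reported = int(frame.memory_usage(index=False).sum())
estimated = estimate_dataframe_bytes(frame)
print(f"memory_usage: {reported} B, sampled estimate: {estimated} B")
```

Sampling keeps the cost bounded for large blocks, at the price of accuracy when value sizes vary a lot within a column.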

Issue Severity

High: It blocks me from completing my task.

bveeramani commented 2 months ago

@raulchen FYI

omatthew98 commented 2 months ago

Seems potentially related to #44507.

alexr-anyscale commented 1 month ago

Just dropping this here to remind me/us to follow up with canva in this thread when this gets closed -- I got pinged that the pre-req was closed on the Ray Core side https://github.com/anyscale/product/issues/27659

Details:
User: (stefan-canva)
Channel: external-canva-anyscale
Time: 2024-03-29 7:12:53

Message: Hi Anyscale team! I had this error for the first time just now:

tches(AestheticScorerWorker)))
At least one of the input arguments for this task could not be computed:
ray.exceptions.ObjectFetchTimedOutError: Failed to retrieve object 379ce838985ae5efffffffffffffffffffffffff1400000002000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.

Basically the job running from this workspace (https://console.anyscale.com/o/canva-org/workspaces/expwrk_je8q4pmpmlpkz2bqqj33gxeiu7/ses_sb4axbylswdarsfv64wmsetla9) hung for a couple of minutes just before finishing, and then it threw this error. I used this compute config (https://console.anyscale.com/o/canva-org/workspaces/edit/expwrk_je8q4pmpmlpkz2bqqj33gxeiu7?config=compute-config). I'm now trying a new one with num_cpus=0 on the head node (https://console.anyscale.com/o/canva-org/workspaces/edit/expwrk_je8q4pmpmlpkz2bqqj33gxeiu7?config=compute-config) in the hopes that this might help.

Would be great to get some more info on how this can be prevented! Thanks

rynewang commented 3 weeks ago

Now that https://github.com/ray-project/ray/pull/45071 is merged, Ray Data can use it to get the object size.