Open bveeramani opened 2 months ago
@raulchen FYI
Seems potentially related to #44507.
Just dropping this here to remind me/us to follow up with canva in this thread when this gets closed -- I got pinged that the pre-req was closed on the Ray Core side https://github.com/anyscale/product/issues/27659
Details: User: (stefan-canva) Channel: external-canva-anyscale Time: 2024-03-29 7:12:53
Message: Hi Anyscale team! I had this error for the first time just now:
tches(AestheticScorerWorker)))
At least one of the input arguments for this task could not be computed:
ray.exceptions.ObjectFetchTimedOutError: Failed to retrieve object 379ce838985ae5efffffffffffffffffffffffff1400000002000000. To see information about where this ObjectRef was created in
Python, set the environment variable RAY_record_ref_creation_sites=1 during ray start
and `ray.init()`
Basically, the job running from this workspace (https://console.anyscale.com/o/canva-org/workspaces/expwrk_je8q4pmpmlpkz2bqqj33gxeiu7/ses_sb4axbylswdarsfv64wmsetla9) hung for a couple of minutes just before finishing, and then it threw this error. I used this compute config (https://console.anyscale.com/o/canva-org/workspaces/edit/expwrk_je8q4pmpmlpkz2bqqj33gxeiu7?config=compute-config), and I'm now trying a new one with num_cpus=0 on the head node (https://console.anyscale.com/o/canva-org/workspaces/edit/expwrk_je8q4pmpmlpkz2bqqj33gxeiu7?config=compute-config) in the hope that this might help.
It would be great to get some more info on how this can be prevented! Thanks
Now that https://github.com/ray-project/ray/pull/45071 is merged, Data can use it to get the object size.
What happened + What you expected to happen
I ran a batch inference workload where one of my UDFs returns rows with PIL Images. I observed spilling, and then my node failed.
The reason is that we use `pd.DataFrame.memory_usage` to compute the size of pandas blocks: https://github.com/ray-project/ray/blob/842bbcf4236e41f58d25058b0482cd05bfe9e4da/python/ray/data/_internal/pandas_block.py#L271-L273
But `memory_usage()` doesn't include the memory of object dtypes like PIL images: https://pandas.pydata.org/docs/user_guide/gotchas.html#df-memory-usage
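To make the undercount concrete, here is a minimal sketch that does not use Ray or PIL. `FakeImage` is a hypothetical stand-in for a PIL image (a small Python object that owns a large pixel buffer), and `actual_payload_bytes` is an illustrative helper, not part of pandas or Ray. It compares what `memory_usage()` reports for an object-dtype column against the bytes the objects actually hold.

```python
import pandas as pd


class FakeImage:
    """Hypothetical stand-in for a PIL image: a small object owning a large buffer."""

    def __init__(self, num_bytes):
        self.pixels = bytearray(num_bytes)


def actual_payload_bytes(df, column):
    """Illustrative helper: sum the buffer sizes held by the objects in `column`."""
    return sum(len(img.pixels) for img in df[column])


# Four "images" of 1 MiB each, stored in an object-dtype column.
df = pd.DataFrame({"image": [FakeImage(1024 * 1024) for _ in range(4)]})

# Default (shallow) estimate: counts only the object pointers in the column.
shallow = int(df.memory_usage(index=False).sum())

# deep=True asks each object for its own size, which here still excludes
# the separately allocated pixel buffer.
deep = int(df.memory_usage(index=False, deep=True).sum())

payload = actual_payload_bytes(df, "image")

print(f"memory_usage():          {shallow} bytes")
print(f"memory_usage(deep=True): {deep} bytes")
print(f"actual pixel payload:    {payload} bytes")
```

In this sketch both estimates come out far below the roughly 4 MiB the column really holds, which is consistent with the spilling described above. Whether real PIL images behave the same way under `deep=True` depends on how they report their own size, so treat that part as an assumption.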
Versions / Dependencies
842bbcf4236e41f58d25058b0482cd05bfe9e4da
Reproduction script
You'll see estimated object store memory is less than 32 KiB, but you'll observe substantial object spilling.
Issue Severity
High: It blocks me from completing my task.