ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
33.31k stars 5.64k forks

[Data] `read_json` reads whole file into memory #46485

Open bveeramani opened 2 months ago

bveeramani commented 2 months ago

What happened + What you expected to happen

I'm trying to create a Ray Dataset from a 300 GB JSONL file for offline inference. I expected Ray Data to read the file in a streaming fashion, but instead observed Ray Data loading the whole file into memory.
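To make "streaming fashion" concrete, here is a minimal sketch of the kind of bounded-memory read I had in mind. It is plain standard-library Python, not Ray Data's implementation; the helper name `iter_jsonl_batches` and its `batch_size` parameter are hypothetical and exist only for illustration.

```python
import io
import json
from typing import Iterator, List, TextIO


def iter_jsonl_batches(stream: TextIO, batch_size: int) -> Iterator[List[dict]]:
    """Yield lists of parsed records, holding at most batch_size rows at once.

    Hypothetical helper for illustration; not part of Ray Data.
    """
    batch: List[dict] = []
    for line in stream:  # text streams iterate lazily, one line at a time
        line = line.strip()
        if not line:
            continue  # skip blank lines
        batch.append(json.loads(line))
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # emit the final, possibly smaller, batch


# Small in-memory stand-in for a large JSONL file.
sample = io.StringIO('{"id": 0}\n{"id": 1}\n\n{"id": 2}\n')
batches = list(iter_jsonl_batches(sample, batch_size=2))
```

With a real file, passing the object returned by `open(path)` instead of `sample` keeps peak memory proportional to `batch_size` rather than to the file size, provided the batches are consumed one at a time instead of being collected into a list.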

Versions / Dependencies

a46f8aed884136eaa1347edc10ad55c9e5bcd650

Reproduction script

...

Issue Severity

High: It blocks me from completing my task.

BabyChouSr commented 2 months ago

Strong +1! This would be super helpful :)

Superskyyy commented 2 months ago

I think this is how existing Ray Data read tasks work, but we are working on support for streaming (potentially unbounded) data sources in Ray Data.