I think this is how the existing Ray Data read tasks work, but we are working on supporting streaming (potentially unbounded) data sources in Ray Data.
What happened + What you expected to happen
I'm trying to create a Ray Dataset from a 300 GB JSONL file for offline inference. I expected Ray Data to read the file in a streaming fashion, but instead observed Ray Data loading the whole file into memory.
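To make "streaming fashion" concrete, here is a minimal sketch in plain Python of reading a JSONL file in bounded batches. It is not Ray Data's implementation and is only meant to illustrate the expected memory behavior. The names `iter_jsonl_batches` and `batch_size` are hypothetical and chosen for illustration.

```python
import json
from typing import Any, Iterator, List


def iter_jsonl_batches(path: str, batch_size: int = 1024) -> Iterator[List[Any]]:
    """Yield lists of parsed records, buffering at most `batch_size` rows at a time.

    Hypothetical helper for illustration only; not part of Ray.
    """
    batch: List[Any] = []
    with open(path, "r", encoding="utf-8") as f:
        # Iterating a file object reads one line at a time instead of
        # loading the whole file into memory.
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            batch.append(json.loads(line))
            if len(batch) >= batch_size:
                yield batch
                batch = []
    if batch:
        yield batch  # final, possibly smaller, batch
```

With this shape, peak memory is bounded by the size of one batch rather than by the size of the file, which is the behavior I expected from the read.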
Versions / Dependencies
a46f8aed884136eaa1347edc10ad55c9e5bcd650
Reproduction script
...
Issue Severity
High: It blocks me from completing my task.