ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Ray Data: read_mongo support parallel load from shards #41353

Open wolvever opened 11 months ago

wolvever commented 11 months ago

Description

MongoDataSource's get_read_tasks uses a single URI for a sharded collection. By contrast, the ShardedPartitioner in the MongoDB connector for Spark uses internal metadata to get the hosts for the different shards, which allows Spark to load partitions from different mongod processes in parallel.

It would be great if MongoDataSource implemented the same logic to unlock parallel loading.
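To make the request concrete, here is a minimal sketch of the planning step such a partitioner might perform. It is an illustration under stated assumptions, not Ray's or the Spark connector's actual code: `ShardReadTask`, `strip_replica_set`, and `plan_shard_reads` are hypothetical names, the chunk and shard documents are assumed to look like entries from MongoDB's `config.chunks` and `config.shards` collections, and the shard key is assumed to be a single field with plain comparable bounds. The sketch works on plain dicts so it runs without a MongoDB cluster.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class ShardReadTask:
    """Hypothetical unit of parallel work: one chunk range read from one shard."""
    hosts: str             # host list of the shard that owns the chunk
    match: Dict[str, Any]  # filter restricting the read to the chunk's key range


def strip_replica_set(host: str) -> str:
    # Shard hosts are assumed to be stored as "rsName/host1:port,host2:port";
    # keep only the host list after the replica set name.
    return host.split("/", 1)[-1]


def plan_shard_reads(
    chunks: List[Dict[str, Any]], shards: List[Dict[str, Any]]
) -> List[ShardReadTask]:
    """Map each chunk to the hosts of its owning shard and a range filter."""
    hosts_by_shard = {s["_id"]: strip_replica_set(s["host"]) for s in shards}
    tasks = []
    for chunk in chunks:
        # Assumption: single-field shard key, so "min" and "max" have one entry each.
        field, lower = next(iter(chunk["min"].items()))
        upper = chunk["max"][field]
        tasks.append(
            ShardReadTask(
                hosts=hosts_by_shard[chunk["shard"]],
                # Chunk ranges are half-open: lower bound inclusive, upper exclusive.
                match={field: {"$gte": lower, "$lt": upper}},
            )
        )
    return tasks


# Example metadata with made-up host names and a made-up "image_id" shard key.
example_shards = [
    {"_id": "shard0", "host": "rs0/mongo-a:27018,mongo-b:27018"},
    {"_id": "shard1", "host": "rs1/mongo-c:27018"},
]
example_chunks = [
    {"min": {"image_id": 0}, "max": {"image_id": 1000}, "shard": "shard0"},
    {"min": {"image_id": 1000}, "max": {"image_id": 2000}, "shard": "shard1"},
]
example_tasks = plan_shard_reads(example_chunks, example_shards)
```

Each resulting task could then back one read task that connects to its own shard's hosts and applies the `match` filter, so reads hit different mongod processes concurrently. A real implementation would also need to handle compound shard keys and unbounded (MinKey/MaxKey) chunk ranges, and to consider that reading from shards directly bypasses mongos.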

Use case

We use MongoDB to store image annotation JSON, and we would like to use Ray Data to load millions of those JSON documents into our training process.

anyscalesam commented 11 months ago

Thanks for raising this; we are reviewing overall Spark on Ray supportability and will take this into account.