Ray Data: read_mongo support parallel load from shards

Description

MongoDatasource.get_read_tasks uses a single URI, even when the collection is sharded. By contrast, the ShardedPartitioner in the MongoDB connector for Spark reads the cluster's internal metadata to find the hosts of the different shards, which lets Spark load partitions from different mongod processes in parallel.

It would be great if MongoDatasource implemented the same logic so that read_mongo can load from shards in parallel.
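To make the request concrete, here is a rough sketch of the shard-discovery step, not a proposed implementation. It assumes the documents in the cluster's config.shards collection have an `_id` field and a `host` field of the form `replSetName/host1:port,host2:port`. The helper names `shard_uris` and `list_shard_docs` are hypothetical and not part of Ray or pymongo, and credentials and connection options are not handled.

```python
from typing import Dict, Iterable, List


def shard_uris(shard_docs: Iterable[dict]) -> Dict[str, str]:
    """Map each shard id to a direct connection URI (hypothetical helper)."""
    uris = {}
    for doc in shard_docs:
        host = doc["host"]
        if "/" in host:
            # Replica-set shard: "rs0/host1:27018,host2:27018"
            rs_name, members = host.split("/", 1)
            uris[doc["_id"]] = f"mongodb://{members}/?replicaSet={rs_name}"
        else:
            # Standalone shard: "host1:27018"
            uris[doc["_id"]] = f"mongodb://{host}/"
    return uris


def list_shard_docs(mongos_uri: str) -> List[dict]:
    """Read shard metadata through mongos (hypothetical helper)."""
    # Imported lazily so shard_uris stays usable without pymongo installed.
    from pymongo import MongoClient

    with MongoClient(mongos_uri) as client:
        return list(client["config"]["shards"].find({}, {"_id": 1, "host": 1}))
```

With per-shard URIs available, get_read_tasks could in principle create read tasks that each target one shard. One caveat: reads that bypass mongos may see orphaned documents during chunk migrations, so a real implementation would need to account for that.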

Use case

We store image annotations in MongoDB as JSON documents, and we want to use Ray Data to load millions of those documents into our training process.

ray-project/ray #41353