Description
MongoDataSource.get_read_tasks uses a single URI even for a sharded collection. By contrast, the MongoDB Connector for Spark's ShardedPartitioner uses internal shard metadata to resolve the hosts of each shard, which allows Spark to load partitions from different mongod processes in parallel. It would be great if MongoDataSource implemented the same logic to unlock parallel loading.
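To illustrate the ShardedPartitioner-style approach, here is a minimal sketch of how a planner could turn cluster metadata into per-shard read tasks. This is not Ray's actual API: the function name plan_reads and the sample documents are hypothetical, and the real implementation would need to handle shard-key extraction, zones, and chunk migrations. The document shapes mirror what a mongos router exposes in its config database (config.shards and config.chunks).

```python
# Hypothetical sketch (not Ray's actual API): build one read task per chunk,
# each targeted at the host string of the shard that owns the chunk, so that
# tasks for different shards can connect to different mongod processes in
# parallel.
from typing import Any, Dict, List


def plan_reads(
    shard_docs: List[Dict[str, Any]],   # documents like those in config.shards
    chunk_docs: List[Dict[str, Any]],   # documents like those in config.chunks
) -> List[Dict[str, Any]]:
    """Return one task per chunk: the chunk's shard-key range plus the
    owning shard's host string."""
    # config.shards documents look like
    # {"_id": "shard0", "host": "rs0/h1:27017,h2:27017", ...}
    hosts = {d["_id"]: d["host"] for d in shard_docs}
    return [
        {"hosts": hosts[c["shard"]], "min": c["min"], "max": c["max"]}
        for c in chunk_docs
    ]


# With pymongo, the metadata would come from the mongos router, e.g.:
#   client = MongoClient(mongos_uri)
#   shard_docs = list(client["config"]["shards"].find())
#   chunk_docs = list(client["config"]["chunks"].find({"ns": "db.coll"}))
# (Note: on MongoDB 5.0+, config.chunks is keyed by collection UUID
# rather than the "ns" field.)

if __name__ == "__main__":
    shards = [
        {"_id": "shard0", "host": "rs0/h1:27017,h2:27017"},
        {"_id": "shard1", "host": "rs1/h3:27017,h4:27017"},
    ]
    chunks = [
        {"shard": "shard0", "min": {"k": 0}, "max": {"k": 100}},
        {"shard": "shard1", "min": {"k": 100}, "max": {"k": 200}},
    ]
    for task in plan_reads(shards, chunks):
        print(task)
```

Each resulting task carries both a shard-key range and the hosts to read it from, which is the information Ray Data would need to issue range queries against individual shards instead of funneling everything through one URI.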
Use case
We use MongoDB to store image annotation JSON, and we would like to use Ray Data to load millions of those JSON documents into the training process.