Open sunyakun opened 4 months ago
ray.data.RandomAccessDataset.multiget is expected to return None for missing records, but it actually returns an unexpected value for a missing key.
I found that this PR updated _RandomAccessWorker.multiget: https://github.com/ray-project/ray/pull/24825. It uses np.searchsorted to speed up multiget, but np.searchsorted returns the insertion point for a missing key, and the code uses the search result directly to fetch the row from the block without checking that col[i] == key, as in the code here: https://github.com/ray-project/ray/blob/d8c7234aeec2d7d06d218a58ae6730b9335d3ca8/python/ray/data/random_access_dataset.py#L266-L269
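The np.searchsorted behavior described above can be shown with NumPy alone. This is a minimal sketch, not the Ray implementation: lookup_unchecked and lookup_checked are hypothetical helper names, and a plain sorted array stands in for the key column of a block. The checked version compares the key found at each returned index against the requested key and yields None on a mismatch.

```python
import numpy as np


def lookup_unchecked(sorted_keys, queries):
    """Hypothetical helper: trusts np.searchsorted indices without verifying them."""
    idx = np.searchsorted(sorted_keys, queries)
    # For a missing key, idx is only the insertion point, so this returns
    # the next larger key instead of signalling a miss.
    return [int(sorted_keys[i]) for i in idx]


def lookup_checked(sorted_keys, queries):
    """Hypothetical helper: returns None when the key at the index does not match."""
    queries = np.asarray(queries)
    if len(sorted_keys) == 0:
        return [None] * len(queries)
    idx = np.searchsorted(sorted_keys, queries)
    # An insertion point equal to len(sorted_keys) would be out of bounds,
    # so clip it to the last valid index before comparing.
    safe = np.minimum(idx, len(sorted_keys) - 1)
    found = sorted_keys[safe] == queries
    return [int(sorted_keys[i]) if ok else None for i, ok in zip(safe, found)]


keys = np.arange(0, 1000, 2)  # same even keys as the reproduction script
print(lookup_unchecked(keys, [1, 901]))  # [2, 902]
print(lookup_checked(keys, [1, 901, 4]))  # [None, None, 4]
```

The unchecked result matches the wrong values reported for multiget, which suggests that an equality check of this kind would be enough to restore the None result for missing keys.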
Ray: latest master; Python: 3.9.2; OS: Linux
```python
import ray
import ray.data

kv_store = ray.data.from_items(
    [i for i in range(0, 1000, 2)]
).repartition(5).to_random_access_dataset(key="item", num_workers=1)

print(ray.get(kv_store.get_async(1)), ray.get(kv_store.get_async(901)))
# output: None None
print(kv_store.multiget([1, 901]))
# output: [{'item': 2}, {'item': 902}]
```
Issue Severity: None
I can reproduce this problem, and I have opened pull request #44769 to fix it.