ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Data] RandomAccessDataset.multiget returns unexpected values for missing keys. #44768

Open sunyakun opened 4 months ago

sunyakun commented 4 months ago

What happened + What you expected to happen

ray.data.RandomAccessDataset.multiget is expected to return None for missing records, but it actually returns an unexpected value for a missing key.

I found that this PR updated _RandomAccessWorker.multiget: https://github.com/ray-project/ray/pull/24825. It uses np.searchsorted to speed up multiget, but np.searchsorted returns the insertion point for a missing key, and the code uses that search result directly to fetch the row from the block without checking col[i] == key. See the code here: https://github.com/ray-project/ray/blob/d8c7234aeec2d7d06d218a58ae6730b9335d3ca8/python/ray/data/random_access_dataset.py#L266-L269
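To illustrate the behaviour outside of Ray, here is a minimal NumPy-only sketch. The helper name lookup_sorted is hypothetical and this is not the code in random_access_dataset.py or the fix proposed in any PR; it only shows, under the assumption of a sorted key column, why the raw np.searchsorted result needs an equality check before it is used as a row index.

```python
import numpy as np


def lookup_sorted(col, keys):
    """Return the row index of each key in the sorted array `col`, or None if absent.

    Hypothetical helper for illustration only; not part of Ray.
    """
    col = np.asarray(col)
    keys = np.asarray(keys)
    if col.size == 0:
        return [None] * len(keys)
    # searchsorted returns insertion points, which also exist for missing keys.
    idx = np.searchsorted(col, keys)
    # An insertion point can equal len(col); clip it so indexing stays in bounds.
    safe = np.minimum(idx, len(col) - 1)
    # Only positions whose stored value equals the requested key are real hits.
    found = col[safe] == keys
    return [int(i) if ok else None for i, ok in zip(safe, found)]


# Same key layout as the reproduction script: even numbers 0, 2, ..., 998.
col = np.arange(0, 1000, 2)

# Without the check, missing keys 1 and 901 map to the rows holding 2 and 902.
naive = np.searchsorted(col, [1, 901])

# With the check, missing keys come back as None.
checked = lookup_sorted(col, [1, 901, 4, 998, 1000])
```

Here naive is [1, 451], the positions of 2 and 902, which matches the wrong rows returned by multiget. checked is [None, None, 2, 499, None]: the missing keys, including 1000 past the end of the column, are reported as absent.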

Versions / Dependencies

Ray: latest master
Python: 3.9.2
OS: Linux

Reproduction script

import ray
import ray.data

kv_store = ray.data.from_items(
    [i for i in range(0, 1000, 2)]
).repartition(5).to_random_access_dataset(key="item", num_workers=1)

print(ray.get(kv_store.get_async(1)), ray.get(kv_store.get_async(901)))
# output: None None

print(kv_store.multiget([1, 901]))
# output: [{'item': 2}, {'item': 902}]

Issue Severity

None

tespent commented 4 months ago

I can reproduce this problem, and I created pull request #44769 to try to fix it.