Open Tobin-rgb opened 1 month ago
`filter` adds an indices mapping on top of the dataset, so `sort` has to gather all the rows that are kept to form a new Arrow table and then sort that table. Gathering all the rows can take some time, but it is a necessary step. You can try calling `ds = ds.flatten_indices()` before sorting to remove the indices mapping.
Describe the bug
As the title says ...
Steps to reproduce the bug
`sort` on its own seems to run at normal speed, but `sort` after `filter` is extremely slow.
Expected behavior
Is this a bug, or is it a misuse of the `sort` function?
Environment info
`datasets` version: 2.20.0
`huggingface_hub` version: 0.23.4
`fsspec` version: 2023.10.0