eladsegal opened this issue 2 years ago
Hi! It's slower indeed because a dataset on which `select`/`shard`/`train_test_split`/`shuffle` has been called has to do additional steps to retrieve the data of the dataset table in the right order.
Indeed, if you call `dataset.select([0, 5, 10])`, the underlying table of the dataset is not altered to keep only the examples at indices 0, 5, and 10. Instead, an indices mapping is added on top of the table, which says that the first example is at index 0, the second at index 5, and the last one at index 10.
Therefore accessing the examples of the dataset is slower because of the additional step that uses the indices mapping.
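To see this concretely, here's a small illustration (using the `squad` train split that comes up later in this thread; `data` exposes the dataset's underlying Arrow table):

```python
from datasets import load_dataset

ds = load_dataset("squad", split="train")
subset = ds.select([0, 5, 10])

print(len(subset))           # 3: only the selected examples are visible
print(subset.data.num_rows)  # 87599: the underlying Arrow table is unchanged
```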
The step that takes the most time is querying the dataset table from a list of indices (in `_query_table`). In your case it can be made significantly faster by checking whether the indices are contiguous. If they're contiguous, we could pass a Python `slice` or `range` instead of a list of integers to `_query_table`. This way `_query_table` will do only one lookup to get the queried batch instead of `batch_size` lookups.
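For illustration, a minimal sketch of such a contiguity check (the helper `as_contiguous_range` is hypothetical, not part of `datasets`):

```python
def as_contiguous_range(indices):
    """Return an equivalent range if `indices` is contiguous and ascending, else None."""
    if indices and all(b - a == 1 for a, b in zip(indices, indices[1:])):
        return range(indices[0], indices[-1] + 1)
    return None

indices = list(range(1000))
contiguous = as_contiguous_range(indices)
# contiguous == range(0, 1000): the table could then be sliced once
# (e.g. with pyarrow's Table.slice) instead of doing 1000 individual lookups.
```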
Given that calling `select` with contiguous indices is a common use case, I'm in favor of implementing such an optimization :)
Let me know what you think
Hi, thanks for the response! I still don't understand why it is so much slower than iterating and saving:
```python
from datasets import load_dataset

original = load_dataset("squad", split="train")
original.to_json("from_original.json")  # Takes 0 seconds

selected_subset1 = original.select([i for i in range(len(original))])
selected_subset1.to_json("from_select1.json")  # Takes 99 seconds

selected_subset2 = original.select([i for i in range(int(len(original) / 2))])
selected_subset2.to_json("from_select2.json")  # Takes 47 seconds

selected_subset3 = original.select([i for i in range(len(original)) if i % 2 == 0])
selected_subset3.to_json("from_select3.json")  # Takes 49 seconds
```
```python
import json
import time

def fast_to_json(dataset, path):
    start = time.time()
    with open(path, mode="w") as f:
        for example in dataset:
            f.write(json.dumps(example, separators=(',', ':')) + "\n")
    end = time.time()
    print(f"Saved {len(dataset)} examples to {path} in {end - start} seconds.")

fast_to_json(original, "from_original_fast.json")
fast_to_json(selected_subset1, "from_select1_fast.json")
fast_to_json(selected_subset2, "from_select2_fast.json")
fast_to_json(selected_subset3, "from_select3_fast.json")
```

```
Saved 87599 examples to from_original_fast.json in 8 seconds.
Saved 87599 examples to from_select1_fast.json in 10 seconds.
Saved 43799 examples to from_select2_fast.json in 6 seconds.
Saved 43800 examples to from_select3_fast.json in 5 seconds.
```
There are slight differences between what you're doing and what `to_json` is actually doing. In particular, `to_json` currently converts batches of rows (as an Arrow table) to a pandas DataFrame, and then to JSON Lines. From your benchmark it looks like it's faster if we don't use pandas.
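Roughly, that current path looks like this (a simplified sketch, not the exact `to_json` internals):

```python
import pyarrow as pa

# A toy batch standing in for one batch of dataset rows
batch = pa.table({"id": [1, 2], "title": ["a", "b"]})

# Current approach (simplified): Arrow table -> pandas DataFrame -> JSON Lines
json_lines = batch.to_pandas().to_json(orient="records", lines=True)
print(json_lines)  # {"id":1,"title":"a"} / {"id":2,"title":"b"}
```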
Thanks for investigating, I think we can optimize `to_json` significantly thanks to your test.
Thanks for your observations, @eladsegal! I spent some time with this and tried different approaches. It turns out that https://github.com/huggingface/datasets/blob/bb13373637b1acc55f8a468a8927a56cf4732230/src/datasets/io/json.py#L100 is what causes the problem when we use `to_json` after `select`. This happens when the `indices` parameter in `query_table` is not `None` (if it is `None`, then `to_json` works as expected).
To circumvent this problem, I found that instead of going Arrow Table -> pandas -> JSON we can go directly to JSON by using `to_pydict()`, which is a little slower than the current approach, but at least `select` works properly now. Let me know what you guys think of it, @lhoestq, @eladsegal?
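A simplified sketch of that alternative path (not the actual patch; the row pivoting shown here is just one way to do it):

```python
import json
import pyarrow as pa

batch = pa.table({"id": [1, 2], "title": ["a", "b"]})

# Alternative approach (simplified): Arrow table -> Python dict -> JSON Lines,
# bypassing pandas entirely.
columns = batch.to_pydict()  # {"id": [1, 2], "title": ["a", "b"]}
rows = (dict(zip(columns, values)) for values in zip(*columns.values()))
json_lines = "\n".join(json.dumps(row) for row in rows)
print(json_lines)
```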
Sounds good to me! Feel free to also share your benchmarks for reference, @bhavitvyamalik.
Posting it in @eladsegal's format:
For `squad`:
Saving examples using current `to_json` in 3.63 secs
Saving examples to `from_select1_fast.json` in 5.00 secs
Saving examples to `from_select2_fast.json` in 2.45 secs
Saving examples to `from_select3_fast.json` in 2.50 secs
For `squad_v2`:
Saving examples using current `to_json` in 5.26 secs
Saving examples to `from_select1_fast.json` in 7.54 secs
Saving examples to `from_select2_fast.json` in 3.80 secs
Saving examples to `from_select3_fast.json` in 3.67 secs
Describe the bug
Saving a dataset to JSON with `to_json` is extremely slow after using `.select` on the original dataset.
Steps to reproduce the bug
Environment info
- `datasets` version: master (https://github.com/huggingface/datasets/commit/6090f3cfb5c819f441dd4a4bb635e037c875b044)