
`.to_json` is extremely slow after `.select` #3419

eladsegal opened this issue 2 years ago (status: Open)

Describe the bug

Saving a dataset to JSON with `to_json` is extremely slow after using `.select` on the original dataset.

Steps to reproduce the bug

```python
from datasets import load_dataset

original = load_dataset("squad", split="train")
original.to_json("from_original.json")  # Takes 0 seconds

selected_subset1 = original.select([i for i in range(len(original))])
selected_subset1.to_json("from_select1.json")  # Takes 212 seconds

selected_subset2 = original.select([i for i in range(int(len(original) / 2))])
selected_subset2.to_json("from_select2.json")  # Takes 90 seconds
```

Environment info

lhoestq commented 2 years ago

Hi! It's slower indeed because a dataset on which select/shard/train_test_split/shuffle has been called has to perform additional steps to retrieve the data from the dataset table in the right order.

Indeed, if you call dataset.select([0, 5, 10]), the underlying table of the dataset is not altered to keep only the examples at index 0, 5, and 10. Instead, an indices mapping is added on top of the table, which says that the first example is at index 0, the second at index 5, and the last one at index 10.

Therefore accessing the examples of the dataset is slower because of the additional step that uses the indices mapping.
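To make the mapping visible, here is a small sketch that pokes at the internal `_indices` attribute (not public API, so it may change between versions); `flatten_indices()` is the public way to materialize the mapping:

```python
from datasets import load_dataset

ds = load_dataset("squad", split="train")
print(ds._indices)  # None: rows are read directly from the Arrow table

subset = ds.select([0, 5, 10])
# The underlying table is unchanged; select only recorded a mapping
# (a tiny Arrow table of row indices) on top of it
print(subset._indices is not None)  # True

# flatten_indices() rewrites the table in the mapped order and drops
# the mapping, making row access direct again (at the cost of a copy)
flat = subset.flatten_indices()
print(flat._indices)  # None
```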

The step that takes the most time is to query the dataset table from a list of indices here:

https://github.com/huggingface/datasets/blob/047dc756ed20fbf06e6bcaf910464aba0e20610a/src/datasets/formatting/formatting.py#L61-L63

In your case it can be made significantly faster by checking if the indices are contiguous. If they're contiguous, we could pass a Python slice or range instead of a list of integers to `_query_table`. This way `_query_table` would do only one lookup to get the queried batch instead of `batch_size` lookups.

Given that calling select with contiguous indices is a common use case I'm in favor of implementing such an optimization :) Let me know what you think
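For reference, a minimal sketch of what such a contiguity check could look like (illustrative names, not the actual `datasets` internals):

```python
import pyarrow as pa

# Hypothetical helper, not the real _query_table: if the requested
# indices form a contiguous run, a single slice replaces per-row lookups.
def query_batch(table: pa.Table, indices: list) -> pa.Table:
    first, last = indices[0], indices[-1]
    if indices == list(range(first, last + 1)):
        return table.slice(first, len(indices))  # one contiguous lookup
    return table.take(indices)  # general case: one lookup per index
```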

eladsegal commented 2 years ago

Hi, thanks for the response! I still don't understand why it is so much slower than iterating and saving:

```python
from datasets import load_dataset

original = load_dataset("squad", split="train")
original.to_json("from_original.json")  # Takes 0 seconds

selected_subset1 = original.select([i for i in range(len(original))])
selected_subset1.to_json("from_select1.json")  # Takes 99 seconds

selected_subset2 = original.select([i for i in range(int(len(original) / 2))])
selected_subset2.to_json("from_select2.json")  # Takes 47 seconds

selected_subset3 = original.select([i for i in range(len(original)) if i % 2 == 0])
selected_subset3.to_json("from_select3.json")  # Takes 49 seconds

import json
import time

def fast_to_json(dataset, path):
    start = time.time()
    with open(path, mode="w") as f:
        for example in dataset:
            f.write(json.dumps(example, separators=(',', ':')) + "\n")
    end = time.time()
    print(f"Saved {len(dataset)} examples to {path} in {end - start} seconds.")

fast_to_json(original, "from_original_fast.json")
fast_to_json(selected_subset1, "from_select1_fast.json")
fast_to_json(selected_subset2, "from_select2_fast.json")
fast_to_json(selected_subset3, "from_select3_fast.json")
```

Output:

```
Saved 87599 examples to from_original_fast.json in 8 seconds.
Saved 87599 examples to from_select1_fast.json in 10 seconds.
Saved 43799 examples to from_select2_fast.json in 6 seconds.
Saved 43800 examples to from_select3_fast.json in 5 seconds.
```
lhoestq commented 2 years ago

There are slight differences between what you're doing and what `to_json` is actually doing. In particular, `to_json` currently converts batches of rows (as an Arrow table) to a pandas DataFrame, and then to JSON Lines. From your benchmark it looks like it's faster if we don't use pandas.
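Roughly, the two paths look like this (a simplified sketch, not the actual `to_json` implementation):

```python
import json
import pyarrow as pa

batch = pa.table({"id": [1, 2], "title": ["a", "b"]})

# Current path (simplified): Arrow -> pandas -> JSON Lines
via_pandas = batch.to_pandas().to_json(orient="records", lines=True)

# Direct path: Arrow -> Python rows -> JSON Lines, no pandas involved
direct = "\n".join(json.dumps(row) for row in batch.to_pylist())
```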

Thanks for investigating, I think we can optimize to_json significantly thanks to your test.

bhavitvyamalik commented 2 years ago

Thanks for your observations, @eladsegal! I spent some time with this and tried different approaches. It turns out that https://github.com/huggingface/datasets/blob/bb13373637b1acc55f8a468a8927a56cf4732230/src/datasets/io/json.py#L100 is causing the problem when we use `to_json` after `select`. This happens when the `indices` parameter in `query_table` is not None (if it is None, then `to_json` works as expected).

To circumvent this problem, I found that instead of going Arrow table -> pandas -> JSON, we can go directly to JSON by using `to_pydict()`, which is a little slower than the current approach, but at least `select` works properly now. Lmk what you guys think, @lhoestq, @eladsegal?
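A minimal sketch of the `to_pydict()` route (the column dict transposed into per-row JSON lines; the actual implementation may differ):

```python
import json
import pyarrow as pa

def table_to_jsonl(table: pa.Table) -> str:
    # to_pydict() returns {column_name: [values]}; zip the column
    # value lists together to recover one dict per row
    columns = table.to_pydict()
    return "\n".join(
        json.dumps(dict(zip(columns.keys(), row)))
        for row in zip(*columns.values())
    )
```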

lhoestq commented 2 years ago

Sounds good to me! Feel free to also share your benchmarks for reference, @bhavitvyamalik.

bhavitvyamalik commented 2 years ago

Posting it in @eladsegal's format:

```
For squad:
Saving examples using current to_json in 3.63 secs
Saving examples to from_select1_fast.json in 5.00 secs
Saving examples to from_select2_fast.json in 2.45 secs
Saving examples to from_select3_fast.json in 2.50 secs

For squad_v2:
Saving examples using current to_json in 5.26 secs
Saving examples to from_select1_fast.json in 7.54 secs
Saving examples to from_select2_fast.json in 3.80 secs
Saving examples to from_select3_fast.json in 3.67 secs
```