huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Querying examples from big datasets is slower than small datasets #1803

Closed lhoestq closed 3 years ago

lhoestq commented 3 years ago

After some experiments with bookcorpus I noticed that querying examples from big datasets is slower than from small datasets. For example:

from datasets import load_dataset

b1 = load_dataset("bookcorpus", split="train[:1%]")
b50 = load_dataset("bookcorpus", split="train[:50%]")
b100 = load_dataset("bookcorpus", split="train[:100%]")

%timeit _ = b1[-1]                                                                     
# 12.2 µs ± 70.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit _ = b50[-1]                                                                    
# 92.5 µs ± 1.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit _ = b100[-1]                                                                      
# 177 µs ± 3.13 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

It looks like the time to fetch the example increases with the size of the dataset.

This may be due to the use of the Arrow streaming format to store the data on disk. I guess pyarrow needs to iterate through the file as a stream to find the queried sample.

Maybe switching to the Arrow IPC file format could help fix this issue.

Indeed, according to the documentation, it's identical to the streaming format except that it contains the memory offsets of each sample, which could fix the issue:

We define a “file format” supporting random access that is built with the stream format. The file starts and ends with a magic string ARROW1 (plus padding). What follows in the file is identical to the stream format. At the end of the file, we write a footer containing a redundant copy of the schema (which is a part of the streaming format) plus memory offsets and sizes for each of the data blocks in the file. This enables random access to any record batch in the file. See File.fbs for the precise details of the file footer.
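
For reference, here is a minimal pyarrow sketch (not the datasets internals; the file names and sizes are made up) contrasting the two formats: the stream format has to be read sequentially, while the file format's footer records block offsets, so any record batch is directly reachable.

import pyarrow as pa

table = pa.table({"text": [f"line {i}" for i in range(100_000)]})
batches = table.to_batches(max_chunksize=1_000)

# write the same record batches in both IPC formats
with pa.OSFile("data.stream", "wb") as sink:
    with pa.ipc.new_stream(sink, table.schema) as writer:
        for batch in batches:
            writer.write_batch(batch)

with pa.OSFile("data.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        for batch in batches:
            writer.write_batch(batch)

# stream format: reaching the last batch means iterating over all of them
reader = pa.ipc.open_stream(pa.OSFile("data.stream", "rb"))
for batch in reader:
    last_batch = batch

# file format: the footer lets us jump straight to any record batch
reader = pa.ipc.open_file(pa.OSFile("data.arrow", "rb"))
last_batch = reader.get_batch(reader.num_record_batches - 1)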

cc @gaceladri since it can help speed up your training when this one is fixed.

ink-pad commented 3 years ago

Hello @lhoestq / @gaceladri: We have been seeing similar behavior with bigger datasets, where querying time increases. Are you folks aware of any solution that fixes this problem yet?

lhoestq commented 3 years ago

Hi! I'm pretty sure it can be fixed by using the Arrow IPC file format instead of the raw streaming format, but I haven't tested it yet. I'll take a look at it soon and let you know.

gaceladri commented 3 years ago

My workaround is to shard the dataset into splits on my SSD and feed the data in separate training sessions. But it is a bit of a pain to reload the last training session with the rest of the splits using the Trainer in transformers.

I mean, when I split the training and then reload the model and optimizer, it doesn't get the correct global_status of the optimizer, so I need to hardcode some things. I'm planning to open an issue in transformers to think about it.

from datasets import concatenate_datasets, load_dataset
from transformers import Trainer, TrainingArguments

# encode, model, data_collator and tokenizer are defined earlier in my script
book_corpus = load_dataset("bookcorpus", split="train[:25%]")
wikicorpus = load_dataset("wikicorpus", split="train[:25%]")
openwebtext = load_dataset("openwebtext", split="train[:25%]")

big_dataset = concatenate_datasets([wikicorpus, openwebtext, book_corpus])
big_dataset = big_dataset.shuffle(seed=42)  # shuffle returns a new dataset, it is not in-place
big_dataset = big_dataset.map(encode, batched=True, num_proc=20, load_from_cache_file=True, writer_batch_size=5000)
big_dataset.set_format(type='torch', columns=["text", "input_ids", "attention_mask", "token_type_ids"])

training_args = TrainingArguments(
    output_dir="./linear_bert",
    overwrite_output_dir=True,
    per_device_train_batch_size=71,
    save_steps=500,
    save_total_limit=10,
    logging_first_step=True,
    logging_steps=100,
    gradient_accumulation_steps=9,
    fp16=True,
    dataloader_num_workers=20,
    warmup_steps=24000,
    learning_rate=0.000545205002870214,
    adam_epsilon=1e-6,
    adam_beta2=0.98,
    weight_decay=0.01,
    max_steps=138974,  # the total number of steps after concatenating 100% datasets
    max_grad_norm=1.0,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=big_dataset,
    tokenizer=tokenizer,
)

I do one training pass with the total number of steps for this shard, and I use len(big_dataset)/batch_size (hardcoded in trainer.py) to stop the training once I have passed over all the examples in this split.
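
A rough sketch of that step computation (the names are illustrative, and whether gradient accumulation and the number of devices should also divide it is an assumption left out here):

batch_size = 71  # per_device_train_batch_size from the TrainingArguments above
steps_for_this_shard = len(big_dataset) // batch_size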

I'm at work right now; I will edit this comment with a more elaborate answer when I leave work.

lhoestq commented 3 years ago

I just tested it, and using the Arrow file format doesn't improve the speed... This will need further investigation.

My guess is that it has to iterate over the record batches or chunks of a ChunkedArray in order to retrieve elements.

However, if we know in advance which chunk the element is in, and at what index inside that chunk, then we can access it instantly. But this requires dealing with the chunked arrays instead of the pyarrow Table directly, which is not practical.
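
For the record, a small sketch of that idea (a hypothetical helper, not the actual datasets or pyarrow internals): precompute the cumulative chunk lengths once, then a binary search gives both the chunk and the offset inside it.

import bisect

import pyarrow as pa

chunked = pa.chunked_array([[0, 1, 2], [3, 4], [5, 6, 7, 8]])

# cumulative number of rows at the end of each chunk, computed once
cumulative_lengths = []
total = 0
for chunk in chunked.chunks:
    total += len(chunk)
    cumulative_lengths.append(total)

def fast_lookup(chunked_array, index, cumulative_lengths):
    """Return chunked_array[index] without scanning every chunk."""
    chunk_id = bisect.bisect_right(cumulative_lengths, index)
    start = cumulative_lengths[chunk_id - 1] if chunk_id > 0 else 0
    return chunked_array.chunk(chunk_id)[index - start]

assert fast_lookup(chunked, 7, cumulative_lengths).as_py() == 7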

abisee commented 3 years ago

I have a dataset with about 2.7 million rows (which I'm loading via load_from_disk), and I need to fetch around 300k (particular) rows of it, by index. Currently this is taking a really long time (~8 hours). I tried sharding the large dataset but overall it doesn't change how long it takes to fetch the desired rows.

I actually have enough RAM that I could fit the large dataset in memory. Would having the large dataset in memory speed up querying? To find out, I tried to load (a column of) the large dataset into memory like this:

column_data = large_ds['column_name']

but in itself this takes a really long time.
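
A minimal sketch of the two access patterns described above (the path, the indices and the column name are placeholders, not from my actual setup):

from datasets import load_from_disk

large_ds = load_from_disk("path/to/large_dataset")    # ~2.7 million rows
indices = list(range(0, len(large_ds), 9))[:300_000]  # stand-in for the real ~300k row ids

# fetching the rows one by one, by index
rows = [large_ds[i] for i in indices]

# pulling an entire column into memory
column_data = large_ds["column_name"]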

I'm pretty stuck - do you have any ideas what I should do?

lhoestq commented 3 years ago

Hi! Feel free to post a message on the forum. I'd be happy to help you with this.

In your post on the forum, feel free to add more details about your setup: What are the column names and types of your dataset? How was the dataset constructed? Is the dataset shuffled? Is it tokenized? Are you on an SSD or an HDD?

I'm sure we can figure something out. For example, on my laptop I can access the 6 million articles from wikipedia in less than a minute.
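
For reference, a sketch of what that kind of access looks like (the wikipedia config name and the batched slicing are assumptions, not part of the original comment):

from datasets import load_dataset

wiki = load_dataset("wikipedia", "20200501.en", split="train")

# iterate over all the articles by querying contiguous slices of rows
for start in range(0, len(wiki), 1000):
    batch = wiki[start : start + 1000]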

abisee commented 3 years ago

Thanks @lhoestq, I've posted on the forum.

albertvillanova commented 3 years ago

Fixed by #2122.