huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Datasets performance slow? - 6.4x slower than in memory dataset #708

Closed: eugeneware closed this issue 3 years ago

eugeneware commented 3 years ago

I've been very excited about this amazing datasets project. However, I've noticed that the performance can be substantially slower than using an in-memory dataset.

Now, this is expected I guess, due to memory mapping data using arrow files, and you don't get anything for free. But I was surprised at how much slower.

For example, with the yelp_polarity dataset (560,000 data points, or 17,500 batches of 32), it was taking me 3:31 just to process the data and get it onto the GPU (no model involved), whereas the equivalent in-memory dataset finished in just 0:33.

Is this expected? Given that one of the goals of this project is also to accelerate dataset processing, this seems slower than I would expect. I understand the advantages of being able to work on datasets that exceed memory, and that's very exciting to me, but I thought I'd open this issue to discuss it.

For reference, I'm running an AMD Ryzen Threadripper 1900X 8-core CPU with 128 GB of RAM, a Samsung 960 EVO NVMe SSD, and an RTX Titan 24 GB GPU.

I can see with iotop that the dataset quickly gets loaded into the system read buffers and thus doesn't incur any additional I/O reads. So in theory all the data should be in RAM, but in my benchmark code below it's still 6.4 times slower.

What am I doing wrong? And is there a way to force the dataset to load completely into memory instead of being memory-mapped, in cases where you want maximum performance?

At 3:31 for 17,500 batches, that's 12 ms per batch. Does this 12 ms just become insignificant relative to the forward and backward passes, and is it thus not worth worrying about in practice?

In any case, here's my benchmark code, benchmark.py. If you run it with the argument memory, it will copy the data into memory before executing the same test.

import sys
from datasets import load_dataset
from transformers import DataCollatorWithPadding, BertTokenizerFast
from torch.utils.data import DataLoader
from tqdm import tqdm

if __name__ == '__main__':
    tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
    collate_fn = DataCollatorWithPadding(tokenizer, padding=True)

    ds = load_dataset('yelp_polarity')

    def do_tokenize(x):
        return tokenizer(x['text'], truncation=True)

    ds = ds.map(do_tokenize, batched=True)
    ds.set_format('torch', ['input_ids', 'token_type_ids', 'attention_mask'])

    if len(sys.argv) == 2 and sys.argv[1] == 'memory':
        # copy to memory - probably a faster way to do this - but demonstrates the point
        # approximately 530 batches per second - 17500 batches in 0:33
        print('using memory')
        _ds = [data for data in tqdm(ds['train'])]
    else:
        # approximately 83 batches per second - 17500 batches in 3:31
        print('using datasets')
        _ds = ds['train']

    dl = DataLoader(_ds, shuffle=True, collate_fn=collate_fn, batch_size=32, num_workers=4)

    for data in tqdm(dl):
        for k, v in data.items():
            data[k] = v.to('cuda')
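(Running python benchmark.py exercises the memory-mapped path, while python benchmark.py memory first copies the data into a Python list, matching the two branches in the script above.)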

For reference, my conda environment is here

Once again, I'm very excited about this library and how easy it makes loading datasets without worrying about system memory constraints.

Thanks for all your great work.

lersouza commented 3 years ago

Facing a similar issue here. My model using the SQuAD dataset takes about 1 h to process with in-memory data and more than 2 h with datasets directly.

thomwolf commented 3 years ago

And what if you use in-memory data with datasets via load_dataset(..., keep_in_memory=True)?

lersouza commented 3 years ago

Thanks for the tip @thomwolf! I did not see that flag in the docs. I'll try it.

thomwolf commented 3 years ago

We should indeed add it, and maybe also a specific docs section with all the tips for maximal speed. What do you think @lhoestq @SBrandeis @yjernite?

lhoestq commented 3 years ago

By default, datasets loaded with load_dataset live on disk. It's possible to load them into memory by using transforms like .map(..., keep_in_memory=True).

Small correction to @thomwolf's comment above: currently we don't have a keep_in_memory parameter for load_dataset AFAIK, but it would indeed be nice to add it :)
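For reference, a minimal sketch of the .map(..., keep_in_memory=True) route described above, reusing the yelp_polarity / BERT tokenizer setup from the original benchmark (the column names are copied from that script, not prescribed by the library):

from datasets import load_dataset
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
ds = load_dataset('yelp_polarity')

# keep_in_memory=True keeps the mapped result as an in-memory Arrow table
# instead of writing it to the on-disk cache.
ds = ds.map(lambda x: tokenizer(x['text'], truncation=True),
            batched=True, keep_in_memory=True)
ds.set_format('torch', ['input_ids', 'token_type_ids', 'attention_mask'])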

thomwolf commented 3 years ago

Yes indeed we should add it!

lersouza commented 3 years ago

Great! Thanks a lot.

I ran one test using map(..., keep_in_memory=True) and another using in-memory-only data.

features = dataset.map(tokenize, batched=True, remove_columns=dataset['train'].column_names)
features.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask'])

features_in_memory = dataset.map(tokenize, batched=True, keep_in_memory=True, remove_columns=dataset['train'].column_names)
features_in_memory.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask'])

in_memory = [features['train'][i] for i in range(len(features['train']))]

Using the features without any tweak, iterating the entire DataLoader and copying each batch to CUDA took 1min17s:

%%time

for i, batch in enumerate(DataLoader(features['train'], batch_size=16, num_workers=4)):
    batch['input_ids'].to(device)

Using the features mapped with keep_in_memory=True, it also took 1min17s:

%%time

for i, batch in enumerate(DataLoader(features_in_memory['train'], batch_size=16, num_workers=4)):
    batch['input_ids'].to(device)

And using every element converted from the original dataset into an in-memory list, it took 12.5s:

%%time

for i, batch in enumerate(DataLoader(in_memory, batch_size=16, num_workers=4)):
    batch['input_ids'].to(device)

Taking a closer look at my SQuAD code with a profiler, I see a lot of calls to the POSIX read API. It seems that it is really relying on disk, which results in a very long training time.
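A minimal sketch of how the gap could be checked directly, reusing the features and in_memory objects defined above (the item count of 1000 is an arbitrary assumption):

import time

def time_random_access(ds, n=1000):
    # average seconds per __getitem__ call over the first n items
    start = time.perf_counter()
    for i in range(n):
        _ = ds[i]
    return (time.perf_counter() - start) / n

print('arrow-backed dataset:', time_random_access(features['train']))
print('plain python list:   ', time_random_access(in_memory))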

gaceladri commented 3 years ago

I am having the same issue here. When loading from memory I can get GPU utilization up to 70%, but when loading after mapping I can only get 40%.

On disk:

book_corpus = load_dataset('bookcorpus', 'plain_text', cache_dir='/home/ad/Desktop/bookcorpus', split='train[:20%]')
book_corpus = book_corpus.map(encode, batched=True, num_proc=20, load_from_cache_file=True, batch_size=2500)
book_corpus.set_format(type='torch', columns=['text', "input_ids", "attention_mask", "token_type_ids"])

training_args = TrainingArguments(
    output_dir="./mobile_bert_big",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    save_steps=50,
    save_total_limit=2,
    logging_first_step=True,
    warmup_steps=100,
    logging_steps=50,
    eval_steps=100,
    no_cuda=False,
    gradient_accumulation_steps=16,
    fp16=True)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=book_corpus,
    tokenizer=tokenizer)

On disk I can only get 0.17 it/s: [ 13/28907 01:03 < 46:03:27, 0.17 it/s, Epoch 0.00/1]

If I load it with a custom torch.utils.data.Dataset:

class BCorpusDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        item = [torch.tensor(val[idx]) for key, val in self.encodings.items()][0]
        return item

    def __len__(self):
        length = [len(val) for key, val in self.encodings.items()][0]
        return length

book_corpus = book_corpus.select([i for i in range(16*2000)])  # select a subset so we don't keep 20% of BookCorpus in memory
book_corpus = BCorpusDataset(book_corpus)

I can get: [ 5/62 00:09 < 03:03, 0.31 it/s, Epoch 0.06/1]

But obviously I cannot fit all of BookCorpus in memory xD

EDIT: something weird is going on. If I load 1% of bookcorpus from disk:

book_corpus = load_dataset('bookcorpus', 'plain_text', cache_dir='/home/ad/Desktop/bookcorpus', split='train[:1%]')

I can get 0.28 it/s (the same as in memory), but if I load 20% of bookcorpus:

book_corpus = load_dataset('bookcorpus', 'plain_text', cache_dir='/home/ad/Desktop/bookcorpus', split='train[:20%]')

I again get 0.17 it/s.

Am I missing something? I think it is related to dataset size rather than disk vs. in-memory.

gaceladri commented 3 years ago

Is there a way to increase the number of batches read from memory, or to multiprocess the reading? I think one of two things is happening: either it is reading with just one core, or it is reading very small chunks from disk, leaving my GPU at 0% between batches.

gaceladri commented 3 years ago

My fault! I had not seen dataloader_num_workers in TrainingArguments! Now I can parallelize and go fast. Sorry, and thanks.
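For reference, a minimal sketch of where that flag goes, reusing the TrainingArguments setup from the earlier comment (the worker count is an assumption to be tuned to the machine):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./mobile_bert_big",
    per_device_train_batch_size=32,
    fp16=True,
    # dataloader_num_workers controls how many subprocesses the Trainer's
    # DataLoader uses to fetch batches from the (memory-mapped) dataset.
    dataloader_num_workers=8,  # assumption: tune to the available CPU cores
)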