huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Faster Shuffling? #406

Closed: mitchellgordon95 closed this issue 4 years ago

mitchellgordon95 commented 4 years ago

Consider shuffling bookcorpus:

dataset = nlp.load_dataset('bookcorpus', split='train')
dataset.shuffle()

According to tqdm, this will take around 2.5 hours on my machine to complete (even with the faster version of select from #405). I've also tried with keep_in_memory=True and writer_batch_size=1000.
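For reference, a sketch of how those options were passed (the exact call isn't shown above; keep_in_memory and writer_batch_size are assumed to be keyword arguments to shuffle, as the wording suggests):

dataset.shuffle(keep_in_memory=True, writer_batch_size=1000)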

But I can also just write the lines to a text file:

batch_size = 100000
with open('tmp.txt', 'w+') as out_f:
    for i in tqdm(range(0, len(dataset), batch_size)):
        batch = dataset[i:i+batch_size]['text']
        print("\n".join(batch), file=out_f)

Which completes in a couple minutes, followed by shuf tmp.txt > tmp2.txt which completes in under a minute. And finally,

dataset = nlp.load_dataset('text', data_files='tmp2.txt')

Which completes in under 10 minutes. I read up on Apache Arrow this morning, and it seems like the columnar data format is not especially well-suited to shuffling rows, since moving items around requires a lot of book-keeping.

Is shuffle inherently slow, or am I just using it wrong? And if it is slow, would it make sense to try converting the data to a row-based format on disk and then shuffling? (Instead of calling select with a random permutation, as is currently done.)
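For context, the shuffle being timed boils down to a select over a random permutation, roughly like this (a sketch, not the library's exact implementation):

import numpy as np

# shuffle() essentially selects all rows in a randomly permuted order
permutation = np.random.permutation(len(dataset))
shuffled = dataset.select(permutation)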

thomwolf commented 4 years ago

I think the slowness here probably comes from the fact that we are copying from and to Python.

@lhoestq for all the select-based methods I think we should stay in Arrow format and update the writer so that it can accept Arrow tables or batches as well. What do you think?
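A sketch of the contrast being proposed, reusing the names (dset, writer, batch_size) that appear in the benchmarks below; the contiguous slice is only there to illustrate the data path:

# current path: Arrow table -> Python dict -> Arrow (copies through Python objects)
batch = dset._data.slice(i, batch_size).to_pydict()
writer.write_batch(batch)

# proposed path: stay in Arrow end-to-end
table = dset._data.slice(i, batch_size)
writer.write_table(table)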

lhoestq commented 4 years ago

> @lhoestq for all the select-based methods I think we should stay in Arrow format and update the writer so that it can accept Arrow tables or batches as well. What do you think?

I just tried writer.write_table with tables of 1000 elements, and it's slower than the solution in #405.

On my side (select 10 000 examples):

I'll try with arrays and record batches to see if we can make it work.
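For reference, the write_table variant that turned out slower looks roughly like this (a sketch reconstructed from the description, assuming the same dset, indices and ArrowWriter setup as in the benchmark below):

# write the selected rows as small Arrow tables (1000 rows each) instead of Python dicts
for start in range(0, len(indices), 1000):
    table = pa.concat_tables(
        dset._data.slice(int(j), 1) for j in indices[start : start + 1000]
    )
    writer.write_table(table)
writer.finalize()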

lhoestq commented 4 years ago

I tried using .take from pyarrow RecordBatches, but it doesn't improve the speed that much:

import nlp
import numpy as np
import pyarrow as pa
from tqdm import tqdm
from nlp.arrow_writer import ArrowWriter  # import path assumed for the nlp library's ArrowWriter

dset = nlp.Dataset.from_file("dummy_test_select.arrow")  # dummy dataset with 100000 examples like {"a": "h"*512}
indices = np.random.randint(0, 100_000, 1000_000)

%%time
batch_size = 10_000
writer = ArrowWriter(schema=dset.schema, path="dummy_path",
                     writer_batch_size=1000, disable_nullable=False)
for i in tqdm(range(0, len(indices), batch_size)):
    # Naive approach: concatenate one-row slices for each index, then go through a Python dict
    table = pa.concat_tables(dset._data.slice(int(j), 1) for j in indices[i : min(len(indices), i + batch_size)])
    batch = table.to_pydict()
    writer.write_batch(batch)
writer.finalize()
# 9.12s

%%time
batch_size = 10_000
writer = ArrowWriter(schema=dset.schema, path="dummy_path", 
                     writer_batch_size=1000, disable_nullable=False)
for i in tqdm(range(0, len(indices), batch_size)):
    batch_indices = indices[i : min(len(indices), i + batch_size)]
    # First, extract only the indices that we need with a mask
    mask = [False] * len(dset)
    for k in batch_indices:
        mask[k] = True
    t_batch = dset._data.filter(pa.array(mask))
    # Second, build the mapping from original indices to positions in the filtered table, taking care of duplicates
    rev_positions = {}
    duplicates = 0
    for pos, j in enumerate(sorted(batch_indices)):
        if j in rev_positions:
            duplicates += 1
        else:
            rev_positions[j] = pos - duplicates
    rev_map = [rev_positions[j] for j in batch_indices]
    # Third, use `.take` from the combined recordbatch
    t_combined = t_batch.combine_chunks()  # load in memory
    recordbatch = t_combined.to_batches()[0]
    table = pa.Table.from_arrays(
        [recordbatch[c].take(pa.array(rev_map)) for c in range(len(dset._data.column_names))],
        schema=writer.schema
    )
    writer.write_table(table)
writer.finalize()
# 3.2s
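As a minimal standalone illustration of the filter-then-take pattern used in the second cell (toy data, not part of the benchmark):

import pyarrow as pa

table = pa.table({"a": ["r0", "r1", "r2", "r3"]})
mask = pa.array([True, False, True, True])          # keep rows 0, 2, 3 (original order preserved)
filtered = table.filter(mask)
batch = filtered.combine_chunks().to_batches()[0]   # single in-memory RecordBatch
reordered = batch["a"].take(pa.array([2, 0, 1]))    # rows 3, 0 and 2 of the original table
print(reordered)  # -> ["r3", "r0", "r2"]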

lhoestq commented 4 years ago

Shuffling is now significantly faster thanks to #513. Feel free to play with it now :)

Closing this one, but feel free to re-open if you have other questions
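For reference, in current versions of the library (since renamed to datasets), shuffle only builds an indices mapping on top of the Arrow table; a minimal sketch (flatten_indices is optional and only needed to rewrite the data in the shuffled order):

from datasets import load_dataset

dataset = load_dataset("bookcorpus", split="train")
shuffled = dataset.shuffle(seed=42)        # fast: only an indices mapping is created
materialized = shuffled.flatten_indices()  # optional: rewrite the table in the new order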

brando90 commented 1 year ago

> Shuffling is now significantly faster thanks to #513. Feel free to play with it now :)
>
> Closing this one, but feel free to re-open if you have other questions

I have a similar issue. My code is:

    for batch_num in range(num_batches):
        print(f'--> {batch_num=}\n') if verbose else None
        # - Get batch
        shuffled_dataset = dataset.shuffle(buffer_size=buffer_size, seed=seed)
        raw_text_batch = shuffled_dataset.take(batch_size)
        tokenized_batch = map(raw_text_batch)  # user-defined tokenization helper (definition not shown)
        if verbose:
            time_start = time.time()
            print(f'{raw_text_batch=}')
            print(f'{tokenized_batch=}')
            print(f'{next(iter(raw_text_batch))=}')
            print(f'{next(iter(tokenized_batch))=}')
            print(f'Time it took: {time.time() - time_start} seconds \a\n')

Without the shuffle it takes 4.5 secs; with it, 87.1 secs. Is this difference expected? My datasets version is:

(beyond_scale) brando9@ampere1:~/beyond-scale-language-data-diversity$ pip list | grep dataset
datasets                             2.14.3

@lhoestq thoughts?

brando90 commented 1 year ago

Still slow even after updating to 2.14.4, the most recent version as of this writing:

Time it took: 4.301205635070801 seconds 

--> batch_num=0

raw_text_batch=<datasets.iterable_dataset.IterableDataset object at 0x7f1fea7c2a40>
tokenized_batch=<datasets.iterable_dataset.IterableDataset object at 0x7f1fea7c2f20>
next(iter(raw_text_batch))={'text': "No matter which style you choose, you can be sure of one thing: our quality and craftsmanship are the best in the business. It's who we are and what we believe in. And it's evident every day on the factory floor where our dedicated teams take great pride in every stitch.", 'timestamp': '2019-04-19T06:43:50Z', 'url': 'https://institchescustoms.com/katzkin.html'}
next(iter(tokenized_batch))={'input_ids': tensor([ 2949,  2300,   543,  3918,   345,  3853,    11,   345,   460,   307,
         1654,   286,   530,  1517,    25,   674,  3081,   290,  5977, 49820,
          389,   262,  1266,   287,   262,  1597,    13,   632,   338,   508,
          356,   389,   290,   644,   356,  1975,   287,    13,   843,   340,
          338, 10678,   790,  1110,   319,   262,  8860,  4314,   810,   674,
         7256,  3466,  1011,  1049, 11293,   287,   790, 24695,    13, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256]), 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])}
Time it took: 102.99258613586426 seconds

lhoestq commented 1 year ago

Shuffling leads to random access at many different locations on disk, which is slower than reading contiguous data.

There is a super fast approximate shuffling algorithm implemented for iterable datasets though:

iterable_dataset = dataset.to_iterable_dataset(num_shards=1024)
shuffled_dataset = iterable_dataset.shuffle(buffer_size=1000)

(the first batch might be a bit slow to get because the algorithm first fills a buffer before returning the first batch, see the docs for more info)
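A minimal sketch of how this slots into the batching loop from the earlier comment: shuffle once outside the loop and then iterate, so the buffer is only filled once (batch_size, seed and the tokenization step are taken from that snippet and assumed here):

iterable_dataset = dataset.to_iterable_dataset(num_shards=1024)
shuffled_dataset = iterable_dataset.shuffle(buffer_size=1000, seed=seed)

# iterate over the stream instead of re-shuffling for every batch;
# the buffer-filling cost is paid only once, before the first example
batch = []
for example in shuffled_dataset:
    batch.append(example)
    if len(batch) == batch_size:
        # ... tokenize / consume the batch here ...
        batch = []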