I think the slowness here probably comes from the fact that we are copying to and from Python.
@lhoestq for all the select-based methods I think we should stay in Arrow format and update the writer so that it can accept Arrow tables or batches as well. What do you think?
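To illustrate the idea of staying in Arrow format, here is a small sketch of writing a pyarrow Table straight to disk with pyarrow's IPC writer, with no Python-object round-trip. The table contents and file name are made up for illustration; this is not the library's ArrowWriter:

import pyarrow as pa

# A small in-memory table, illustrative only
table = pa.table({"a": ["h" * 512] * 1000})

# Write the table to an Arrow IPC stream without converting any row to Python objects
sink = pa.OSFile("dummy_path.arrow", "wb")
ipc_writer = pa.RecordBatchStreamWriter(sink, table.schema)
ipc_writer.write_table(table)
ipc_writer.close()
sink.close()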
I just tried writer.write_table with tables of 1000 elements and it's slower than the solution in #405.
On my side (select 10 000 examples):
I'll try with arrays and record batches to see if we can make it work.
I tried using .take from pyarrow RecordBatches but it doesn't improve the speed that much:
import nlp
import numpy as np
import pyarrow as pa
from tqdm.auto import tqdm
from nlp.arrow_writer import ArrowWriter

dset = nlp.Dataset.from_file("dummy_test_select.arrow")  # dummy dataset with 100 000 examples like {"a": "h"*512}
indices = np.random.randint(0, 100_000, 1_000_000)

%%time
batch_size = 10_000
writer = ArrowWriter(schema=dset.schema, path="dummy_path",
                     writer_batch_size=1000, disable_nullable=False)
for i in tqdm(range(0, len(indices), batch_size)):
    batch_indices = indices[i : min(len(indices), i + batch_size)]
    # Slice out one row per index and concatenate them into a single table
    table = pa.concat_tables(dset._data.slice(int(j), 1) for j in batch_indices)
    # Converting to a python dict is the expensive part here
    batch = table.to_pydict()
    writer.write_batch(batch)
writer.finalize()
# 9.12s
%%time
batch_size = 10_000
writer = ArrowWriter(schema=dset.schema, path="dummy_path",
                     writer_batch_size=1000, disable_nullable=False)
for i in tqdm(range(0, len(indices), batch_size)):
    batch_indices = indices[i : min(len(indices), i + batch_size)]
    # First, extract only the rows that we need with a boolean mask
    mask = [False] * len(dset)
    for k in batch_indices:
        mask[k] = True
    t_batch = dset._data.filter(pa.array(mask))
    # Second, build the list of positions in the filtered table, taking care of duplicates
    rev_positions = {}
    duplicates = 0
    for pos, j in enumerate(sorted(batch_indices)):
        if j in rev_positions:
            duplicates += 1
        else:
            rev_positions[j] = pos - duplicates
    rev_map = [rev_positions[j] for j in batch_indices]
    # Third, use `.take` on the combined record batch
    t_combined = t_batch.combine_chunks()  # load in memory
    recordbatch = t_combined.to_batches()[0]
    table = pa.Table.from_arrays(
        [recordbatch[c].take(pa.array(rev_map)) for c in range(len(dset._data.column_names))],
        schema=writer.schema,
    )
    writer.write_table(table)
writer.finalize()
# 3.2s
Shuffling is now significantly faster thanks to #513. Feel free to play with it now :)
Closing this one, but feel free to re-open if you have other questions
I have a similar issue. My code is:
for batch_num in range(num_batches):
    print(f'--> {batch_num=}\n') if verbose else None
    # - Get batch
    shuffled_dataset = dataset.shuffle(buffer_size=buffer_size, seed=seed)
    raw_text_batch = shuffled_dataset.take(batch_size)
    tokenized_batch = map(raw_text_batch)  # `map` here is a user-defined tokenization helper, not the builtin
    if verbose:
        time_start = time.time()
        print(f'{raw_text_batch=}')
        print(f'{tokenized_batch=}')
        print(f'{next(iter(raw_text_batch))=}')
        print(f'{next(iter(tokenized_batch))=}')
        print(f'Time it took: {time.time() - time_start} seconds \a\n')
Without the shuffle it takes 4.5 secs; with it, it takes 87.1 secs. Is this difference expected? My datasets version is:
(beyond_scale) brando9@ampere1:~/beyond-scale-language-data-diversity$ pip list | grep dataset
datasets 2.14.3
@lhoestq thoughts?
Still slow even after updating to 2.14.4 (most recent as of this writing):
Time it took: 4.301205635070801 seconds
--> batch_num=0
raw_text_batch=<datasets.iterable_dataset.IterableDataset object at 0x7f1fea7c2a40>
tokenized_batch=<datasets.iterable_dataset.IterableDataset object at 0x7f1fea7c2f20>
next(iter(raw_text_batch))={'text': "No matter which style you choose, you can be sure of one thing: our quality and craftsmanship are the best in the business. It's who we are and what we believe in. And it's evident every day on the factory floor where our dedicated teams take great pride in every stitch.", 'timestamp': '2019-04-19T06:43:50Z', 'url': 'https://institchescustoms.com/katzkin.html'}
next(iter(tokenized_batch))={'input_ids': tensor([ 2949, 2300, 543, 3918, 345, 3853, 11, 345, 460, 307,
1654, 286, 530, 1517, 25, 674, 3081, 290, 5977, 49820,
389, 262, 1266, 287, 262, 1597, 13, 632, 338, 508,
356, 389, 290, 644, 356, 1975, 287, 13, 843, 340,
338, 10678, 790, 1110, 319, 262, 8860, 4314, 810, 674,
7256, 3466, 1011, 1049, 11293, 287, 790, 24695, 13, 50256,
50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256]), 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0])}
Time it took: 102.99258613586426 seconds
Shuffling leads to random access at many different locations on disk, which is slower than reading contiguous data.
There is a super fast approximate shuffling algorithm implemented for iterable datasets though:
iterable_dataset = dataset.to_iterable_dataset(num_shards=1024)
shuffled_dataset = iterable_dataset.shuffle(buffer_size=1000)
(the first batch might be a bit slow to get because the algorithm first fills a buffer before returning the first batch, see the docs for more info)
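For example, a minimal sketch of using this in a loop like the one above, shuffling once outside the loop; the num_shards, buffer_size, seed and batch_size values are illustrative, and `dataset` is assumed to be a map-style datasets.Dataset:

iterable_dataset = dataset.to_iterable_dataset(num_shards=1024)
shuffled_dataset = iterable_dataset.shuffle(buffer_size=1000, seed=42)

# Shuffle once, then keep drawing batches from the same shuffled stream
for batch in shuffled_dataset.iter(batch_size=32):
    pass  # tokenize / process `batch` here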
Consider shuffling bookcorpus:
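For reference, a minimal sketch of what shuffling bookcorpus with the nlp library looks like (the load_dataset arguments are assumptions):

import nlp

dataset = nlp.load_dataset("bookcorpus", split="train")
shuffled = dataset.shuffle()  # the slow step discussed below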
According to tqdm, this will take around 2.5 hours on my machine to complete (even with the faster version of select from #405). I've also tried with keep_in_memory=True and writer_batch_size=1000. But I can also just write the lines to a text file:
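A sketch of that approach, assuming the dataset has a "text" column and using an illustrative batch size:

from tqdm import tqdm

batch_size = 100_000  # illustrative value
with open("tmp.txt", "w") as out_f:
    for i in tqdm(range(0, len(dataset), batch_size)):
        batch = dataset[i : i + batch_size]["text"]  # slicing returns a dict of columns
        out_f.write("\n".join(batch) + "\n")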
Which completes in a couple of minutes, followed by shuf tmp.txt > tmp2.txt, which completes in under a minute. And finally, loading the shuffled text file back as a dataset (see the sketch below), which completes in under 10 minutes.
I read up on Apache Arrow this morning, and it seems like the columnar data format is not especially well-suited to shuffling rows, since moving items around requires a lot of book-keeping.
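A sketch of that last loading step, assuming the plain-text loader:

shuffled_dataset = nlp.load_dataset("text", data_files="tmp2.txt", split="train")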
Is shuffle inherently slow, or am I just using it wrong? And if it is slow, would it make sense to try converting the data to a row-based format on disk and then shuffling? (Instead of calling select with a random permutation, as is currently done.)