huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

[Question] Best way to batch a large dataset? #315

Open jarednielsen opened 4 years ago

jarednielsen commented 4 years ago

I'm training on large datasets such as Wikipedia and BookCorpus. Following the instructions in the tutorial notebook, I see the following recommended for TensorFlow:

train_tf_dataset = train_tf_dataset.filter(remove_none_values, load_from_cache_file=False)
columns = ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions']
train_tf_dataset.set_format(type='tensorflow', columns=columns)
features = {x: train_tf_dataset[x].to_tensor(default_value=0, shape=[None, tokenizer.max_len]) for x in columns[:3]} 
labels = {"output_1": train_tf_dataset["start_positions"].to_tensor(default_value=0, shape=[None, 1])}
labels["output_2"] = train_tf_dataset["end_positions"].to_tensor(default_value=0, shape=[None, 1])
### Question about this last line ###
tfdataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(8)

This code works for something like WikiText-2. However, when scaling up to WikiText-103, the last line takes 5-10 minutes to run. I assume this is because tf.data.Dataset.from_tensor_slices() pulls everything into memory rather than loading it lazily. This approach won't scale to datasets 25x larger, such as Wikipedia.

So I tried manual batching using dataset.select():

idxs = np.random.randint(len(dataset), size=bsz)
batch = dataset.select(idxs).map(lambda example: {"input_ids": tokenizer(example["text"])})
tf_batch = tf.constant(batch["input_ids"], dtype=tf.int64)

This appears to create a new Apache Arrow dataset for every batch I grab and then tries to cache it. The runtime of dataset.select([0, 1]) appears to be much worse than that of dataset[:2]. So select() doesn't seem performant enough for a training loop.

Is there a performant, scalable way to lazily load batches of nlp Datasets?

jarednielsen commented 4 years ago

Update: I think I've found a solution.

output_types = {"input_ids": tf.int64, "token_type_ids": tf.int64, "attention_mask": tf.int64}
def train_dataset_gen():
    for i in range(len(train_dataset)):
        yield train_dataset[i]
tf_dataset = tf.data.Dataset.from_generator(train_dataset_gen, output_types=output_types)

loads WikiText-2 in 20 ms and WikiText-103 in 20 ms. It appears to load lazily by indexing into train_dataset.
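
For training, the generator-backed dataset can then be shuffled, batched, and prefetched with standard tf.data calls. A minimal sketch, assuming the output_types dict above and examples already padded to a fixed length (the buffer and batch sizes are illustrative):

import tensorflow as tf

# Sketch only: wrap the per-example generator and let tf.data handle
# shuffling, batching, and prefetching (sizes here are illustrative).
tf_dataset = (
    tf.data.Dataset.from_generator(train_dataset_gen, output_types=output_types)
    .shuffle(buffer_size=1000)
    .batch(8)
    .prefetch(tf.data.experimental.AUTOTUNE)
)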

lhoestq commented 4 years ago

Yes this is the current best solution. We should probably show it in the tutorial notebook.

Note that this solution unfortunately doesn't allow training on TPUs (yet). See #193.

jarednielsen commented 4 years ago

This approach still seems quite slow. When using TFRecords with a similar training loop, I get ~3.0-3.5 it/s on multi-node, multi-GPU training. I notice a pretty severe performance regression when scaling; the observed numbers are below. Since the allreduce step takes less than 100 ms/it and I've achieved 80% scaling efficiency up to 64 GPUs, the data pipeline must be the bottleneck.

Nodes | GPUs | Iterations/second
----- | ---- | -----------------
1     | 2    | 2.01
1     | 8    | 0.81
2     | 16   | 0.37
Here are performance metrics over 10k steps. The iteration speed appears to follow some sort of caching pattern. I would love to use nlp in my project, but a slowdown from 3.0 it/s to 0.3 it/s is too great to stomach.

[Screenshot: iteration speed over 10k training steps, fluctuating in a periodic, cache-like pattern]

thomwolf commented 4 years ago

An interesting alternative to investigate here would be to use the tf.io library which has some support for Arrow to TF conversion: https://www.tensorflow.org/io/api_docs/python/tfio/arrow/ArrowDataset

Quite a few types are supported, including lists, so if the unsupported columns are dropped we could maybe get a zero-copy mapping from Arrow to TensorFlow, covering tokenized inputs and the 1D tensors we mostly use in NLP: https://github.com/tensorflow/io/blob/322b3170c43ecac5c6af9e39dbd18fd747913e5a/tensorflow_io/arrow/python/ops/arrow_dataset_ops.py#L44-L72

Here is an introduction on Arrow to TF using tf.io: https://medium.com/tensorflow/tensorflow-with-apache-arrow-datasets-cdbcfe80a59f
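
As a very rough illustration of that path, a sketch only, based on the linked article: it assumes tensorflow-io is installed, that arrow_io.ArrowDataset.from_pandas behaves as described there, and it goes through pandas rather than staying zero-copy.

import pandas as pd
import tensorflow_io.arrow as arrow_io

# Sketch: a toy in-memory frame of integer features (stand-ins for tokenized inputs).
df = pd.DataFrame({"input_ids": [101, 2023, 102], "attention_mask": [1, 1, 1]})

# from_pandas and its preserve_index keyword are assumed from the article above;
# this materializes the frame, so it only shows the Arrow-to-TF conversion API.
ds = arrow_io.ArrowDataset.from_pandas(df, preserve_index=False)
for example in ds.take(2):
    print(example)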

jarednielsen commented 4 years ago

Interesting. There's no support for strings, but it does support ints and floats, so that would work for tokenized inputs.

ArrowStreamDataset requires loading from a "record batch iterator", which can be instantiated from in-memory arrays as described here: https://arrow.apache.org/docs/python/ipc.html.

But nlp.Dataset stores its data as a pyarrow.lib.Table, and the underlying features are pyarrow.lib.ChunkedArray objects. I can't find any documentation on lazily creating a record batch iterator from a ChunkedArray or a Table. Have you had any success?

I can't find any uses of tfio.arrow.ArrowDataset on GitHub.

thomwolf commented 4 years ago

You can use to_batches maybe? https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_batches
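
For example, a sketch only: it assumes the underlying pyarrow Table is reachable through the dataset's data attribute (as discussed above), and max_chunksize is illustrative.

# Sketch: stream record batches out of the underlying Arrow table instead of
# materializing the whole dataset; the .data attribute is an assumption here.
table = dataset.data
def record_batch_gen():
    for record_batch in table.to_batches(max_chunksize=1024):
        yield record_batch.to_pydict()  # dict of column name -> list of values

Each yielded dict could then feed tf.data.Dataset.from_generator as in the earlier snippets.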

lhoestq commented 4 years ago

Also note that since #322 it is now possible to do

ids = [1, 10, 42, 100]
batch = dataset[ids]

From my experience it is quite fast, but it can take lots of memory for large batches (I haven't played that much with it). Let me know if you think there could be a better way to implement it. (current code is here)

jarednielsen commented 4 years ago

Thanks @lhoestq! That format is much better to work with.

I put together a benchmarking script. It doesn't measure CPU-to-GPU transfer efficiency, nor how things scale in multi-GPU, multi-node training where many processes make the same demands on the same dataset. But it does show some interesting results:

import nlp
import numpy as np
import tensorflow as tf
import time

dset = nlp.load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
dset = dset.filter(lambda ex: len(ex["text"]) > 0)
bsz = 1024
n_batches = 100

def single_item_gen():
    for i in range(len(dset)):
        yield dset[i]

def sequential_batch_gen():
    for i in range(0, len(dset), bsz):
        yield dset[i:i+bsz]

def random_batch_gen():
    for i in range(len(dset)):
        indices = list(np.random.randint(len(dset), size=(bsz,)))
        yield dset[indices]

output_types = {"text": tf.string}
single_item = tf.data.Dataset.from_generator(single_item_gen, output_types=output_types).batch(bsz)
interleaved = tf.data.Dataset.range(10).interleave(
    lambda idx: tf.data.Dataset.from_generator(single_item_gen, output_types=output_types),
    cycle_length=10,
)
sequential_batch = tf.data.Dataset.from_generator(sequential_batch_gen, output_types=output_types)
random_batch = tf.data.Dataset.from_generator(random_batch_gen, output_types=output_types)

def iterate(tf_dset):
    start = time.perf_counter()
    for i, batch in enumerate(tf_dset.take(n_batches)):
        pass
    elapsed = time.perf_counter() - start
    print(f"{tf_dset} took {elapsed:.3f} secs")

iterate(single_item)
iterate(interleaved)
iterate(sequential_batch)
iterate(random_batch)

Results:

<BatchDataset shapes: {text: <unknown>}, types: {text: tf.string}> took 23.005 secs
<InterleaveDataset shapes: {text: <unknown>}, types: {text: tf.string}> took 0.135 secs
<FlatMapDataset shapes: {text: <unknown>}, types: {text: tf.string}> took 0.074 secs
<FlatMapDataset shapes: {text: <unknown>}, types: {text: tf.string}> took 0.550 secs

jplu commented 4 years ago

Hey @jarednielsen

Thanks for this very interesting analysis!! IMHO, to read text data one should use tf.data.TextLineDataset. It would be interesting to compare what you have done with simply loading via a TextLineDataset and see if there is a difference.

A good example can be found here https://www.tensorflow.org/tutorials/load_data/text
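
For reference, a minimal sketch of that comparison; the WikiText file path is hypothetical, and bsz and iterate are reused from the benchmark above:

# Sketch: time a plain TextLineDataset over the raw WikiText-2 train file
# for comparison with the nlp-backed pipelines above (the path is hypothetical).
text_line_dset = tf.data.TextLineDataset("wikitext-2-raw/wiki.train.raw").batch(bsz)
iterate(text_line_dset)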

jarednielsen commented 4 years ago

Thanks! I'm not actually loading raw text data; that was just the synthetic data I created for this benchmark. A more realistic use case would be a dataset of tokenized examples, i.e. a dict of lists of integers. TensorFlow's TextLineDataset greedily loads the dataset into the graph itself, which can lead to out-of-memory errors. One of the main reasons I'm so drawn to the nlp library is its zero-copy, no-RAM approach to dataset loading and mapping.

It's quite helpful for running a preprocessing pipeline - a sample ELECTRA pipeline I've built is here: https://github.com/jarednielsen/deep-learning-models/blob/nlp/models/nlp/common/preprocess.py.

jplu commented 4 years ago

Sorry, I think I expressed myself badly, my bad. What I suggested is to compare the usual way of loading textual data in pure TF with TextLineDataset against nlp. I know it isn't recommended for very large datasets, but I was curious to see how it behaves compared to processing with nlp on smaller datasets.

BTW your script looks very interesting, thanks for sharing!!