jarednielsen opened this issue 4 years ago.
Update: I think I've found a solution.
```python
output_types = {"input_ids": tf.int64, "token_type_ids": tf.int64, "attention_mask": tf.int64}

def train_dataset_gen():
    for i in range(len(train_dataset)):
        yield train_dataset[i]

tf_dataset = tf.data.Dataset.from_generator(train_dataset_gen, output_types=output_types)
```
This loads WikiText-2 in 20 ms and WikiText-103 in 20 ms. It appears to be lazily loading via indexing into train_dataset.
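As a rough usage sketch (the shuffle buffer size, batch size, and the model below are placeholders, not anything prescribed above), the generator-backed dataset can then go through the usual tf.data transformations:

```python
import tensorflow as tf

# Placeholder downstream usage: shuffle, batch, and prefetch the generator-backed dataset.
train_tf_dataset = (
    tf_dataset
    .shuffle(buffer_size=1000)
    .batch(32)
    .prefetch(tf.data.experimental.AUTOTUNE)
)

# model.fit(train_tf_dataset, epochs=1)  # `model` is a placeholder Keras model
```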
Yes, this is the current best solution. We should probably show it in the tutorial notebook.
Note that this solution unfortunately doesn't allow training on TPUs (yet). See #193
This approach still seems quite slow. When using TFRecords with a similar training loop, I get ~3.0-3.5 it/s on multi-node, multi-GPU training. I also notice a pretty severe performance regression when scaling; the observed numbers are below. Since the allreduce step takes less than 100 ms/it and I've achieved 80% scaling efficiency up to 64 GPUs, the bottleneck must be the data pipeline.
Nodes | GPUs | Iterations/Second |
---|---|---|
1 | 2 | 2.01 |
1 | 8 | 0.81 |
2 | 16 | 0.37 |
Here are performance metrics over 10k steps. The iteration speed appears to follow some sort of caching pattern. I would love to use nlp in my project, but a slowdown from 3.0 it/s to 0.3 it/s is too great to stomach.
An interesting alternative to investigate here would be to use the tf.io library, which has some support for Arrow-to-TF conversion: https://www.tensorflow.org/io/api_docs/python/tfio/arrow/ArrowDataset
Quite a few types are supported, including lists, so if the unsupported columns are dropped we could maybe have a zero-copy mapping from Arrow to TensorFlow, including tokenized inputs and 1D tensors like the ones we mostly use in NLP: https://github.com/tensorflow/io/blob/322b3170c43ecac5c6af9e39dbd18fd747913e5a/tensorflow_io/arrow/python/ops/arrow_dataset_ops.py#L44-L72
Here is an introduction on Arrow to TF using tf.io: https://medium.com/tensorflow/tensorflow-with-apache-arrow-datasets-cdbcfe80a59f
Interesting. There's no support for strings, but it does support ints and floats, so that would work for tokenized inputs.
ArrowStreamDataset requires loading from a "record batch iterator", which can be instantiated from in-memory arrays as described here: https://arrow.apache.org/docs/python/ipc.html.
But the nlp.Dataset stores its data as a pyarrow.lib.Table, and the underlying features are pyarrow.lib.ChunkedArray objects. I can't find any documentation about lazily creating a record batch iterator from a ChunkedArray or a Table. Have you had any success?
I can't find any uses of tfio.arrow.ArrowDataset on GitHub.
You can use to_batches, maybe? https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_batches
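A rough sketch of that direction (note: the `_data` attribute used to reach the underlying pyarrow.Table is an internal detail of nlp.Dataset and may change between versions, and the tfio call mentioned in the comment is only an assumption, so the exact API would need to be checked):

```python
import nlp
import pyarrow as pa

dset = nlp.load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

# `dset._data` is assumed to be the underlying pyarrow.Table (internal attribute).
table = dset._data

# Slice the table into record batches of at most 1024 rows each.
record_batches = table.to_batches(max_chunksize=1024)

# These batches could then, in principle, be handed to tensorflow_io, e.g. something like
# tfio.arrow.ArrowStreamDataset.from_record_batches(...) -- exact API/signature unverified.
for batch in record_batches[:3]:
    print(batch.num_rows, batch.schema.names)
```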
Also note that since #322 it is now possible to do

```python
ids = [1, 10, 42, 100]
batch = dataset[ids]
```
From my experience it is quite fast, but it can take a lot of memory for large batches (I haven't played that much with it). Let me know if you think there could be a better way to implement it. (current code is here)
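For example, one way this indexing could be used for shuffled batching during training (purely a sketch; the batch size is illustrative):

```python
import numpy as np

bsz = 1024

def shuffled_batch_gen():
    # Sketch: one pass over the dataset in shuffled batches, using list-of-indices indexing.
    order = np.random.permutation(len(dataset))
    for start in range(0, len(dataset), bsz):
        ids = order[start : start + bsz].tolist()
        yield dataset[ids]  # dict of column name -> list of values for those rows
```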
Thanks @lhoestq! That format is much better to work with.
I put together a benchmarking script. This doesn't measure the CPU-to-GPU efficiency, nor how it scales with multi-GPU multi-node training where many processes are making the same demands on the same dataset. But it does show some interesting results:
```python
import nlp
import numpy as np
import tensorflow as tf
import time

dset = nlp.load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
dset = dset.filter(lambda ex: len(ex["text"]) > 0)
bsz = 1024
n_batches = 100

def single_item_gen():
    # Yield one example at a time.
    for i in range(len(dset)):
        yield dset[i]

def sequential_batch_gen():
    # Yield contiguous slices of bsz examples.
    for i in range(0, len(dset), bsz):
        yield dset[i:i+bsz]

def random_batch_gen():
    # Yield batches of bsz examples sampled at random (with replacement).
    for i in range(len(dset)):
        indices = list(np.random.randint(len(dset), size=(bsz,)))
        yield dset[indices]

output_types = {"text": tf.string}
single_item = tf.data.Dataset.from_generator(single_item_gen, output_types=output_types).batch(bsz)
interleaved = tf.data.Dataset.range(10).interleave(
    lambda idx: tf.data.Dataset.from_generator(single_item_gen, output_types=output_types),
    cycle_length=10,
)
sequential_batch = tf.data.Dataset.from_generator(sequential_batch_gen, output_types=output_types)
random_batch = tf.data.Dataset.from_generator(random_batch_gen, output_types=output_types)

def iterate(tf_dset):
    start = time.perf_counter()
    for i, batch in enumerate(tf_dset.take(n_batches)):
        pass
    elapsed = time.perf_counter() - start
    print(f"{tf_dset} took {elapsed:.3f} secs")

iterate(single_item)
iterate(interleaved)
iterate(sequential_batch)
iterate(random_batch)
```
Results:
```
<BatchDataset shapes: {text: <unknown>}, types: {text: tf.string}> took 23.005 secs
<InterleaveDataset shapes: {text: <unknown>}, types: {text: tf.string}> took 0.135 secs
<FlatMapDataset shapes: {text: <unknown>}, types: {text: tf.string}> took 0.074 secs
<FlatMapDataset shapes: {text: <unknown>}, types: {text: tf.string}> took 0.550 secs
```
Hey @jarednielsen,
Thanks for this very interesting analysis!! IMHO, to read text data one should use tf.data.TextLineDataset. It would be interesting to compare what you have done with simply loading via a TextLineDataset and see if there is a difference. A good example can be found here: https://www.tensorflow.org/tutorials/load_data/text
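Just to make the suggestion concrete, a minimal sketch of such a comparison could look like this (the local path to the raw WikiText file and the batch size are only illustrative):

```python
import time

import tensorflow as tf

# Hypothetical local path to the raw WikiText-2 training split.
lines = tf.data.TextLineDataset("wikitext-2-raw/wiki.train.raw")
lines = lines.filter(lambda line: tf.strings.length(line) > 0)  # drop empty lines
batches = lines.batch(1024)

start = time.perf_counter()
for batch in batches.take(100):
    pass
print(f"TextLineDataset took {time.perf_counter() - start:.3f} secs")
```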
Thanks! I'm not actually loading raw text data; that was just the synthetic data I created for this benchmark. A more realistic use case would be a dataset of tokenized examples, i.e. a dict of lists of integers. TensorFlow's TextLineDataset greedily loads the dataset into the graph itself, which can lead to out-of-memory errors; one of the main reasons I'm so drawn to the nlp library is its zero-copy, no-RAM approach to dataset loading and mapping.
It's quite helpful for running a preprocessing pipeline - a sample ELECTRA pipeline I've built is here: https://github.com/jarednielsen/deep-learning-models/blob/nlp/models/nlp/common/preprocess.py.
Sorry, I think I expressed myself badly, my bad. What I suggested was to compare the usual way of loading textual data in pure TF with TextLineDataset against nlp. I know TextLineDataset is not recommended for very large datasets, but I was curious to see how it behaves compared to processing with nlp on smaller datasets.
BTW your script looks very interesting, thanks for sharing!!
I'm training on large datasets such as Wikipedia and BookCorpus. Following the instructions in the tutorial notebook, I see the following recommended for TensorFlow:
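Roughly, the recommended snippet builds the dataset with tf.data.Dataset.from_tensor_slices, along these lines (the column names and batch size are just illustrative, the tokenized dataset is assumed to be in train_dataset, and this assumes fixed-length, already-padded sequences):

```python
import tensorflow as tf

# Illustrative version of the tutorial-style conversion: materialize each tokenized
# column as a tensor and build a tf.data.Dataset from in-memory slices.
columns = ["input_ids", "token_type_ids", "attention_mask"]
features = {col: tf.constant(train_dataset[col]) for col in columns}
tf_dataset = tf.data.Dataset.from_tensor_slices(features).batch(32)
```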
This code works for something like WikiText-2. However, scaling up to WikiText-103, the last line takes 5-10 minutes to run. I assume it is because tf.data.Dataset.from_tensor_slices() is pulling everything into memory, not lazily loading. This approach won't scale up to datasets 25x larger such as Wikipedia.
So I tried manual batching using dataset.select() (roughly like the sketch at the end of this post). This appears to create a new Apache Arrow dataset with every batch I grab, and then tries to cache it. The runtime of dataset.select([0, 1]) appears to be much worse than dataset[:2], so using select() doesn't seem performant enough for a training loop.

Is there a performant, scalable way to lazily load batches of nlp Datasets?
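Manual batching with select() along these lines (a sketch only; the column name and batch size are illustrative):

```python
import tensorflow as tf

bsz = 1024

def select_batch_gen():
    # Sketch: grab each batch with dataset.select(), which builds (and tries to cache)
    # a new Arrow dataset per batch.
    for start in range(0, len(train_dataset), bsz):
        batch = train_dataset.select(range(start, min(start + bsz, len(train_dataset))))
        yield {"text": batch["text"]}

tf_dataset = tf.data.Dataset.from_generator(select_batch_gen, output_types={"text": tf.string})
```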