tensorflow / tensorflow

An Open Source Machine Learning Framework for Everyone
https://tensorflow.org
Apache License 2.0

Using tf.data.Dataset has big overhead #38943

Open Flamefire opened 4 years ago

Flamefire commented 4 years ago

System information

Describe the current behavior

Using a Dataset reduces performance by a small but significant amount: ~7% for ImageNet-like data

Describe the expected behavior

Using a Dataset has no, or only a marginal, performance impact

Standalone code to reproduce the issue

import tensorflow as tf
from timeit import timeit

@tf.function
def train_step(x, y):
    model.train_on_batch(x, y)

for useData in (True, False):
    model = tf.keras.applications.ResNet50(weights=None, classes=1000)
    model.compile(
        loss=tf.losses.SparseCategoricalCrossentropy(),
        optimizer=tf.keras.optimizers.SGD(),
        metrics=['accuracy'],
        experimental_run_tf_function=True)

    if useData:
        # Leading dim of 1: from_tensor_slices yields a single pre-batched
        # (32, 224, 224, 3) element per step, avoiding a .batch() stage.
        x = tf.random.uniform([1, 32, 224, 224, 3])
        y = tf.random.uniform([1, 32, 1], minval=0, maxval=999, dtype=tf.int64)
        dataset = tf.data.Dataset.from_tensor_slices((x, y)).repeat()

        def train(steps):
            for x, y in dataset.take(steps):
                train_step(x, y)
    else:
        x = tf.random.uniform([32, 224, 224, 3])
        y = tf.random.uniform([32, 1], minval=0, maxval=999, dtype=tf.int64)

        def train(steps):
            for _ in range(steps):
                train_step(x, y)

    # warmup
    train(2)
    t = timeit(lambda: train(50), number=10)
    print('useData: %s -> %s' % (useData, t))

Sample output:

useData: True -> 89.92945478390902
useData: False -> 86.73652107780799

For more realistic training loops (e.g. including callbacks), the difference is even bigger. Some of my tests:

constant: total images/sec: 496.47 (calculation(497.53) + preprocessing(1.06)) 
dataset:  total images/sec: 465.09 (calculation(478.64) + preprocessing(13.55)) 

The first number is calculated from the total training-loop execution time (after warmup); the second from the train step alone. Their difference, which I called "preprocessing", is the time spent iterating over the dataset (the for loop calling next on the iterator), including the repeat and take Dataset adapters; it would be dominated by preprocessing functions if any were present (none here).

So, two conclusions: getting elements from the iterator seems to be quite costly (preprocessing: 1.06 -> 13.55), and even the training loop itself gets slower (calculation: 497.53 -> 478.64).

This would be a reason to avoid the dataset API.
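
To isolate the iterator cost from the training cost, a micro-benchmark along these lines should work (a sketch; these numbers are not part of the measurements above):

import tensorflow as tf
from timeit import timeit

x = tf.random.uniform([1, 32, 224, 224, 3])
y = tf.random.uniform([1, 32, 1], minval=0, maxval=999, dtype=tf.int64)
dataset = tf.data.Dataset.from_tensor_slices((x, y)).repeat()

def iterate(steps):
    # No model work at all: any time spent here is pure iterator
    # overhead (next() plus the repeat/take adapters).
    for batch in dataset.take(steps):
        pass

iterate(2)  # warmup
print(timeit(lambda: iterate(50), number=10))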

Leslie-Fang commented 4 years ago

Hi @Flamefire Could you try something like this:

if useData:
    x = tf.random.uniform([1, 32, 224, 224, 3])
    y = tf.random.uniform([1, 32, 1], minval=0, maxval=999, dtype=tf.int64)
    dataset = tf.data.Dataset.from_tensor_slices((x, y)).repeat()

    def train(steps):
        ds = dataset.batch(batch_size=steps)  # rebinding `dataset` here would raise UnboundLocalError
        ds = ds.prefetch(1)
        for x, y in ds:  # note: with .repeat() and no .take(), this loop does not terminate
            train_step(x, y)

What's the performance data?

Flamefire commented 4 years ago

Did another experiment: adding @tf.function above the train function slows it down significantly, and using the dataset is now more than twice as fast:

useData: True -> 165.09024090506136
useData: False -> 380.11968565313146

I'm not sure whether that changes semantics, though, and it can't easily be done for the "real" code, as some callbacks can't run inside a tf.function.
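
For reference, the decorated variant looks like this (a sketch reusing the reproducer's definitions; note that in the non-Dataset case tf.function traces the Python range loop, likely unrolling all 50 steps into the graph, which would explain the much larger slowdown there):

@tf.function
def train(steps):
    for x, y in dataset.take(steps):
        train_step(x, y)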

ds = dataset.batch(batch_size=steps)

steps is the number of batches, not the batch size. The batch size is 32; I wanted to avoid the batch stage (which likely makes things worse), which is why the batch dimension is already included in the uniform call.

Putting the prefetch(1) after the take (which is used to limit the number of batches) makes execution slightly slower, but the results can be called the same as the difference is small: useData: True -> 88.12129278900102
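
i.e. the training loop becomes (the same form as in the updated code further below):

def train(steps):
    for x, y in dataset.take(steps).prefetch(1):
        train_step(x, y)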

Saduf2019 commented 4 years ago

@Flamefire I ran the shared code and faced indentation errors. Please provide complete code with all dependencies and indentation so that we can replicate the error you faced. If possible, please provide a colab gist for us to analyse the error.

Flamefire commented 4 years ago

Yes, it seems I missed the code tags on GitHub; see the updated code above (the loop contents were not indented).

Saduf2019 commented 4 years ago

@Flamefire I ran the shared code and faced a different error; please find the gist here

Flamefire commented 4 years ago

@Saduf2019 There seems to be an issue with your Colab instance and/or the way TF 2.1.0 was installed. I just verified this code locally on a system without a GPU and on our cluster with a GPU, and it works fine. Even if the error shown were valid, it would be yet another TF bug, as the code is supposed to work according to the TF documentation (see "custom training loop")

Example why your Colab instance is erroneous: Part of the call stack is

    578         xla_context.Exit()
    579     else:
--> 580       result = self._call(*args, **kwds)
    581 
    582     if tracing_count == self._get_tracing_count():

The related code on my system and on Github is: https://github.com/tensorflow/tensorflow/blob/v2.1.0/tensorflow/python/eager/def_function.py#L568

As you can see, the line numbers don't match, so your Colab instance is using a different version of TF.

I restarted the colab you posted myself and it seems to be doing something, i.e. it is not immediately failing with an error. As it appears to be using the CPU, it has been running for ages now. I'd suggest trying with a GPU, where the runtime is a few seconds.

Saduf2019 commented 4 years ago

@Flamefire I ran the code on gpu, please find the gist here and confirm if it replicated your issue.

Flamefire commented 4 years ago

@Saduf2019 Yes, this seems to work. The speed difference in that colab is only a bit more than 1%, though, so I guess the GPU used is rather slow (total time is ~200s, whereas on my machine it is ~90s).

The difference from using tf.data.Dataset is expected to get smaller the longer the GPU takes for training, so yes, that replicates the issue.

aaudiber commented 4 years ago

@Flamefire The tf.data.Dataset example is slicing a 5D tensor into a 4D tensor (which requires copying the data every step), while the non-Dataset code starts with 4D tensors and therefore doesn't need to copy. To compare apples to apples here, you should define the Dataset data with

x = tf.random.uniform([32, 224, 224, 3])
y = tf.random.uniform([32, 1], minval=0, maxval=999, dtype=tf.int64)
dataset = tf.data.Dataset.from_tensors((x, y)).repeat()

Flamefire commented 4 years ago

@aaudiber Thanks for the suggestion. I'd very much have expected the trivial copy to be completely overlapped by the computation of the not-so-small ResNet, especially as prefetch was used, yet it made no difference.

Tried your suggestion anyway:
With from_tensor_slices and prefetch(1):

useData: True -> 89.55927550140768
useData: False -> 87.00254090223461

With from_tensors and prefetch(1):

useData: True -> 88.65487134549767
useData: False -> 86.93021802790463

So you can see there is an effect, but using the dataset is still slower, especially given that no real work is done by it.

For reference, the updated code:

import tensorflow as tf
from timeit import timeit

@tf.function
def train_step(x, y):
    model.train_on_batch(x, y)

for useData in (True, False):
    model = tf.keras.applications.ResNet50(weights=None, classes=1000)
    model.compile(
        loss=tf.losses.SparseCategoricalCrossentropy(),
        optimizer=tf.keras.optimizers.SGD(),
        metrics=['accuracy'],
        experimental_run_tf_function=True)

    if useData:
        x = tf.random.uniform([32, 224, 224, 3])
        y = tf.random.uniform([32, 1], minval=0, maxval=999, dtype=tf.int64)
        dataset = tf.data.Dataset.from_tensors((x, y)).repeat()

        def train(steps):
            for x, y in dataset.take(steps).prefetch(1):
                train_step(x, y)
    else:
        x = tf.random.uniform([32, 224, 224, 3])
        y = tf.random.uniform([32, 1], minval=0, maxval=999, dtype=tf.int64)

        def train(steps):
            for _ in range(steps):
                train_step(x, y)

    # warmup
    train(2)
    t = timeit(lambda: train(50), number=10)
    print('useData: %s -> %s' % (useData, t))

aaudiber commented 4 years ago

Thanks @Flamefire.

This is a difficult case for tf.data.Dataset because there isn't any preprocessing. tf.data.Dataset usually does preprocessing on the CPU, then transfers the data to the GPU afterward. The tf.data.Dataset example is slower because it is copying the tensors from GPU memory to CPU memory and back each time, while the non-Dataset example starts with the tensors on the GPU and doesn't need to move them at all since there isn't any preprocessing.

Ideally we could use tf.data.experimental.prefetch_to_device to prefetch to the GPU and recover the performance, but there is currently an outstanding bug with prefetch_to_device. Once that gets fixed, the performance should be almost identical when using prefetch_to_device.
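
For reference, usage would look roughly like this once that bug is fixed (a sketch; prefetch_to_device has to be the final transformation in the input pipeline):

dataset = tf.data.Dataset.from_tensors((x, y)).repeat().take(steps)
# Keep prefetch_to_device as the last step of the pipeline so elements
# land in GPU memory before the training step consumes them.
dataset = dataset.apply(
    tf.data.experimental.prefetch_to_device('/gpu:0', buffer_size=1))

for x, y in dataset:
    train_step(x, y)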

jsimsa commented 4 years ago

I would also add that, instead of having to explicitly take care of prefetching to the device yourself through an experimental API, the recommended alternative is to use the tf.distribute API, which takes care of prefetching to the device.
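
A minimal sketch of that, assuming a single GPU and a recent TF version (OneDeviceStrategy plus experimental_distribute_dataset, which prefetches the elements to the device):

strategy = tf.distribute.OneDeviceStrategy(device='/gpu:0')

with strategy.scope():
    model = tf.keras.applications.ResNet50(weights=None, classes=1000)
    optimizer = tf.keras.optimizers.SGD()
    loss_fn = tf.losses.SparseCategoricalCrossentropy()

# Distribute the (already batched) dataset; tf.distribute handles
# prefetching the elements to the device.
dist_dataset = strategy.experimental_distribute_dataset(dataset.take(50))

@tf.function
def dist_step(x, y):
    def step_fn(x, y):
        with tf.GradientTape() as tape:
            loss = loss_fn(y, model(x, training=True))
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
    strategy.run(step_fn, args=(x, y))

for x, y in dist_dataset:
    dist_step(x, y)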

Flamefire commented 4 years ago

I'm surprised by that because (a) preprocessing could be much faster on the GPU (in case it becomes a bottleneck) and (b) it is common practice to overlap host<->device copies with computation. Not having that by default in TF sounds like a major oversight.

Can you elaborate on what you mean by using the tf.distribute API? I would have expected that using a strategy like OneDeviceStrategy(device='/gpu:0') and placing the model and dataset definition inside its scope would be enough, but that didn't have any effect.

Again, I'd expect the strategy and/or at least fit/train_on_batch to be smart enough to overlap copies and computation.

Flamefire commented 3 years ago

The issue still persists in TF 2.4: https://colab.research.google.com/gist/Saduf2019/a69b82f1ab451fd6da3cf50f76c55da7/2.ipynb

tilakrayal commented 3 months ago

Hi,

Thank you for opening this issue. Since this issue has been open for a long time, the code/debug information may no longer be relevant to the current state of the code base.

The TensorFlow team is constantly improving the framework by fixing bugs and adding new features. We suggest you try the latest TensorFlow version with the latest compatible hardware configuration, which could potentially resolve the issue. If you are still facing the issue, please create a new GitHub issue with your latest findings and all the debugging information, which could help us investigate.

Please follow the release notes to stay up to date with the latest developments happening in the TensorFlow space.

github-actions[bot] commented 3 months ago

This issue is stale because it has been open for 7 days with no activity. It will be closed if no further activity occurs. Thank you.

Flamefire commented 3 months ago

I can't easily update the reproducer to TF 2.17, as the same code gives me this error:

NotImplementedError: Cannot convert a symbolic tf.Tensor (StatefulPartitionedCall:0) to a numpy array. This error may indicate that you're trying to pass a Tensor to a NumPy call, which is not supported.

See https://colab.research.google.com/gist/Saduf2019/a69b82f1ab451fd6da3cf50f76c55da7/2.ipynb
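
The error most likely comes from calling model.train_on_batch inside a tf.function, which newer Keras versions don't support. A possible way to update the reproducer (a sketch, not verified against TF 2.17) is to replace train_on_batch with an explicit gradient step:

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
optimizer = tf.keras.optimizers.SGD()

@tf.function
def train_step(x, y):
    # Explicit forward/backward pass instead of model.train_on_batch,
    # so the step can run inside tf.function.
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))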