Flamefire opened this issue 4 years ago
Hi @Flamefire, could you try something like this:
if useData:
    x = tf.random.uniform([1, 32, 224, 224, 3])
    y = tf.random.uniform([1, 32, 1], minval=0, maxval=999, dtype=tf.int64)
    dataset = tf.data.Dataset.from_tensor_slices((x, y)).repeat()

    def train(steps):
        dataset = dataset.batch(batch_size=steps)
        dataset = dataset.prefetch(1)
        for x, y in dataset:
            train_step(x, y)
What's the performance data?
Did another experiment: adding @tf.function above the train function slows it down significantly, and using the dataset is now twice as fast:
useData: True -> 165.09024090506136
useData: False -> 380.11968565313146
Not sure if that changes semantics though, and it can't easily be done for the "real" code, as some callbacks can't be run inside a tf.function.
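For illustration, a minimal sketch of that experiment (assuming the train_step and dataset definitions from the reproducer further down; not the exact code used):

@tf.function
def train(steps):
    # The whole Python loop is traced into one graph, so per-step Python and
    # iterator overhead disappears, but Python-side callbacks cannot run inside.
    for x, y in dataset.take(steps):
        train_step(x, y)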
Regarding dataset = dataset.batch(batch_size=steps): steps is the number of batches, not the batch_size. The batch size is 32 (I wanted to avoid the batch layer, which likely makes it worse, hence it is already included in the uniform call).
Putting the prefetch(1) after the take (used to limit the number of batches) makes execution slightly slower, but can be considered the same as the difference is small: useData: True -> 88.12129278900102
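For clarity, the two orderings being compared (a sketch; dataset, steps and train_step as in the reproducer below):

variant_a = dataset.prefetch(1).take(steps)   # prefetch before take
variant_b = dataset.take(steps).prefetch(1)   # prefetch after take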
@Flamefire I ran the code shared and faced indentation errors. Please provide the complete code with all dependencies and indentation so that we can replicate the error. If possible, please provide a Colab gist for us to analyse the error faced.
Yes, it seems that I missed the code tags on GitHub; see the updated code above (the loop contents were not indented).
@Flamefire I ran the code shared and faced a different error, please find the gist here.
@Saduf2019 There seems to be an issue with your Colab instance and/or the way TF 2.1.0 is installed there. I just verified this code locally on a system without a GPU and on our cluster with a GPU, and it works fine. Even if the error shown were valid, it would be yet another bug in TF, as the code is supposed to work according to the TF documentation (see "custom training loop").
An example of why your Colab instance is erroneous: part of the call stack is
578 xla_context.Exit()
579 else:
--> 580 result = self._call(*args, **kwds)
581
582 if tracing_count == self._get_tracing_count():
The related code on my system and on GitHub is: https://github.com/tensorflow/tensorflow/blob/v2.1.0/tensorflow/python/eager/def_function.py#L568
As you can see, the line numbers don't match, so your Colab instance is using a different version of TF.
I restarted the Colab you posted myself and it seems to be doing something, i.e. not immediately failing with an error. As it seems to be using the CPU, it has been running for ages now. I'd suggest trying with a GPU, where the runtime is a few seconds.
@Flamefire I ran the code on GPU, please find the gist here and confirm whether it replicates your issue.
@Saduf2019 Yes, this seems to work. The speed difference in that Colab is only a bit more than 1% though, so I guess the GPU used is rather slow (total time is ~200s, where on my machine it is ~90s).
It is expected that the difference from using tf.data.Dataset gets smaller the longer the GPU takes for training, so yes, that replicates the issue.
@Flamefire The tf.data.Dataset example is slicing a 4D tensor into 3D tensors (which requires copying the data every step), while the non-Dataset code starts with 3D tensors and therefore doesn't need to copy. To compare apples to apples here, you should define the Dataset data with:
x = tf.random.uniform([32, 224, 224, 3])
y = tf.random.uniform([32, 1], minval=0, maxval=999, dtype=tf.int64)
dataset = tf.data.Dataset.from_tensors((x, y)).repeat()
@aaudiber Thanks for the suggestion. I would very much have expected the trivial copy to be completely overlapped by the computation of the not-so-small ResNet, especially as prefetch was used, with no difference.
Tried your suggestion anyway:
With from_tensor_slices and prefetch(1):
useData: True -> 89.55927550140768
useData: False -> 87.00254090223461
With from_tensors and prefetch(1):
useData: True -> 88.65487134549767
useData: False -> 86.93021802790463
So you can see there is an effect, but using the dataset is still slower, especially as no real work is done by it.
For reference, the updated code:
import tensorflow as tf
from timeit import timeit

@tf.function
def train_step(x, y):
    model.train_on_batch(x, y)

for useData in (True, False):
    model = tf.keras.applications.ResNet50(weights=None, classes=1000)
    model.compile(
        loss=tf.losses.SparseCategoricalCrossentropy(),
        optimizer=tf.keras.optimizers.SGD(),
        metrics=['accuracy'],
        experimental_run_tf_function=True)
    if useData:
        x = tf.random.uniform([32, 224, 224, 3])
        y = tf.random.uniform([32, 1], minval=0, maxval=999, dtype=tf.int64)
        dataset = tf.data.Dataset.from_tensors((x, y)).repeat()

        def train(steps):
            for x, y in dataset.take(steps).prefetch(1):
                train_step(x, y)
    else:
        x = tf.random.uniform([32, 224, 224, 3])
        y = tf.random.uniform([32, 1], minval=0, maxval=999, dtype=tf.int64)

        def train(steps):
            for _ in range(steps):
                train_step(x, y)

    # warmup
    train(2)
    t = timeit(lambda: train(50), number=10)
    print('useData: %s -> %s' % (useData, t))
Thanks @Flamefire.
This is a difficult case for tf.data.Dataset because there isn't any preprocessing. tf.data.Dataset usually does preprocessing on the CPU, then transfers the data to the GPU afterward. The tf.data.Dataset example is slower because it is copying the tensors from GPU memory to CPU memory and back each time, while the non-Dataset example starts with the tensors on the GPU and doesn't need to move them at all since there isn't any preprocessing.
Ideally we could use tf.data.experimental.prefetch_to_device to prefetch to the GPU and recover the performance, but there is currently an outstanding bug with prefetch_to_device. Once that gets fixed, the performance should be almost identical when using prefetch_to_device.
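For context, a minimal sketch of what that could look like once prefetch_to_device works (the shapes mirror the reproducer; this is an illustration, not a verified fix):

import tensorflow as tf

x = tf.random.uniform([32, 224, 224, 3])
y = tf.random.uniform([32, 1], minval=0, maxval=999, dtype=tf.int64)
dataset = (tf.data.Dataset.from_tensors((x, y))
           .repeat()
           # Per the TF docs, prefetch_to_device should be the final transformation
           # in the pipeline; it keeps a buffer of elements on the GPU.
           .apply(tf.data.experimental.prefetch_to_device('/gpu:0', buffer_size=1)))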
I would also add that, instead of having to explicitly take care of prefetching to the device yourself through an experimental API, the recommended alternative is to use the tf.distribute API, which takes care of prefetching to the device for you.
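For what it's worth, a minimal sketch of one reading of that suggestion (the strategy choice and the use of model.fit here are assumptions, not the exact recommendation):

import tensorflow as tf

# Build the model under a device strategy scope and let Keras/tf.distribute
# handle placing and prefetching the input pipeline onto the device.
strategy = tf.distribute.OneDeviceStrategy(device='/gpu:0')

with strategy.scope():
    model = tf.keras.applications.ResNet50(weights=None, classes=1000)
    model.compile(loss=tf.losses.SparseCategoricalCrossentropy(),
                  optimizer=tf.keras.optimizers.SGD())

x = tf.random.uniform([32, 224, 224, 3])
y = tf.random.uniform([32, 1], minval=0, maxval=999, dtype=tf.int64)
dataset = tf.data.Dataset.from_tensors((x, y)).repeat()

# Driving training through fit under the strategy distributes the dataset,
# instead of prefetching to the device manually via the experimental API.
model.fit(dataset, steps_per_epoch=50, epochs=1)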
I'm surprised by that because a) preprocessing could be much faster on GPU (in case it becomes a bottleneck) and b) it is common practice to overlap host<->device copies with computation. Not having that by default with TF sounds like a major oversight.
Can you elaborate on what you mean by using the tf.distribute API? I would have expected that using a strategy like OneDeviceStrategy(device='/gpu:0') and placing the model and dataset definition inside it would be enough, but that didn't have any effect.
Again, I'd expect the strategy and/or at least fit/train_on_batch to be smart enough to overlap copies and computation.
The issue still persists in TF 2.4: https://colab.research.google.com/gist/Saduf2019/a69b82f1ab451fd6da3cf50f76c55da7/2.ipynb
Hi,
Thank you for opening this issue. Since this issue has been open for a long time, the code/debug information for this issue may not be relevant with the current state of the code base.
The Tensorflow team is constantly improving the framework by fixing bugs and adding new features. We suggest you try the latest TensorFlow version with the latest compatible hardware configuration which could potentially resolve the issue. If you are still facing the issue, please create a new GitHub issue with your latest findings, with all the debugging information which could help us investigate.
Please follow the release notes to stay up to date with the latest developments which are happening in the Tensorflow space.
This issue is stale because it has been open for 7 days with no activity. It will be closed if no further activity occurs. Thank you.
I can't easily update the reproducer to TF 2.17 as the same code is giving me this error:
NotImplementedError: Cannot convert a symbolic tf.Tensor (StatefulPartitionedCall:0) to a numpy array. This error may indicate that you're trying to pass a Tensor to a NumPy call, which is not supported.
See https://colab.research.google.com/gist/Saduf2019/a69b82f1ab451fd6da3cf50f76c55da7/2.ipynb
System information
Describe the current behavior
Using a Dataset reduces performance by a small but significant amount, ~7% for ImageNet-like data.
Describe the expected behavior
Using a Dataset has no or only marginal performance impact.
Standalone code to reproduce the issue
Sample output: useData: True -> 89.92945478390902 useData: False -> 86.73652107780799
For more realistic training loops (e.g. including callbacks) the difference is even bigger. Some of my tests:
The first number is calculated from the training loop execution time (after warmup), the latter from the train step only; the difference (to the first number) is what I called "preprocessing", as it is spent iterating over the dataset (the for loop calling next on the iterator) and is hence dominated by preprocessing functions if present (none here), including the repeat and take Dataset adapters.
So two conclusions: getting elements from the iterator seems to be quite costly (1 -> 13.6), and even the training loop itself gets slower (498 -> 479).
This would be a reason to avoid the dataset API.
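For reference, a minimal sketch of how such a split could be measured (assuming the dataset and train_step from the reproducer above; not the exact script used for the numbers):

import time

def timed_train(steps):
    step_time = 0.0
    loop_start = time.perf_counter()
    for x, y in dataset.take(steps).prefetch(1):
        t0 = time.perf_counter()
        train_step(x, y)
        step_time += time.perf_counter() - t0
    loop_time = time.perf_counter() - loop_start
    # Note: with asynchronous GPU execution, host-side timings like this are
    # only an approximation of where the time is actually spent.
    print('loop: %.3fs, train step: %.3fs, iterator/"preprocessing": %.3fs'
          % (loop_time, step_time, loop_time - step_time))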