huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

[TFTrainer] Error "iterating over `tf.Tensor` is not allowed" #6362

Closed EibrielInv closed 4 years ago

EibrielInv commented 4 years ago

Environment info

Who can help

Trainer: @sgugger tensorflow: @jplu

Information

Model I am using (Bert, XLNet ...): GPT2

The problem arises when using:

The tasks I am working on is:

To reproduce

Steps to reproduce the behavior:

  1. Install TensorFlow 2.3.0, Transformers 3.0.2

  2. Run the following code:

from transformers import TFGPT2LMHeadModel, TFTrainer, TFTrainingArguments
import tensorflow as tf

tfds_train_dataset = tf.data.Dataset.from_tensor_slices(
    tf.random.uniform([4000, 1024], minval=1, maxval=10, dtype=tf.int32))

model = TFGPT2LMHeadModel.from_pretrained("gpt2")

training_args = TFTrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
)

trainer = TFTrainer(
    model=model,
    args=training_args,
    train_dataset=tfds_train_dataset,
)

trainer.train()

  3. Results in the following output + error:
    
    2020-08-09 01:41:28.331697: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
    2020-08-09 01:41:30.461375: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
    2020-08-09 01:41:30.466239: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
    pciBusID: 0000:01:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
    coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s
    2020-08-09 01:41:30.466271: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
    2020-08-09 01:41:30.468575: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
    2020-08-09 01:41:30.470629: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
    2020-08-09 01:41:30.471013: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
    2020-08-09 01:41:30.473522: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
    2020-08-09 01:41:30.474947: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
    2020-08-09 01:41:30.481193: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
    2020-08-09 01:41:30.482710: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
    2020-08-09 01:41:30.483080: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  FMA
    To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
    2020-08-09 01:41:30.512602: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 3210790000 Hz
    2020-08-09 01:41:30.514335: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4c678f0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
    2020-08-09 01:41:30.514408: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
    2020-08-09 01:41:30.648534: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4c92000 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
    2020-08-09 01:41:30.648597: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
    2020-08-09 01:41:30.650365: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
    pciBusID: 0000:01:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
    coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s
    2020-08-09 01:41:30.650446: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
    2020-08-09 01:41:30.650523: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
    2020-08-09 01:41:30.650586: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
    2020-08-09 01:41:30.650646: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
    2020-08-09 01:41:30.650708: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
    2020-08-09 01:41:30.650767: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
    2020-08-09 01:41:30.650829: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
    2020-08-09 01:41:30.653179: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
    2020-08-09 01:41:30.653232: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
    2020-08-09 01:41:31.392168: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
    2020-08-09 01:41:31.392212: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0 
    2020-08-09 01:41:31.392225: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N 
    2020-08-09 01:41:31.393566: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7389 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
    2020-08-09 01:41:34.003855: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
    2020-08-09 01:41:34.145974: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
    All model checkpoint weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the ckeckpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.
Traceback (most recent call last):
  File "gpt2-training_bug.py", line 26, in <module>
    trainer.train()
  File "/home/gabriel/venv/GPT-Hug/lib/python3.7/site-packages/transformers/trainer_tf.py", line 412, in train
    for step, training_loss in enumerate(self._training_steps(train_ds, optimizer)):
  File "/home/gabriel/venv/GPT-Hug/lib/python3.7/site-packages/transformers/trainer_tf.py", line 459, in _training_steps
    for i, loss in enumerate(self._accumulate_next_gradients(ds)):
  File "/home/gabriel/venv/GPT-Hug/lib/python3.7/site-packages/transformers/trainer_tf.py", line 492, in _accumulate_next_gradients
    yield _accumulate_next()
  File "/home/gabriel/venv/GPT-Hug/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 780, in __call__
    result = self._call(*args, **kwds)
  File "/home/gabriel/venv/GPT-Hug/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 823, in _call
    self._initialize(args, kwds, add_initializers_to=initializers)
  File "/home/gabriel/venv/GPT-Hug/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 697, in _initialize
    *args, **kwds))
  File "/home/gabriel/venv/GPT-Hug/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2855, in _get_concrete_function_internal_garbage_collected
    graph_function, _, _ = self._maybe_define_function(args, kwargs)
  File "/home/gabriel/venv/GPT-Hug/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 3213, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/home/gabriel/venv/GPT-Hug/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 3075, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/home/gabriel/venv/GPT-Hug/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py", line 986, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/home/gabriel/venv/GPT-Hug/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 600, in wrapped_fn
    return weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "/home/gabriel/venv/GPT-Hug/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py", line 973, in wrapper
    raise e.ag_error_metadata.to_exception(e)
tensorflow.python.framework.errors_impl.OperatorNotAllowedInGraphError: in user code:

/home/gabriel/venv/GPT-Hug/lib/python3.7/site-packages/transformers/trainer_tf.py:486 _accumulate_next  *
    per_replica_features, per_replica_labels = next(iterator)
/home/gabriel/venv/GPT-Hug/lib/python3.7/site-packages/tensorflow/python/framework/ops.py:503 __iter__
    self._disallow_iteration()
/home/gabriel/venv/GPT-Hug/lib/python3.7/site-packages/tensorflow/python/framework/ops.py:496 _disallow_iteration
    self._disallow_when_autograph_enabled("iterating over `tf.Tensor`")
/home/gabriel/venv/GPT-Hug/lib/python3.7/site-packages/tensorflow/python/framework/ops.py:474 _disallow_when_autograph_enabled
    " indicate you are trying to use an unsupported feature.".format(task))

OperatorNotAllowedInGraphError: iterating over `tf.Tensor` is not allowed: AutoGraph did convert this function. This might indicate you are trying to use an unsupported feature.



## Expected behavior

Start Training

EibrielInv commented 4 years ago

The following TensorFlow bug could be related: https://github.com/tensorflow/tensorflow/issues/42119

EibrielInv commented 4 years ago

It was just a Dataset setup issue. The correct setup for the Dataset can be seen here: https://github.com/huggingface/transformers/issues/6551
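
For readers who land here with the same error: the traceback shows the trainer unpacking each dataset element as `per_replica_features, per_replica_labels = next(iterator)`, while the reproduction script yields a single tensor per element. A minimal sketch of a dataset that yields `(features, labels)` pairs is below. It rests on assumptions not confirmed in this thread: the `"input_ids"` feature key and the reuse of the input ids as language-modeling labels are illustrative, and the shapes are shrunk so it runs quickly. The linked issue remains the reference for the exact setup that works with `TFTrainer`.

```python
import tensorflow as tf

# Random token ids as in the report, but with a smaller shape (assumption:
# the shape only needs to be [num_examples, sequence_length]).
input_ids = tf.random.uniform([8, 16], minval=1, maxval=10, dtype=tf.int32)

# Each element is now a (features, labels) pair instead of a bare tensor.
# Assumption: for language modeling, the labels are the input ids themselves,
# and the features are passed as a dict keyed by "input_ids".
train_dataset = tf.data.Dataset.from_tensor_slices(
    ({"input_ids": input_ids}, input_ids)
)

# The element structure is a 2-tuple, so two-way unpacking is well defined.
features_spec, labels_spec = train_dataset.element_spec
print(features_spec["input_ids"].shape, labels_spec.shape)
```

Checking `element_spec` before handing the dataset to a trainer is a cheap way to confirm the structure without starting a training run.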