I encountered a bug while trying to execute a docstring code example from the file keras_nlp.src.models.gpt2.causal_lm.py. I have reproduced the example code below:
The following is a comprehensive description of the error, reproduced below and debugged using pdb:
The error is clear: the -1 value. I've traced it to the following function in the file keras.src.backend.tensorflow.trainer:
@tf.autograph.experimental.do_not_convert
def one_step_on_iterator(iterator):
    """Runs a single training step given a Dataset iterator."""
    data = next(iterator)
    outputs = self.distribute_strategy.run(
        one_step_on_data, args=(data,)
    )
    outputs = reduce_per_replica(
        outputs,
        self.distribute_strategy,
        reduction="auto",
    )
    return outputs
The line data = next(iterator) produces the labels, so this is where the -1 value is created. The iterator argument is a TensorFlow OwnedIterator from the file tensorflow.python.data.ops.iterator_ops, and the function it executes is reproduced below:
def _next_internal(self):
    autograph_status = autograph_ctx.control_status_ctx().status
    autograph_disabled = autograph_status == autograph_ctx.Status.DISABLED
    if not context.executing_eagerly() and autograph_disabled:
        self._get_next_call_count += 1
        if self._get_next_call_count > GET_NEXT_CALL_ERROR_THRESHOLD:
            raise ValueError(GET_NEXT_CALL_ERROR_MESSAGE)
    if not context.executing_eagerly():
        # TODO(b/169442955): Investigate the need for this colocation constraint.
        with ops.colocate_with(self._iterator_resource):
            ret = gen_dataset_ops.iterator_get_next(
                self._iterator_resource,
                output_types=self._flat_output_types,
                output_shapes=self._flat_output_shapes)
        return structure.from_compatible_tensor_list(self._element_spec, ret)
This executes gen_dataset_ops.iterator_get_next from the file tensorflow.python.data.ops.gen_dataset_ops and, from there, the underlying op. I didn't trace it further because it leads into C++ code.
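To check whether the -1 is already present in the input data before it reaches the training step, one option is to scan the label batches directly rather than stepping into the C++ op. The following is only a minimal sketch: it assumes each batch is a (features, labels) pair and uses plain Python lists as stand-ins for tensors. The helper find_negative_labels and the sample data are hypothetical and not part of Keras or TensorFlow.

```python
from typing import Any, Iterable, List, Tuple


def find_negative_labels(
    batches: Iterable[Tuple[Any, List[List[int]]]],
) -> List[Tuple[int, int, int]]:
    """Return (batch_index, row, column) for every negative label id."""
    hits = []
    for b, (_, labels) in enumerate(batches):
        for r, row in enumerate(labels):
            for c, value in enumerate(row):
                if value < 0:
                    hits.append((b, r, c))
    return hits


# Hypothetical stand-in data: nested lists instead of real tensors.
batches = [
    ({"token_ids": [[5, 1, 2]]}, [[1, 2, 3]]),
    ({"token_ids": [[5, 1, 2]]}, [[1, 2, -1]]),
]
print(find_negative_labels(batches))  # [(1, 0, 2)]
```

With a real tf.data.Dataset, the same check could presumably be fed from dataset.as_numpy_iterator(), converting each label array with .tolist() first.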
Environment
Linux 6.5.0-26-generic #26~22.04.1-Ubuntu
keras - 3.5.0
python - 3.10.12
tensorflow - 2.17.0
kerasNLP - 0.14.4
Additional tensorflow info:
2024-08-19 12:20:02.135293: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-19 12:20:02.154198: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-19 12:20:02.159831: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-08-19 12:20:02.174579: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-08-19 12:20:03.092334: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-08-19 12:20:04.517556: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:266] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
To Reproduce
Link to a Colab Notebook
Expected behavior
I expected the model to train normally when running the fit() function, without any complications, and to return a History object.
Would you like to help us fix it?