KerasNLP Bug/Error at Docstring/Documentation class example provided.

Describe the bug

I encountred an error/bug while trying to execute a docstring code example from the file keras_nlp.src.models.gpt2.causal_lm.py and I have reproduced the example code below:


features = ["a quick fox.", "a fox quick."]
vocab = {"<|endoftext|>": 0, "a": 4, "Ġquick": 5, "Ġfox": 6}
merges = ["Ġ q", "u i", "c k", "ui ck", "Ġq uick"]
merges += ["Ġ f", "o x", "Ġf ox"]

tokenizer = keras_nlp.models.GPT2Tokenizer(
    vocabulary=vocab,
    merges=merges,
)
preprocessor = keras_nlp.models.GPT2CausalLMPreprocessor(
    tokenizer=tokenizer,
    sequence_length=128,
)
backbone = keras_nlp.models.GPT2Backbone(
    vocabulary_size=30552,
    num_layers=4,
    num_heads=4,
    hidden_dim=256,
    intermediate_dim=512,
    max_sequence_length=128,
)
gpt2_lm = keras_nlp.models.GPT2CausalLM(
    backbone=backbone,
    preprocessor=preprocessor,
)
gpt2_lm.fit(x=features, batch_size=2)

The following is a comprehensive description of the error, reproduced below and debugging using pdb:

> /home/humbulani/keras-master/env/lib/python3.10/site-packages/keras_nlp/src/models/causal_lm.py(79)__init__()
-> super().__init__(*args, **kwargs)
(Pdb) c
> /home/humbulani/keras-master/env/lib/python3.10/site-packages/keras_nlp/src/models/causal_lm.py(140)compile()
-> super().compile(
(Pdb) c
> /home/humbulani/keras-master/nlp_example.py(94)<module>()
-> gpt2_lm.fit(x=features, batch_size=2)
(Pdb) c
> /home/humbulani/keras-master/env/lib/python3.10/site-packages/keras_nlp/src/utils/pipeline_model.py(196)fit()
-> return super().fit(
(Pdb) c
> /home/humbulani/keras-master/keras/src/backend/tensorflow/trainer.py(269)fit()
-> self._assert_compile_called("fit")
(Pdb) c
> /home/humbulani/keras-master/keras/src/trainers/epoch_iterator.py(66)__init__()
-> self.data_adapter = data_adapters.get_data_adapter(
(Pdb) c
> /home/humbulani/keras-master/keras/src/backend/tensorflow/trainer.py(331)fit()
-> logs = self.train_function(iterator)
(Pdb) c
> /home/humbulani/keras-master/keras/src/backend/tensorflow/trainer.py(125)one_step_on_iterator()
-> """Runs a single training step given a Dataset iterator."""
(Pdb) c
> /home/humbulani/keras-master/keras/src/backend/tensorflow/trainer.py(50)train_step()
-> x, y, sample_weight = data_adapter_utils.unpack_x_y_sample_weight(data)
(Pdb) c
> /home/humbulani/keras-master/keras/src/losses/losses.py(1724)sparse_categorical_crossentropy()
-> res = ops.sparse_categorical_crossentropy(
(Pdb) c
2024-08-19 12:03:15.742874: W tensorflow/core/framework/op_kernel.cc:1840] OP_REQUIRES failed at sparse_xent_op.cc:103 : INVALID_ARGUMENT: Received a label value of -1 which is outside the valid range of [0, 30552).  Label values: 4 5 6 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 6 5 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2024-08-19 12:03:15.742930: I tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: INVALID_ARGUMENT: Received a label value of -1 which is outside the valid range of [0, 30552).  Label values: 4 5 6 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 6 5 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Traceback (most recent call last):
  File "/home/humbulani/keras-master/nlp_example.py", line 94, in <module>
    gpt2_lm.fit(x=features, batch_size=2)
  File "/home/humbulani/keras-master/env/lib/python3.10/site-packages/keras_nlp/src/utils/pipeline_model.py", line 196, in fit
    return super().fit(
  File "/home/humbulani/keras-master/keras/src/utils/traceback_utils.py", line 122, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/humbulani/keras-master/env/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 5983, in raise_from_not_ok_status
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.InvalidArgumentError: {{function_node __wrapped__SparseSoftmaxCrossEntropyWithLogits_device_/job:localhost/replica:0/task:0/device:CPU:0}} Received a label value of -1 which is outside the valid range of [0, 30552).  Label values: 4 5 6 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 6 5 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 [Op:SparseSoftmaxCrossEntropyWithLogits] name:

Th error is clear: the -1 value. I've traced the error to the following function from the file keras.src.backend.tensorflow.trainer:

@tf.autograph.experimental.do_not_convert
        def one_step_on_iterator(iterator):
            """Runs a single training step given a Dataset iterator."""
            data = next(iterator) 
            outputs = self.distribute_strategy.run(
                one_step_on_data, args=(data,)
            )
            outputs = reduce_per_replica(
                outputs,
                self.distribute_strategy,
                reduction="auto",
            )
            return outputs

The line data=next(iterator) computes the labels and therefore the -1 value is created here. The iterator argument is a tensorflow OwnedIterator and executes from the file tensorflow.python.data.ops.iterator_ops and the executed function reproduced below:

def _next_internal(self):
    autograph_status = autograph_ctx.control_status_ctx().status
    autograph_disabled = autograph_status == autograph_ctx.Status.DISABLED
    if not context.executing_eagerly() and autograph_disabled:
      self._get_next_call_count += 1
      if self._get_next_call_count > GET_NEXT_CALL_ERROR_THRESHOLD:
        raise ValueError(GET_NEXT_CALL_ERROR_MESSAGE)

    if not context.executing_eagerly():
      # TODO(b/169442955): Investigate the need for this colocation constraint.
      with ops.colocate_with(self._iterator_resource):
        ret = gen_dataset_ops.iterator_get_next(
            self._iterator_resource,
            output_types=self._flat_output_types,
            output_shapes=self._flat_output_shapes)
      return structure.from_compatible_tensor_list(self._element_spec, ret)

which executes gen_dataset_ops.iterator_get_next from the file tensorflow.python.data.ops.gen_dataset_ops, and from here to the relevant ops execution which I didn't trace further since it also leads to C++ execution code.

Enviroment

Linux 6.5.0-26-generic #26~22.04.1-Ubuntu
keras - 3.5.0
python - 3.10.12
tensorflow - 2.17.0
kerasNLP - 0.14.4

Additional tensorflow info:

2024-08-19 12:20:02.135293: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-19 12:20:02.154198: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-19 12:20:02.159831: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-08-19 12:20:02.174579: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-08-19 12:20:03.092334: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-08-19 12:20:04.517556: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:266] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected

To Reproduce

Link to a Colab Notebook

Expected behavior

I expected the model to train normally by running the fit() function without any complications and return a History object.

Would you like to help us fix it?

keras-team / keras-hub

KerasNLP Bug/Error at Docstring/Documentation class example provided. #1784

Enviroment