[bug] TextVectorization + Sequential model doesn't work

WeichenXu123 commented 2 weeks ago

TensorFlow version: 2.19.0-dev20241108

Keras version: 3.7.0.dev2024111103

Installation command: pip install --pre tf-nightly

Reproducing code:

import numpy as np
import tensorflow as tf

def get_text_vec_model(train_samples):
    from tensorflow.keras.layers import TextVectorization
    VOCAB_SIZE = 10
    SEQUENCE_LENGTH = 16
    EMBEDDING_DIM = 16
    vectorizer_layer = TextVectorization(
        max_tokens=VOCAB_SIZE,
        output_mode="int",
        output_sequence_length=SEQUENCE_LENGTH,
    )
    vectorizer_layer.adapt(train_samples)
    model = tf.keras.Sequential(
        [
            vectorizer_layer,
            tf.keras.layers.Embedding(
                VOCAB_SIZE,
                EMBEDDING_DIM,
                name="embedding",
                mask_zero=True,
            ),
            tf.keras.layers.GlobalAveragePooling1D(),
            tf.keras.layers.Dense(16, activation="relu"),
            tf.keras.layers.Dense(1, activation="tanh"),
        ]
    )
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    return model

train_samples = np.array(["this is an example", "another example"], dtype=object)
train_labels = np.array([0.4, 0.2])
model = get_text_vec_model(train_samples)

# Raises ValueError: Invalid dtype: object
model.fit(train_samples, train_labels, epochs=1)

Error stack:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/weichen.xu/miniconda3/envs/mlflow/lib/python3.9/site-packages/keras/src/utils/traceback_utils.py", line 122, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/weichen.xu/miniconda3/envs/mlflow/lib/python3.9/site-packages/optree/ops.py", line 747, in tree_map
    return treespec.unflatten(map(func, *flat_args))
ValueError: Invalid dtype: object

The same code works with keras==3.6.0.

fchollet commented 2 weeks ago

It seems we're no longer detecting object arrays as string arrays, probably because we've upgraded our numpy dependency. Object arrays are ambiguous since they can contain anything, not just strings.

I recommend instead using tf.string tensors, which are explicitly strings and are also much more memory efficient:

train_samples = tf.convert_to_tensor(["this is an example", "another example"])

This would fix your code example.
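
To illustrate why object arrays are ambiguous, here is a minimal NumPy-only sketch. The helper `is_string_object_array` is hypothetical and written only for this example; it is not the check Keras performs internally. It shows that an all-string object array and a mixed object array report the same dtype, so the contents have to be inspected element by element to tell them apart.

```python
import numpy as np


def is_string_object_array(arr):
    """Hypothetical helper: True only if `arr` has dtype object and every element is a str."""
    return arr.dtype == object and all(isinstance(x, str) for x in arr.ravel())


# Same construction as in the report: strings stored in an object array.
strings = np.array(["this is an example", "another example"], dtype=object)
# An object array can just as well hold non-string values.
mixed = np.array(["this is an example", 42, None], dtype=object)

# Both arrays report the same dtype, so the dtype alone says nothing about the contents.
print(strings.dtype, mixed.dtype)

# Only an element-by-element check distinguishes the two.
print(is_string_object_array(strings))
print(is_string_object_array(mixed))

# Without dtype=object, NumPy infers a fixed-width unicode dtype (kind "U").
inferred = np.array(["this is an example", "another example"])
print(inferred.dtype.kind)
```

An explicit string type such as tf.string avoids this per-element inspection, which is the motivation for the recommendation above.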

fchollet commented 2 weeks ago

I fixed it at HEAD, regardless.

keras-team / keras

[bug] TextVectorization + Sequential model doesn't work #20479