tensorflow / recommenders

TensorFlow Recommenders is a library for building recommender system models using TensorFlow.

'Retrieval' type of code causes error InvalidArgumentError: indices[7505] = 672 is not in [0, 672) #202

Open dgoldenberg-audiomack opened 3 years ago

dgoldenberg-audiomack commented 3 years ago

My code is similar to the retrieval sample. I get the error below. Any ideas? (I've looked at some posts on Stack Overflow, but they don't seem relevant.)

I don't have any dataset.cache().take(k).repeat() occurrences, just this, as in the sample:

cached_train = train.shuffle(100_000).batch(8192).cache()
cached_test = test.batch(4096).cache()

Stack:

WARNING:tensorflow:The dtype of the source tensor must be floating (e.g. tf.float32) when calling GradientTape.gradient, got tf.int32
WARNING:tensorflow:Gradients do not exist for variables ['counter:0'] when minimizing the loss.
WARNING:tensorflow:The dtype of the source tensor must be floating (e.g. tf.float32) when calling GradientTape.gradient, got tf.int32
WARNING:tensorflow:Gradients do not exist for variables ['counter:0'] when minimizing the loss.

Traceback (most recent call last):
  File "/mnt/tmp/spark-9016e2b7-816e-4941-abde-dd6c43c753e8/recsys_tfrs_proto.py", line 352, in <module>
    main(sys.argv)
  File "/mnt/tmp/spark-9016e2b7-816e-4941-abde-dd6c43c753e8/recsys_tfrs_proto.py", line 104, in main
    model = create_and_train_model(movies_ds, test, train, unique_movie_titles, unique_user_ids)
  File "/mnt/tmp/spark-9016e2b7-816e-4941-abde-dd6c43c753e8/recsys_tfrs_proto.py", line 181, in create_and_train_model
    train_and_evaluate(cached_test, cached_train, model, 3)
  File "/mnt/tmp/spark-9016e2b7-816e-4941-abde-dd6c43c753e8/recsys_tfrs_proto.py", line 191, in train_and_evaluate
    model.fit(cached_train, epochs=num_epochs, verbose=0)
  File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 1100, in fit
    tmp_logs = self.train_function(iterator)
  File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 888, in _call
    return self._stateless_fn(*args, **kwds)
  File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2943, in __call__
    filtered_flat_args, captured_inputs=graph_function.captured_inputs)  # pylint: disable=protected-access
  File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1919, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 560, in call
    ctx=ctx)
  File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[7505] = 672 is not in [0, 672)
    [[node sequential/embedding/embedding_lookup (defined at mnt/tmp/spark-9016e2b7-816e-4941-abde-dd6c43c753e8/recsys_tfrs_proto.py:343) ]] [Op:__inference_train_function_530001]

Function call stack: train_function

2021-01-07 23:35:27.709152: W tensorflow/core/kernels/data/cache_dataset_ops.cc:757] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to dataset.cache().take(k).repeat(). You should use dataset.take(k).cache().repeat() instead.
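For context, the failing op is an embedding lookup whose table is smaller than the largest index the lookup layer emits. A minimal sketch of the same error class (an illustration, not the original script; the out-of-range gather raises eagerly on CPU):

        import tensorflow as tf

        # An Embedding with input_dim=672 owns rows 0..671 only.
        emb = tf.keras.layers.Embedding(input_dim=672, output_dim=32)

        # A lookup layer that maps some token to index 672 (e.g. an OOV slot
        # beyond the table) triggers exactly this failure:
        emb(tf.constant([671]))  # fine
        emb(tf.constant([672]))  # InvalidArgumentError: indices[0] = 672 is not in [0, 672)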

dgoldenberg-audiomack commented 3 years ago

To reproduce:

tfrs_202.zip

dgoldenberg-audiomack commented 3 years ago

Some strangeness:

        # The query tower
        user_model = tf.keras.Sequential(
            [
                tf.keras.layers.experimental.preprocessing.IntegerLookup(vocabulary=unique_user_ids),
                # We add an additional embedding to account for unknown tokens.
                tf.keras.layers.Embedding(len(unique_user_ids) + 2, embedding_dimension),
            ]
        )

        # The candidate tower
        movie_model = tf.keras.Sequential(
            [
                tf.keras.layers.experimental.preprocessing.StringLookup(
                    vocabulary=unique_movie_titles, mask_token=None
                ),
                tf.keras.layers.Embedding(len(unique_movie_titles) + 2, embedding_dimension),
            ]
        )

Per the Embedding doc:

input_dim: Integer. Size of the vocabulary, i.e. maximum integer index + 1.

Is there an off-by-one somewhere? Using + 2 lets the code pass, which is confusing.

maciejkula commented 3 years ago

The StringLookup layer can have mask tokens/multiple OOV tokens. A principled and reliable way of getting its output size is to call its vocab_size method.
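To see what vocab_size() actually returns, a quick sketch against the TF 2.4-era experimental layers (in later releases the layer moved out of experimental and the method became vocabulary_size()):

        import tensorflow as tf

        titles = ["Movie A", "Movie B", "Movie C"]

        # Defaults (num_oov_indices=1, mask_token=""): two extra indices are
        # reserved on top of the vocabulary.
        lookup_default = tf.keras.layers.experimental.preprocessing.StringLookup(vocabulary=titles)
        print(lookup_default.vocab_size())  # 5 == len(titles) + 2

        # With mask_token=None, only the single OOV index is reserved.
        lookup_no_mask = tf.keras.layers.experimental.preprocessing.StringLookup(
            vocabulary=titles, mask_token=None
        )
        print(lookup_no_mask.vocab_size())  # 4 == len(titles) + 1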

dgoldenberg-audiomack commented 3 years ago

Great point, @maciejkula, thanks! This worked:

        # The query tower
        u_lookup = tf.keras.layers.experimental.preprocessing.IntegerLookup(vocabulary=unique_user_ids)
        user_model = tf.keras.Sequential(
            [
                u_lookup,
                # vocab_size() already counts the OOV/mask indices; + 1 adds headroom.
                tf.keras.layers.Embedding(u_lookup.vocab_size() + 1, embedding_dimension),
            ]
        )

        # The candidate tower
        c_lookup = tf.keras.layers.experimental.preprocessing.StringLookup(
            vocabulary=unique_movie_titles, mask_token=None
        )
        movie_model = tf.keras.Sequential(
            [
                c_lookup,
                # vocab_size() already counts the OOV index; + 1 adds headroom.
                tf.keras.layers.Embedding(c_lookup.vocab_size() + 1, embedding_dimension),
            ]
        )

I'd like to suggest that the tutorial documentation be updated to reflect this approach; others may run into the same scenario I did.
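For the tutorial, the pattern could be as compact as the sketch below (an illustration only, not an official fix; embedding_dimension and unique_movie_titles are placeholders). Since the lookup's indices run from 0 to vocab_size() - 1, vocab_size() can even be passed to Embedding directly as input_dim:

        import numpy as np
        import tensorflow as tf

        embedding_dimension = 32  # placeholder
        unique_movie_titles = np.array(["Movie A", "Movie B", "Movie C"])  # placeholder

        movie_lookup = tf.keras.layers.experimental.preprocessing.StringLookup(
            vocabulary=unique_movie_titles, mask_token=None
        )
        movie_model = tf.keras.Sequential([
            movie_lookup,
            # vocab_size() == maximum lookup index + 1, which is exactly what
            # Embedding's input_dim expects, OOV slot included.
            tf.keras.layers.Embedding(movie_lookup.vocab_size(), embedding_dimension),
        ])

        print(movie_model(tf.constant(["Movie B", "Unknown Movie"])).shape)  # (2, 32)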