dgoldenberg-audiomack opened 3 years ago
To reproduce, here is the strangeness I ran into in the retrieval tutorial's tower definitions:
# The query tower
user_model = tf.keras.Sequential(
    [
        tf.keras.layers.experimental.preprocessing.IntegerLookup(vocabulary=unique_user_ids),
        # We add an additional embedding to account for unknown tokens.
        tf.keras.layers.Embedding(len(unique_user_ids) + 2, embedding_dimension),
    ]
)

# The candidate tower
movie_model = tf.keras.Sequential(
    [
        tf.keras.layers.experimental.preprocessing.StringLookup(
            vocabulary=unique_movie_titles, mask_token=None
        ),
        tf.keras.layers.Embedding(len(unique_movie_titles) + 2, embedding_dimension),
    ]
)
Per the Embedding doc:

input_dim: Integer. Size of the vocabulary, i.e. maximum integer index + 1.

Is there an off-by-one somewhere? Doing + 2 is what lets the code pass, which is confusing.
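For anyone hitting the same question, here is where the two extra slots come from. This is a minimal sketch assuming the TF 2.4-era defaults of the experimental IntegerLookup (num_oov_indices=1, mask_value=0); exact index assignments may differ in later releases:

import tensorflow as tf

# With the defaults, IntegerLookup reserves index 0 for the mask value and
# index 1 for OOV, so known ids start at index 2 and the largest emitted
# index is len(vocabulary) + 1. Embedding therefore needs
# input_dim = len(vocabulary) + 2.
lookup = tf.keras.layers.experimental.preprocessing.IntegerLookup(vocabulary=[10, 20, 30])
print(lookup(tf.constant([10, 20, 30, 99], dtype=tf.int64)))  # [2 3 4 1]; 99 falls into the OOV slot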
maciejkula commented:

The StringLookup layer can have mask tokens and multiple OOV tokens. A principled and reliable way of getting its output size is to call its vocab_size() method.
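A quick illustration of the vocab_size() approach (again assuming the TF 2.4-era defaults; the two-entry vocabulary is just for demonstration):

import tensorflow as tf

# vocab_size() counts every index the layer can emit, including the OOV slot
# (and a mask slot when one is configured), so its result can be fed straight
# into Embedding's input_dim with no manual "+ 1" / "+ 2" arithmetic.
lookup = tf.keras.layers.experimental.preprocessing.StringLookup(
    vocabulary=["a", "b"], mask_token=None
)
print(lookup.vocab_size())  # 3: two vocabulary entries + one OOV slot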
Great point, @maciejkula, thanks! This worked:
# The query tower
u_lookup = tf.keras.layers.experimental.preprocessing.IntegerLookup(vocabulary=unique_user_ids)

user_model = tf.keras.Sequential(
    [
        u_lookup,
        # We add an additional embedding to account for unknown tokens.
        tf.keras.layers.Embedding(u_lookup.vocab_size() + 1, embedding_dimension),
    ]
)

# The candidate tower
c_lookup = tf.keras.layers.experimental.preprocessing.StringLookup(
    vocabulary=unique_movie_titles, mask_token=None
)

movie_model = tf.keras.Sequential(
    [
        c_lookup,
        # We add an additional embedding to account for unknown tokens.
        tf.keras.layers.Embedding(c_lookup.vocab_size() + 1, embedding_dimension),
    ]
)
I'd suggest updating the tutorial documentation to reflect this approach; others may run into the same scenario I did.
My code is similar to the retrieval sample, and I get the error below. Any ideas? (I looked at some posts on Stack Overflow; they don't seem relevant.)

I don't have any

dataset.cache().take(k).repeat()

occurrences, simply this, as in the sample:

Stack:
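(The snippet and stack trace were truncated above. For context, the caching pattern in the published retrieval tutorial looks roughly like this; the train and test names come from that tutorial:)

# Cache the batched datasets once, then reuse them across epochs.
cached_train = train.shuffle(100_000).batch(8192).cache()
cached_test = test.batch(4096).cache()

model.fit(cached_train, epochs=3)
model.evaluate(cached_test, return_dict=True)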