mrdbourke / tensorflow-deep-learning

All course materials for the Zero to Mastery Deep Learning with TensorFlow course.
https://dbourke.link/ZTMTFcourse
MIT License

Help with your assignment - build model_5 with custom embedding like model_1 #341

Closed Snir-Dekel closed 2 years ago

Snir-Dekel commented 2 years ago

From: video 285, "Comparing the performance of all of our modelling experiments"
Time in video: 3:53 / 9:36
Demo: https://colab.research.google.com/drive/1qrg2y7bTCVZwWS_xxpBQ8i-Y3EMu1wij?usp=sharing#scrollTo=b5147e40-b745-4e1d-9c9c-730e42ee4501

My implementation code:

from tensorflow.keras import layers

# text_vectorizer and token_embed are created earlier in the notebook
token_inputs = layers.Input(shape=(1,), dtype="string", name="token_inputs")
x = text_vectorizer(token_inputs)       # strings -> token ids
x = token_embed(x)                      # token ids -> custom embeddings (mask_zero=True)
x = layers.GlobalAveragePooling1D()(x)  # average the embeddings across the sequence
token_outputs = layers.Dense(128, activation="relu")(x)

When fitting the model with the code above, I am getting a nan loss.

Original (your code):

token_inputs = layers.Input(shape=[], dtype="string", name="token_inputs")
token_embeddings = tf_hub_embedding_layer(token_inputs)  # pretrained embedding layer from TensorFlow Hub
token_outputs = layers.Dense(128, activation="relu")(token_embeddings)

mrdbourke commented 2 years ago

Hey hey, thank you for the issue.

Looks like it may have been solved though?

@beneyal found a fix and set up this notebook: https://colab.research.google.com/drive/1NQBvn3QcQtcBm4xW00zjG66Zs1cMXYpE?usp=sharing

Turns out it's because the TextVectorizer turns sentences made up only of characters like @ and . into nothing... causing NaNs in the loss.

Some questions from you and explanations from @beneyal:

  1. Why does the loss function return nan? And why only with the custom embeddings?
  2. Why does the loss function keep returning nan for everything after the first nan?

There's one way in which categorical cross entropy will return nan, and that's when the calculation reaches log(0.0), which is part of the categorical cross entropy loss.
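
For intuition, here's a minimal sketch of where that blows up (just raw TensorFlow ops, not the course notebook):

import tensorflow as tf

print(tf.math.log(0.0))         # -inf
print(-1.0 * tf.math.log(0.0))  # inf -> an infinite loss term
print(0.0 * tf.math.log(0.0))   # nan, because 0 * inf is undefined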

If the loss is nan, the optimizer will still take that loss and differentiate it to get the gradients it uses to update the weights of the layers.

Since nan is what is sometimes called "an absorbing element", everything that touches it becomes nan. So every weight affected by this loss is now nan itself, which means every later calculation includes that nan, which finally leads to nan everywhere.
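
A toy example of that "absorbing" behaviour (made-up numbers, not the actual optimizer internals):

weight = 0.3
grad = float("nan")            # one nan gradient coming back from the loss...
weight = weight - 0.01 * grad  # ...turns the updated weight into nan
print(weight)                  # nan, so every forward pass that uses this weight is nan too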

EDIT: Everything I said above is true, but it is not the root cause of the problem in our case.

I traced the problem to the TextVectorizer, which has a "nice" default: standardize='lower_and_strip_punctuation'. As the name suggests, it gets rid of punctuation.

Text like "@ ) ." is all punctuation, so it's stripped completely.

The vectorizer gives us a vector of only zeros.

The embeddings are masking zeros (mask_zero=True), so the GlobalAveragePooling1D doesn't "see" any elements and tries to average over zero elements (a 0/0), giving nan.
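
If it helps, here's a rough standalone sketch that reproduces the whole chain (the vocabulary, sequence length and layer sizes are made up, it's not the SkimLit model; in older TF versions TextVectorization lives under tf.keras.layers.experimental.preprocessing):

import tensorflow as tf
from tensorflow.keras import layers

# default standardize='lower_and_strip_punctuation' strips "@ ) ." down to nothing
text_vectorizer = layers.TextVectorization(max_tokens=100, output_sequence_length=5)
text_vectorizer.adapt(["hello world", "@ ) ."])
token_embed = layers.Embedding(input_dim=100, output_dim=8, mask_zero=True)

tokens = text_vectorizer(tf.constant(["@ ) ."]))  # -> [[0 0 0 0 0]], every token id is 0
embeddings = token_embed(tokens)                  # every position is masked out
pooled = layers.GlobalAveragePooling1D()(embeddings, mask=token_embed.compute_mask(tokens))
print(pooled.numpy())                             # [[nan nan ... nan]], averaging over 0 elements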

So it appears the problem could also have been solved by 1) not stripping punctuation in the TextVectorizer (sketched at the end of this comment), or 2) not masking the embeddings (not recommended).

Anyway, the nan coming out of the pooling layer naturally came out as nan from the loss function, creating the mess mentioned above.
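
Building on the sketch above, option 1) would look something like this (again, just a rough sketch, not the course code):

# keep punctuation so "@ ) ." still maps to real (non-zero) token ids
text_vectorizer = layers.TextVectorization(max_tokens=100,
                                           output_sequence_length=5,
                                           standardize=None)
text_vectorizer.adapt(["hello world", "@ ) ."])
print(text_vectorizer(tf.constant(["@ ) ."])).numpy())  # punctuation now gets non-zero ids, so the pooling layer has unmasked elements to average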