AakashKumarNain opened this issue 1 year ago
For context: #1201 and subsequently https://github.com/keras-team/keras/issues/18419
Thanks for the context @shivance
Do you have pointers for the second question as well?
cc: @mattdangerw
@AakashKumarNain Maybe you could directly use TokenAndPositionEmbedding.
That would require me to train the models again
I tried replacing the custom PositionalEmbedding layer with the TokenAndPositionEmbedding layer, and the training stats are way off now. A few interesting points:
QQ: After we made that change, did we train some models to validate that it is working as expected, especially for pretraining tasks?
@AakashKumarNain if the initialization isn't optimal, could you try overriding the embeddings_initializer arg and report back any strong findings?
Unfortunately e2e tests for training workflows are extremely expensive and problem specific.
> Unfortunately e2e tests for training workflows are extremely expensive and problem specific.
@jbischof I understand, but common tasks like pretraining a BERT-like model on a small dataset (e.g. Wikitext) don't take too much time. The problem with not doing this exercise after a breaking change is that unless you train the model, you won't know if the changes broke something else. Loading weights and testing some outputs doesn't give a full picture of a breaking change.
@AakashKumarNain sorry about the breakage here!
For the original question above... would it work to replace the token_embedding with a keras_nlp.layers.ReversibleEmbedding, and pass that token embedding to the masked language model head?
```python
import keras_nlp
from keras import layers


class PositionalEmbedding(layers.Layer):
    def __init__(self, seq_length, input_dim, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.seq_length = seq_length
        self.input_dim = input_dim
        self.output_dim = output_dim
        # ReversibleEmbedding so the same token table can be shared with MaskedLMHead.
        self.token_embeddings = keras_nlp.layers.ReversibleEmbedding(input_dim=input_dim, output_dim=output_dim)
        self.position_embeddings = layers.Embedding(input_dim=seq_length, output_dim=output_dim)
        self.supports_masking = True

    ...


embedding = PositionalEmbedding(...)

masked_lm_head = keras_nlp.layers.MaskedLMHead(
    token_embedding=embedding.token_embeddings,
    activation=...,
)(encoded_tokens, masked_positions)
```
If your custom layer is registered as serializable, you should be able to just reload the old model, I think. But it's hard to know without knowing exactly how you are serializing. You could also consider doing old_model.save_weights(...) with the old package version, and just loading via load_weights into a new model.
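A minimal sketch of that weight-migration path, assuming you have some way to rebuild the architecture against each package version (the `build_old_model` / `build_new_model` helpers and file names below are hypothetical placeholders):

```python
# With the *old* keras-nlp version installed: rebuild the old architecture,
# restore it from the existing checkpoint, and persist only the weights.
old_model = build_old_model()                        # hypothetical helper
old_model.load_weights("old_checkpoint.weights.h5")  # hypothetical path
old_model.save_weights("migrated.weights.h5")

# After upgrading: rebuild the model against the new MaskedLMHead API
# (passing a ReversibleEmbedding layer instead of raw weights), then load
# the saved weights into the new model.
new_model = build_new_model()                        # hypothetical helper
new_model.load_weights("migrated.weights.h5")
```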
For some general context, we did decide to move away from the old way of passing embedding weights to the MaskedLMHead because it would not save correctly with the upcoming Keras 3. Keras 3 will support deduplicating saved objects at the layer level, but not at the weights level, so this could lead to some unexpected behavior from the layer after a round trip of saving and loading.
In general we try to avoid breakages like this, but Keras 3 is a bit of an exception in that it is a large change and some API incompatibilities are inevitable.
We could however do a better job of documenting any known breaking changes when they occur in our release notes. This is a good reminder. And in general this sort of churn should die down after we release the next major version of Keras.
Thanks for the detailed response @mattdangerw
> would it work to replace the token_embedding with a keras_nlp.layers.ReversibleEmbedding
I can try that, and will report back. Btw, directly using the TokenAndPositionEmbedding layer from keras_nlp didn't give me the desired results, as pointed out above.
> In general we try to avoid breakages like this, but Keras 3 is a bit of an exception in that it is a large change and some API incompatibilities are inevitable.
I am aware of this. In fact, I was writing this code to add it to the code examples with the JAX backend, so this is a bit ironic. When I noticed the change, I was surprised because I was under the impression that, going forward, we would maintain backward compatibility to the fullest. I missed the issue linked in the Keras repo for the same. Let me try retraining the whole thing again.
PS: I still recommend testing e2e for breaking changes
Sounds good, keep us posted!
And yeah, the TokenAndPositionEmbedding initialization is an interesting question. I don't think there is any way to guarantee stable training performance with arbitrary architectures and optimizers; that is a non-goal. The fact that we use the same initializer as keras.layers.Embedding is good for the overall consistency of Keras.
If you are training a transformer from scratch, I would strongly recommend supplying initializers to most of your layers and testing them for your particular architecture/optimization strategy. You could try keras.initializers.TruncatedNormal(stddev=0.02) as a magic constant from BERT that has been copied by many LLMs since.
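A hedged sketch of what supplying that initializer might look like when wiring up a small from-scratch encoder; the sizes are purely illustrative, and it assumes these keras_nlp layers expose `embeddings_initializer` / `kernel_initializer` arguments as in recent versions:

```python
import keras
import keras_nlp

VOCAB_SIZE, SEQ_LENGTH, HIDDEN_DIM = 30_000, 128, 256  # illustrative sizes

# The BERT-style "magic constant" initializer mentioned above.
init = keras.initializers.TruncatedNormal(stddev=0.02)

embedding = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=VOCAB_SIZE,
    sequence_length=SEQ_LENGTH,
    embedding_dim=HIDDEN_DIM,
    embeddings_initializer=init,  # currently applies to both token and position tables
)
encoder = keras_nlp.layers.TransformerEncoder(
    intermediate_dim=4 * HIDDEN_DIM,
    num_heads=4,
    kernel_initializer=init,  # dense and attention projection kernels
)
```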
I would consider it a bug that we don't support setting the position embedding initializer and the token embedding initializer separately. We should probably extend our init arguments to allow this.
> PS: I still recommend testing e2e for breaking changes
An integration test asserting stability over a bit of training with a from-scratch transformer is a good idea; I will open up an issue. Though the biggest fish we have to fry is going to be continuous multi-framework benchmarking for preset models as we launch Keras 3.
@AakashKumarNain Did you figure out a solution for this part?

```python
outputs = keras_nlp.layers.MaskedLMHead(
    embedding_weights=embedding_layer.token_embedding.embeddings,
    activation="softmax",
)(encoded_tokens, mask_positions=inputs["mask_positions"])
```

I was following a tutorial to pretrain BERT on WikiText and am currently stuck at this error. Really appreciate your reply.
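For reference, a hedged sketch of how that call from the tutorial might be ported to the newer API, based on the suggestion earlier in this thread; it assumes `embedding_layer.token_embedding` is (or is replaced by) a keras_nlp.layers.ReversibleEmbedding instance, and the exact argument name may differ across keras_nlp versions:

```python
# Newer-style MaskedLMHead: pass the embedding *layer* rather than its weights tensor.
outputs = keras_nlp.layers.MaskedLMHead(
    token_embedding=embedding_layer.token_embedding,  # a ReversibleEmbedding instance
    activation="softmax",
)(encoded_tokens, mask_positions=inputs["mask_positions"])
```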
I noticed a change that was introduced in the MaskedLMHead layer, and it broke my entire workflow. Earlier, the MaskedLMHead signature let us pass the embedding matrix directly to the layer; right now, it expects an instance of a reversible embedding layer. I have two questions around this change: