Closed mattdangerw closed 3 weeks ago
We should take some care in figuring out how to best do this. I think this is a subtle question we shouldn't rush.
I am somewhat inclined to replicate the lines above directly in `keras_nlp.models.Gpt2` (so the base model outputs vocabulary-space logits). The logit output is really the canonical output of the model and covers the core use case. Using the last embedding dimension would feel like something that could be achieved by slicing into the functional model, as described here: https://github.com/keras-team/keras-nlp/issues/358
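To illustrate the "slicing into the functional model" idea, here is a minimal sketch with a toy stand-in graph (the layer names and architecture are made up for illustration, not the actual keras_nlp GPT-2 graph):

```python
import keras

# Toy stand-in for a transformer LM: token ids -> dense features -> vocab logits.
vocab_size, hidden_dim = 100, 16
token_ids = keras.Input(shape=(None,), dtype="int32")
x = keras.layers.Embedding(vocab_size, hidden_dim)(token_ids)
features = keras.layers.Dense(hidden_dim, name="final_hidden")(x)
logits = keras.layers.Dense(vocab_size, name="lm_logits")(features)
lm_model = keras.Model(token_ids, logits)

# Users who want the dense sequence representation can slice the functional
# graph at the intermediate layer, per keras-team/keras-nlp#358.
feature_model = keras.Model(
    lm_model.input, lm_model.get_layer("final_hidden").output
)
```

The point is that if the base model outputs logits, the dense representation is still one `keras.Model(...)` call away, whereas the reverse is not true.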
The other main option I can see is to keep what we currently have in `keras_nlp.models.Gpt2`, and expose a "head" model with the logits, e.g. `keras_nlp.models.Gpt2TextGenerator` or something like that. This also seems reasonable!
I would lean toward following our current design pattern, with the dense output as the "backbone" and task-specific heads relying on a backbone.
Note that HF also follows this pattern.
I am down for that!
Actually, if we do this approach, I would be somewhat inclined to go "all in" on our backbone naming: `keras_nlp.models.Gpt2Backbone`, `keras_nlp.models.T5Backbone`, `keras_nlp.models.BertBackbone`. But this issue is probably not the place for that :).
Just a note that this exact same convo applies to T5.
The weight sharing is actually very nice. The backbone has all the weights needed for language modeling. If the embedding weights were not shared, we would need to ship separate checkpoints for the "backbone" and the "language modeler", which would be annoying.
Wasn't the same true for BERT pretraining as well (link)? Given our functional model class I don't think this will be a problem.
@jbischof yep! This was an aside, not a critique of what you were proposing. The functional model should be fine to handle it. (I've gotten some warnings about untracked lambda weights when I do this naively with `logits = tf.matmul(x, embedding_transpose)`, but we can figure that out.)
I was interested in T5 here, because if you had a generative model with separate embedding weights, you would basically need to host two versions of every set of weights. E.g. let's say P9, the very cool generative successor to T5, is released with separate weights for the LM logits.
```python
# A smaller group of users who just want a dense representation of the sequence use this.
P9.from_preset("p9_base")

# Most users will want language model logits (for classification, generation, etc.).
# They need a separate checkpoint to avoid retraining a massive d_model x vocab_size weight matrix.
P9TextGenerator.from_preset("p9_base_text_generator")
```
There are other things we could consider doing there too if it did come up. But thankfully, seems like that's not a very practical concern. Shared embeddings are basically universal!
Very good point! It seems like in many cases we can offer the same presets for the LM task and backbone. I wonder if the call `P9TextGenerator(backbone="p9_base")` will be equivalent, and if we will want to encourage one or the other...
Ha! Apparently `P9` exists, and it is actually a follow-up T5 release, t5.1.1: "no parameter sharing between embedding and classifier layer"
The same preset could work if you did something like `self.load_weights(path).expect_partial()` on the backbone. But we can figure that out when we get to it.
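For reference, here is a toy illustration of the `expect_partial()` pattern using `tf.train.Checkpoint` directly (bare `tf.Module`s standing in for a backbone and an untied LM head; not keras_nlp code):

```python
import tensorflow as tf

# Toy modules standing in for a backbone and an untied LM head.
backbone = tf.Module()
backbone.kernel = tf.Variable(tf.ones([3, 4]))
head = tf.Module()
head.kernel = tf.Variable(tf.zeros([4, 10]))

# Save a single checkpoint containing both sets of weights.
full_ckpt = tf.train.Checkpoint(backbone=backbone, head=head)
path = full_ckpt.write("/tmp/p9_demo_ckpt")

# Restore only the backbone from the full checkpoint; expect_partial()
# silences the warnings about the head weights left unrestored.
fresh_backbone = tf.Module()
fresh_backbone.kernel = tf.Variable(tf.zeros([3, 4]))
tf.train.Checkpoint(backbone=fresh_backbone).restore(path).expect_partial()
```

This is how a single preset checkpoint could serve both the backbone-only and the full-task use case.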
The original GPT-2 source contains the following lines, which reverse the final transformer embeddings into output logits over the vocabulary space using the token embedding weights: https://github.com/openai/gpt-2/blob/master/src/model.py#L171-L174

This is currently missing from the GPT-2 graph code we have checked in. For the model to be useful for things like generation (the primary use case), we need to expose this output in some form.
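In shape terms, the referenced lines amount to flattening the final hidden states and multiplying against the transpose of the token embedding table (`wte` in the GPT-2 source). A numpy sketch of that math:

```python
import numpy as np

batch, seq_len, d_model, vocab_size = 2, 5, 8, 50

# wte: the token embedding table, reused as the output projection.
wte = np.random.randn(vocab_size, d_model).astype("float32")
# h: final transformer hidden states.
h = np.random.randn(batch, seq_len, d_model).astype("float32")

# Flatten, project against the transposed embeddings, and reshape back,
# mirroring gpt-2/src/model.py#L171-L174.
flat = h.reshape(-1, d_model)
logits = (flat @ wte.T).reshape(batch, seq_len, vocab_size)
```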