Closed mattdangerw closed 3 weeks ago
We should take some care in figuring out how to best do this. I think this is a subtle question we shouldn't rush.
I am somewhat inclined to replicate the lines above directly in `keras_nlp.models.Gpt2` (so the base model outputs vocabulary-space logits). The logit output is really the canonical output of the model and covers the core use case. Using the last embedding dimension would feel like something that could be achieved by slicing into the functional model, as described here: https://github.com/keras-team/keras-nlp/issues/358
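To illustrate the "slicing into the functional model" idea, here is a minimal sketch with a toy stand-in graph (the layer names and architecture are made up for illustration, not the actual keras_nlp GPT-2 graph):

```python
import keras

# Toy stand-in for a transformer LM: token ids -> dense features -> vocab logits.
vocab_size, hidden_dim = 100, 16
token_ids = keras.Input(shape=(None,), dtype="int32")
x = keras.layers.Embedding(vocab_size, hidden_dim)(token_ids)
features = keras.layers.Dense(hidden_dim, name="final_hidden")(x)
logits = keras.layers.Dense(vocab_size, name="lm_logits")(features)
lm_model = keras.Model(token_ids, logits)

# Users who want the dense sequence representation can slice the functional
# graph at the intermediate layer, per keras-team/keras-nlp#358.
feature_model = keras.Model(
    lm_model.input, lm_model.get_layer("final_hidden").output
)
```

The point is that if the base model outputs logits, the dense representation is still one `keras.Model(...)` call away, whereas the reverse is not true.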
The other main option I can see is to keep what we currently have in `keras_nlp.models.Gpt2`, and expose a "head" model with the logits, e.g. `keras_nlp.models.Gpt2TextGenerator` or something like that. This also seems reasonable!
I would lean toward following our current design pattern, with the dense output as the "backbone" and task-specific heads relying on a backbone.
Note that HF also follows this pattern.
I am down for that!
Actually, if we do this approach, I would be somewhat inclined to go "all in" on our backbone naming: `keras_nlp.models.Gpt2Backbone`, `keras_nlp.models.T5Backbone`, `keras_nlp.models.BertBackbone`. But this issue is probably not the place for that :).
Just a note that this exact same convo applies to T5.
The weight sharing is actually very nice. The backbone has all the weights needed for language modeling. If the embedding weights were not shared, we would need to ship separate checkpoints for the "backbone" and the "language modeler", which would be annoying.
Wasn't the same true for BERT pretraining as well (link)? Given our functional model class I don't think this will be a problem.
@jbischof yep! This was an aside, not a critique of what you were proposing. The functional model should be fine to handle it. (I've gotten some warnings about untracked lambda weights when I do this naively with `logits = tf.matmul(x, embedding_transpose)`, but we can figure that out.)
I was interested in T5 here, because if you had a generative model with separate embedding weights, you would basically need to host two versions of every set of weights. E.g. let's say P9, the very cool generative successor to T5, is released with separate weights for the LM logits.
```python
# A smaller group of users who just want a dense representation of the sequence use this.
P9.from_preset("p9_base")

# Most users will want language model logits (for classification, generation, etc.).
# They need a separate checkpoint to avoid retraining a massive d_model x vocab_size weight matrix.
P9TextGenerator.from_preset("p9_base_text_generator")
```
There are other things we could consider doing there too if it did come up. But thankfully, seems like that's not a very practical concern. Shared embeddings are basically universal!
Very good point! It seems like in many cases we can offer the same presets for the LM task and backbone. I wonder if the call `P9TextGenerator(backbone="p9_base")` will be equivalent, and if we will want to encourage one or the other...
Ha! Apparently `P9` exists, and it is actually a follow-up T5 release, t5.1.1: "no parameter sharing between embedding and classifier layer"
The same preset could work if you did something like `self.load_weights(path).expect_partial()` on the backbone. But we can figure that out when we get to it.
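For reference, here is a toy illustration of the `expect_partial()` pattern using `tf.train.Checkpoint` directly (bare `tf.Module`s standing in for a backbone and an untied LM head; not keras_nlp code):

```python
import tensorflow as tf

# Toy modules standing in for a backbone and an untied LM head.
backbone = tf.Module()
backbone.kernel = tf.Variable(tf.ones([3, 4]))
head = tf.Module()
head.kernel = tf.Variable(tf.zeros([4, 10]))

# Save a single checkpoint containing both sets of weights.
full_ckpt = tf.train.Checkpoint(backbone=backbone, head=head)
path = full_ckpt.write("/tmp/p9_demo_ckpt")

# Restore only the backbone from the full checkpoint; expect_partial()
# silences the warnings about the head weights left unrestored.
fresh_backbone = tf.Module()
fresh_backbone.kernel = tf.Variable(tf.zeros([3, 4]))
tf.train.Checkpoint(backbone=fresh_backbone).restore(path).expect_partial()
```

This is how a single preset checkpoint could serve both the backbone-only and the full-task use case.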
The original GPT-2 source contains the following lines, which reverse the final transformer embeddings into output logits over the vocabulary space using the token embedding weights: https://github.com/openai/gpt-2/blob/master/src/model.py#L171-L174

This is currently missing from the GPT-2 graph code we have checked in. For the model to be useful for things like generation (the primary use case), we need to expose this output in some form.
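In shape terms, the referenced lines amount to flattening the final hidden states and multiplying against the transpose of the token embedding table (`wte` in the GPT-2 source). A numpy sketch of that math:

```python
import numpy as np

batch, seq_len, d_model, vocab_size = 2, 5, 8, 50

# wte: the token embedding table, reused as the output projection.
wte = np.random.randn(vocab_size, d_model).astype("float32")
# h: final transformer hidden states.
h = np.random.randn(batch, seq_len, d_model).astype("float32")

# Flatten, project against the transposed embeddings, and reshape back,
# mirroring gpt-2/src/model.py#L171-L174.
flat = h.reshape(-1, d_model)
logits = (flat @ wte.T).reshape(batch, seq_len, vocab_size)
```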