kohjingyu / fromage

🧀 Code and models for the ICML 2023 paper "Grounding Language Models to Images for Multimodal Inputs and Outputs".
https://jykoh.com/fromage
Apache License 2.0

Weights of `lm_head` were not frozen during training? #18

Closed: ys-zong closed this issue 1 year ago

ys-zong commented 1 year ago

Hi, I saw the discussion in #6 - the grad masking is a very neat implementation! But I was also wondering whether the weights of lm_head are frozen as well. As a minimal snippet:

from transformers import AutoTokenizer, AutoModelForCausalLM

opt_lm_tokenizer = AutoTokenizer.from_pretrained('facebook/opt-125m')
opt_lm = AutoModelForCausalLM.from_pretrained('facebook/opt-125m')

# Freeze all LM parameters.
for param in opt_lm.parameters():
    param.requires_grad = False
for name, param in opt_lm.named_parameters():
    print(name, param.requires_grad)

# Add the [RET] token and resize the embedding matrix.
opt_lm_tokenizer.add_tokens('[RET]')
opt_lm.resize_token_embeddings(len(opt_lm_tokenizer))
input_embeddings = opt_lm.get_input_embeddings()

# Check requires_grad again after resizing.
print('--------------------------------------------')
for name, param in input_embeddings.named_parameters():
    print(name, param.requires_grad)
for name, param in opt_lm.lm_head.named_parameters():
    print(name, param.requires_grad)

After resize_token_embeddings, requires_grad is set back to True not only for the parameters of input_embeddings but also for the parameters of lm_head. So, during training, the whole lm_head keeps updating. But when saving/loading the weights, it seems only the weights of the [RET] token are saved/loaded, e.g. here. I wonder if we should also mask the grad of lm_head during training here? I think it's related to whether the captioning loss should be calculated with the original lm_head or the new one.

kohjingyu commented 1 year ago

Good point - for the OPT model used in this paper, it should not matter if you mask the grad of lm_head, since its embedding matrix and lm_head are tied (shared weights). For other models which do not have tied embeddings, you should also mask the lm_head grads.
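
For reference, here is a minimal sketch (continuing the snippet above; this is an illustration, not the exact code from this repo, and ret_token_idx is assumed to be the index of the newly added [RET] token) of checking whether the weights are tied and masking the lm_head grads when they are not:

import torch

# Check whether the input embedding matrix and lm_head share storage (tied weights).
tied = opt_lm.get_input_embeddings().weight.data_ptr() == opt_lm.get_output_embeddings().weight.data_ptr()
print('lm_head tied to input embeddings:', tied)

if not tied:
    ret_token_idx = len(opt_lm_tokenizer) - 1  # assumes [RET] was the last token added
    mask = torch.arange(opt_lm.lm_head.weight.shape[0]) != ret_token_idx
    # After loss.backward() and before optimizer.step(), zero out every row of the
    # lm_head gradient except the [RET] row, mirroring the param.grad[mask, :] = 0
    # trick used for the input embeddings.
    if opt_lm.lm_head.weight.grad is not None:
        opt_lm.lm_head.weight.grad[mask, :] = 0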

ys-zong commented 1 year ago

Got it, thanks very much!

I have a few more related questions:

  1. I saw a cls_token is added to the tokenizer but it doesn't seem to be used later. Is there any particular reason for this? Also, do we need to train the cls_token embedding vector? After resize_token_embeddings, cls_token should come right before the ret token, i.e. cls_token_idx = ret_token_idx - 1, and since the code has param.grad[mask, :] = 0, my understanding is that this cls_token is left untrained (its final parameters stay as randomly initialized) and unused?
  2. Do you think it would be helpful to actually train a separation token (it could be the <|image|> above) to insert between the caption and the image tokens for in-context visual story retrieval? Maybe this would give the model a better idea of which tokens are images and which are captions.
  3. Some language models' tokenizers don't have a padding token, so we need to add one manually. In this case, do we also need to train the embedding vector of the padding token as we do for the [RET] token?

kohjingyu commented 1 year ago

  1. I saw a cls_token is added to the tokenizer but it doesn't seem to be used later. Is there any particular reason for this? Also, do we need to train the cls_token embedding vector? After resize_token_embeddings, cls_token should come right before the ret token, i.e. cls_token_idx = ret_token_idx - 1, and since the code has param.grad[mask, :] = 0, my understanding is that this cls_token is left untrained (its final parameters stay as randomly initialized) and unused?

The CLS token is unused, yes. This was an artifact left over from early development, so I should probably remove it.

  2. Do you think it would be helpful to actually train a separation token (it could be the <|image|> above) to insert between the caption and the image tokens for in-context visual story retrieval? Maybe this would give the model a better idea of which tokens are images and which are captions.

I think it might be helpful, but I'm not sure. It could also hurt: if the model learns to explicitly distinguish image and text embeddings, some of the generalization/zero-shot abilities might be lost (e.g., you might not be able to interleave images and text as easily). I would be curious to hear your results if you try this!
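
If you do try it, a rough sketch (purely illustrative - the token name and the way the embeddings are concatenated are assumptions, not what this repo does) continuing the snippet above could look like:

import torch

# Hypothetical: add a separator token to mark the boundary before image embeddings.
opt_lm_tokenizer.add_tokens('<|image|>')
opt_lm.resize_token_embeddings(len(opt_lm_tokenizer))
sep_token_idx = opt_lm_tokenizer.convert_tokens_to_ids('<|image|>')

# Look up the separator embedding; caption_embs and visual_embs below are
# hypothetical (batch, seq_len, hidden) tensors from your own pipeline.
sep_emb = opt_lm.get_input_embeddings()(torch.tensor([[sep_token_idx]]))  # (1, 1, hidden)
# input_embs = torch.cat([caption_embs, sep_emb.expand(caption_embs.shape[0], -1, -1), visual_embs], dim=1)
# If you want the <|image|> embedding trained, remember to leave its row out of the gradient mask.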

  3. Some language models' tokenizers don't have a padding token, so we need to add one manually. In this case, do we also need to train the embedding vector of the padding token as we do for the [RET] token?

I think this depends on the task you are trying to do. I don't think the pad token is really used in the training of Fromage, because we ignore padding for the computation of cross entropy loss anyway. I think typically LLMs these days don't use padding since the text data is packed, so my guess would be that you don't need to train a new padding token in general.
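
For reference, a minimal sketch (continuing the snippet above; the [PAD] name is just an example) of adding a pad token without training it, relying on the fact that Hugging Face causal LM losses ignore label positions set to -100:

# Add a pad token and exclude padded positions from the loss,
# so its embedding never actually needs to be trained.
opt_lm_tokenizer.add_special_tokens({'pad_token': '[PAD]'})
opt_lm.resize_token_embeddings(len(opt_lm_tokenizer))

batch = opt_lm_tokenizer(['a short caption', 'a slightly longer caption here'],
                         padding=True, return_tensors='pt')
labels = batch['input_ids'].clone()
labels[batch['attention_mask'] == 0] = -100  # ignored by the cross entropy loss
outputs = opt_lm(**batch, labels=labels)
print(outputs.loss)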

Hope that helps!