Closed: ys-zong closed this issue 1 year ago
Good point - for the OPT model used in this paper, it should not matter whether you mask the grad of `lm_head`, since its embedding matrix and `lm_head` are tied (shared weights). For other models that do not have tied embeddings, you should also mask the `lm_head` grads.
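To make the tied/untied distinction concrete, here is a minimal sketch (with hypothetical sizes and a fake forward pass, not the repository's actual code) of masking gradients so that only the newly added [RET] token row is updated, for both the input embeddings and an untied `lm_head`:

```python
import torch
import torch.nn as nn

# Hypothetical sizes; assume the [RET] token occupies the last vocab row.
vocab_size, hidden_dim, ret_token_idx = 8, 4, 7

embed = nn.Embedding(vocab_size, hidden_dim)
lm_head = nn.Linear(hidden_dim, vocab_size, bias=False)

# If the weights were tied, both modules would share the same storage.
tied = embed.weight.data_ptr() == lm_head.weight.data_ptr()

# Fake forward/backward to populate .grad
tokens = torch.tensor([1, 2, ret_token_idx])
logits = lm_head(embed(tokens))
logits.sum().backward()

# Zero the gradient of every row except the [RET] token's row.
mask = torch.arange(vocab_size) != ret_token_idx
embed.weight.grad[mask, :] = 0
if not tied:
    # Untied models need the output projection masked as well.
    lm_head.weight.grad[mask, :] = 0
```

With tied weights the second masking is redundant, since `embed.weight` and `lm_head.weight` are the same tensor; with untied weights, skipping it would let the whole output projection drift during training.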
Got it, thanks very much!
I have a few more related questions:
- I saw a cls_token is added to the tokenizer, but it doesn't seem to be used later. Is there any particular reason for doing this? Also, do we need to train the cls_token embedding vector? After `resize_token_embeddings`, cls_token should come before the [RET] token, i.e. `cls_token_idx = ret_token_idx - 1`. In the code, we have `param.grad[mask, :] = 0`, so my understanding is that this cls_token is left untrained (its final parameters are as randomly initialized) and unused?
The CLS token is unused, yes. This was an artifact left from something early in development, so I should probably remove it.
- Do you think it would be helpful to actually train a separation token (it could be the `<|image|>` above) to insert between caption and image tokens for in-context visual story retrieval? Maybe this would give the model a better idea of which tokens are images and which are captions.
I think it might be helpful, but I'm not sure. It could potentially not be helpful, because the model would learn to distinguish image and text embeddings, and some of the generalization/zero-shot abilities might be lost (e.g., you might not be able to interleave images and text as easily). I would be curious to hear your results if you try this!
- Some language models' tokenizers don't have a padding token, so we need to add one manually. In this case, do we also need to train the embedding vector of the padding token, as we do for the [RET] token?
I think this depends on the task you are trying to do. I don't think the pad token is really used in the training of Fromage, because we ignore padding for the computation of cross entropy loss anyway. I think typically LLMs these days don't use padding since the text data is packed, so my guess would be that you don't need to train a new padding token in general.
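The point about padding being ignored by the loss can be illustrated with a small sketch (hypothetical shapes, not the repository's code): positions whose label is set to the ignore index contribute nothing to the cross-entropy loss, so they receive no gradient.

```python
import torch
import torch.nn.functional as F

# Hypothetical setup: 4 positions over a 6-token vocabulary.
vocab_size = 6
logits = torch.randn(4, vocab_size, requires_grad=True)

# The last two positions are padding; -100 is PyTorch's default ignore_index.
labels = torch.tensor([2, 5, -100, -100])

loss = F.cross_entropy(logits, labels, ignore_index=-100)
loss.backward()
# Gradients at the ignored (padded) positions are exactly zero, so the pad
# token's lm_head row gets no learning signal from the LM objective.
```

This is why, as noted above, a newly added pad token generally does not need a trained embedding: the loss never pushes on it.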
Hope that helps!
Hi, I saw the discussion here #6 - the grad masking is a very neat implementation! But I was also wondering whether the weights of `lm_head` were also frozen. As a minimal snippet: after `resize_token_embeddings`, `requires_grad` is set to True not only for the parameters of the `input_embeddings` but also for the parameters of `lm_head`. So, during training, the whole `lm_head` keeps updating. But when saving/loading the weights, it seems only the weights of the [RET] token are saved/loaded, e.g. here. I wonder if we should also mask the grad of `lm_head` during training here? I think it's related to whether the captioning loss should be calculated with the original `lm_head` or the new one.
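One way to keep `lm_head` effectively frozen except for the [RET] row, without touching the training loop, is a gradient hook that zeroes the other rows during backward. A sketch under hypothetical sizes (this is an alternative to masking `.grad` after `backward()`, not the repository's actual implementation):

```python
import torch
import torch.nn as nn

# Hypothetical sizes; assume the [RET] token occupies the last vocab row.
vocab_size, hidden_dim, ret_token_idx = 8, 4, 7
lm_head = nn.Linear(hidden_dim, vocab_size, bias=False)

mask = torch.arange(vocab_size) != ret_token_idx

def zero_non_ret_rows(grad):
    # Called during backward; return a copy with all rows but [RET] zeroed.
    grad = grad.clone()
    grad[mask, :] = 0
    return grad

lm_head.weight.register_hook(zero_non_ret_rows)

# After any backward pass, only the [RET] row of lm_head.weight.grad is
# nonzero, so an optimizer step leaves the original vocabulary rows intact.
x = torch.randn(3, hidden_dim)
lm_head(x).sum().backward()
```

The hook fires on every backward pass, so it also composes with gradient accumulation, unlike masking `.grad` once after the fact.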