idan-tankel opened this issue 11 months ago
Write a loss wrapper for generate, or use the scores
The `input_ids` that are given to `generate`, unlike in the regular `forward` call, are random model parameters concatenated with the image, so trying to create a `model.forward` variant that does not take `input_ids` is going to be harder than just wrapping the loss for `generate`, since the output of `forward` is conditioned on the `input_ids`. The `input_ids` are not something that helps the model generate - they only represent the proper beginning (the label is not used for that; it is more like "if this was the beginning, what is the expected output?").

That being said, what about the hidden states of the vision part, which make up some of the sequence elements of the output hidden states? (On BLIP-2, they are 32 out of 51.)
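For reference, a minimal sketch (assuming the `Salesforce/blip2-opt-2.7b` checkpoint; the image path and caption are placeholders) of how one could separate the vision/query positions from the text positions in the output hidden states - the first `num_query_tokens` positions (32 for the released checkpoints) are the ones coming from the Q-Former queries:

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

image = Image.open("example.jpg")        # placeholder image
caption = "a photo of something"         # placeholder text prefix

inputs = processor(images=image, text=caption, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# The language model sees [query embeddings | text embeddings], so its hidden states
# have num_query_tokens vision-conditioned positions followed by the text positions
# (e.g. 32 out of 51 in the case described above).
last_hidden = outputs.language_model_outputs.hidden_states[-1]
n_query = model.config.num_query_tokens
vision_part = last_hidden[:, :n_query, :]
text_part = last_hidden[:, n_query:, :]
print(vision_part.shape, text_part.shape)
```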
Problem description
The model loss uses the caption as the instruction (via `input_ids`). In that case, the model then generates some "blob" of text which does not make any sense, like `['</s> - stock image\n']`. The loss is then calculated on that kind of sentence, which does not make much sense as a loss to use. However, there is no BLIP2ForImageCaptioning, and if we pass only the image through the `generate` function to create a zero-shot caption, no loss is calculated at all - the model only returns a tensor of token ids.
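To make the problem concrete, a minimal sketch of the two code paths described above (the checkpoint name, image path, and caption are placeholders):

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

image = Image.open("example.jpg")                      # placeholder image
caption = "a man riding a bike on the street"          # placeholder reference caption

# Path 1: the caption is fed both as input_ids (so it acts as the instruction/prefix)
# and as labels, so forward returns a loss - but it is a loss on the continuation
# the model produces *after* the caption, e.g. "- stock image".
inputs = processor(images=image, text=caption, return_tensors="pt")
out = model(**inputs, labels=inputs["input_ids"])
print(out.loss)

# Path 2: zero-shot captioning - only the image goes through generate,
# and all we get back is a tensor of token ids, no loss.
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values=pixel_values, max_new_tokens=20)
print(processor.batch_decode(generated_ids, skip_special_tokens=True))
```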
Questions / points to test

- What about `use_decoder_only_language_model`? If it's true, the loss is calculated after generating - via a CE loss between the preprocessed labels and the hidden_states of the text model, as seen in the `model.forward()` method of the huggingface implementation.
- May we do this update manually on the zero-shot output of `generate`? Is that a good solution, even though the loss would only be on the output tokens and not on the embeddings? (See the sketch after this list.)
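As one possible answer to the second point, a rough sketch of what a "manual" loss on the zero-shot output could look like (continuing from the snippet above - same `model`, `processor`, `pixel_values`, and `caption`; this is only an assumption about the approach, not something the library provides). Note that the per-step scores returned by `generate` are conditioned on the model's own previous predictions rather than on the reference caption, so this is not the same quantity as the teacher-forced loss inside `forward`:

```python
import torch
import torch.nn.functional as F

gen_out = model.generate(
    pixel_values=pixel_values,
    max_new_tokens=20,
    return_dict_in_generate=True,
    output_scores=True,
)
# gen_out.scores is a tuple with one (batch, vocab_size) logits tensor per generated step
step_logits = torch.stack(gen_out.scores, dim=1)          # (batch, gen_len, vocab_size)

# Cross-entropy between the generation scores and the reference caption tokens,
# truncated to the shorter of the two lengths.
ref_ids = processor.tokenizer(caption, return_tensors="pt").input_ids
steps = min(step_logits.shape[1], ref_ids.shape[1])
manual_loss = F.cross_entropy(
    step_logits[:, :steps, :].reshape(-1, step_logits.shape[-1]),
    ref_ids[:, :steps].reshape(-1),
)
print(manual_loss)
```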
Another observation
When using the `generate` method, the loss is not intended to be calculated by the external BLIP-2 wrapper. Within the internal BLIP-2 generation method, the inputs are modified by `prepare_inputs_for_generation`, and the labels are dropped!
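A quick, hedged way to see this (assuming the OPT variant and the transformers version this was observed on): the dict that `prepare_inputs_for_generation` hands to the language model contains no `labels` key, so nothing downstream of it can compute a loss.

```python
import torch
from transformers import Blip2ForConditionalGeneration

model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

dummy_ids = torch.tensor([[model.config.text_config.bos_token_id]])
prepared = model.language_model.prepare_inputs_for_generation(
    dummy_ids,
    labels=dummy_ids,   # try to sneak labels in the way generate forwards kwargs
)
print(prepared.keys())  # in the version observed here, "labels" is not among the keys
```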