Use the captioning loss instead of the conditionalGenerationLoss

idan-tankel commented 11 months ago

Problem description

The model loss uses the caption as instruction (via input_ids). In that case, the model than generate some "blob" of text which is not making any sense, like ['</s> - stock image\n']. The loss is than calculated on that kind of sentence - which is not making any sense as a smart loss to use.

However, there is no BLIP2ForImageCaptioning, and If we pass only the image through the generate function to create a zero shot caption, the loss is not being calculated, only model out as tensor of tokens.

Questions / points to test

[ ] in Zero shot captioning described as ConditionalGeneration with no label - what are the input_ids for the internal text model there? are these the image_outs as given by Qformer? Is there "a default prompt" of captioning there?
[ ] In that link they are trying to train on zero shot captioning and passing the label also as input_id. Is that legal? Is this what we need to do?
[ ] What about the switch use_decoder_only_language_model ? If it's true, the loss is calculated after generating - via CELoss between the preprocessed labels to the hidden_states of the text model, see

if labels is not None:
                labels = labels.to(logits.device)
                logits = logits[:, -labels.size(1) :, :]
                # Shift so that tokens < n predict n
                shift_logits = logits[..., :-1, :].contiguous()
                shift_labels = labels[..., 1:].contiguous().to(logits.device)

                # Flatten the tokens
                loss_fct = CrossEntropyLoss(reduction="mean")

                loss = loss_fct(shift_logits.view(-1, self.config.text_config.vocab_size), shift_labels.view(-1))

in model.forward() method of the huggingface model version

may we do this update manually to the zero shot out by generate? is that a good solution, even that the loss is only on outs and not embeddings?

another observations

when using the generate method, the loss is not intend to be calculated by the external BLIP2 wrapper. Within the internal BLIP2 generation method, the inputs are being modified by prepare_inputs_for_generation, and the labels are out!

idan-tankel commented 11 months ago

https://github.com/idan-tankel/SemOOD/blob/5660d8f90f57eb1c7fdf6939346f39ab8dd772c3/SEED-Bench/evaluator/BLIP2Models.py#L163-L164

Write loss wrapper for generate or to use score

The input_ids that are given to the generate ,instead of the regular forward, are random model parameters concatenated with the image, so trying to create model.forward instance that would not get input_ids is going to be harder than just wrapping the loss for generate

idan-tankel commented 11 months ago

the output by forward are conditioned on the input_ids since:

on each step, the generated string is the truncated input_id until a certain position
The label (to compare to) is the next token of the label
The probabilities for the next token are conditioned on the input_id The input_ids are not things that helps the model to generate - they only represent the proper beginning (the label is not used for that, like "if that was the beginning, what is the expected output?")

idan-tankel commented 11 months ago

That being said, what about the hidden states of the vision part? which are some of the sequence elements of the output hidden states? (on Blip2, they are 32 out of 51)

idan-tankel / SemOOD