salesforce / BLIP

PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
BSD 3-Clause "New" or "Revised" License

minimizing difference across image and text features #4

Open · dribnet opened this issue 2 years ago

dribnet commented 2 years ago

Took a stab at creating "BLIP guided imagery" using VQGAN. The general idea is to start with a reference text embedding and then steer an image to minimize the angle between its embedding and the reference text embedding. I coded this up, but there seemed to be no relationship between the encoded text and the resulting image. For example, this is a "sunset":

[image]

This leads me to believe that the feature spaces are not aligned as they are in CLIP. This seems to be confirmed with the large model, which has different-sized vectors: the image_features are now of size 1024 while the text features are still of size 768.

Is my assumption correct, and if so, is there a simple transformation or mapping between the feature spaces? One caveat is that I don't fully understand the shape of the returned features, so I am simply extracting the first element, as was done in the Feature Extraction example in the demo notebook.

dribnet commented 2 years ago

Update: I've discovered ITC from Li et al. and am looking into this loss mechanism. https://arxiv.org/abs/2107.07651

LiJunnan1992 commented 2 years ago

Hi, thanks for your interest in BLIP!

The feature extraction example does not produce features suitable for measuring image-text similarity. I would suggest the following changes:

  1. Could you please follow models/blip_pretrain.py and add these lines to the model's init function? The image and text features need to go through another projection layer before they can be used to compute a cosine similarity (see the sketch after this list).

        self.vision_proj = nn.Linear(vision_width, embed_dim)  # projects image features into the shared embedding space
        self.text_proj = nn.Linear(text_width, embed_dim)      # projects text features into the shared embedding space
  2. Different from CLIP, BLIP has an image-text matching (ITM) head, which is much better at computing image-text similarity. ITM uses cross-attention to fuse image and text features, so it can capture finer-grained similarity than the simple cosine similarity function used by the ITC loss and CLIP. To compute the ITM score, you first extract the multimodal feature and then pass it to self.itm_head, which produces a two-dimensional vector representing the scores for [negative, positive].
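For step 1, a minimal sketch of the resulting ITC-style similarity (assuming the projection layers above carry pre-trained weights and the [CLS] token is used as the global feature, as in models/blip_pretrain.py; itc_cosine_similarity is an illustrative helper, not part of the repo):

    import torch.nn.functional as F

    def itc_cosine_similarity(model, image_embeds, text_hidden_states):
        # Project the [CLS] features into the shared ITC embedding space and
        # L2-normalize them, so the dot product below is a cosine similarity.
        image_feat = F.normalize(model.vision_proj(image_embeds[:, 0, :]), dim=-1)
        text_feat = F.normalize(model.text_proj(text_hidden_states[:, 0, :]), dim=-1)
        return image_feat @ text_feat.t()  # rows: images, columns: texts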

I highly recommend trying out the ITM score. You could use either our pre-trained model or the model fine-tuned on COCO for image-text retrieval.
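Roughly, the ITM computation would look like this (a sketch assuming the model exposes visual_encoder, a text_encoder with cross-attention, and itm_head as in models/blip_pretrain.py; itm_match_probability is an illustrative helper):

    import torch
    import torch.nn.functional as F

    def itm_match_probability(model, image, text_input):
        # text_input: tokenizer output with input_ids and attention_mask.
        # Encode the image into a sequence of patch features.
        image_embeds = model.visual_encoder(image)
        image_atts = torch.ones(image_embeds.size()[:-1], dtype=torch.long,
                                device=image.device)

        # Fuse text and image with cross-attention to get the multimodal feature.
        output = model.text_encoder(text_input.input_ids,
                                    attention_mask=text_input.attention_mask,
                                    encoder_hidden_states=image_embeds,
                                    encoder_attention_mask=image_atts,
                                    return_dict=True)
        multimodal_feat = output.last_hidden_state[:, 0, :]  # [CLS] position

        # Two logits: [negative, positive]; softmax gives the match probability.
        itm_logits = model.itm_head(multimodal_feat)
        return F.softmax(itm_logits, dim=1)[:, 1]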

Let me know if you need more help or demo code for the ITM score. Looking forward to seeing BLIP shine for VQGAN!

dribnet commented 2 years ago

Thanks Junnan for that explanation!

I would certainly use any "text <-> image" scoring example you might have as a reference, but I can also give it a go myself over the next few days. I have a pipeline that makes it easy to "swap out" the perception models (BLIP vs. CLIP or SLIP), so it will be very interesting to see visually whether BLIP can capture any finer-grained details from various text descriptions.

LiJunnan1992 commented 2 years ago

I have updated the demo with a new section to compute image-text matching scores, using either the ITM head (w/ cross attention) or the ITC head (feature cosine similarity). Let me know if you have any questions!
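For anyone following along, usage of the updated demo looks roughly like this (a sketch based on models/blip_itm.py; the checkpoint path and the dummy image tensor are placeholders):

    import torch
    import torch.nn.functional as F
    from models.blip_itm import blip_itm

    # Placeholder checkpoint path: use the pre-trained or COCO-retrieval weights.
    model = blip_itm(pretrained='path/to/model_base_retrieval_coco.pth',
                     image_size=384, vit='base')
    model.eval()

    caption = 'a blueprint of a steampunk submarine'
    image = torch.randn(1, 3, 384, 384)  # dummy; preprocess a real image as in the demo

    # ITM head: cross-attention fusion, then softmax over [negative, positive] logits.
    itm_logits = model(image, caption, match_head='itm')
    itm_score = F.softmax(itm_logits, dim=1)[:, 1]

    # ITC head: cosine similarity of the projected global features.
    itc_score = model(image, caption, match_head='itc')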

LiJunnan1992 commented 2 years ago

Hi @dribnet, just curious, does the ITM similarity work for VQGAN?

dribnet commented 2 years ago

Yes - we have some preliminary results but need to clean up the code a bit. Will post results here. 👍

dribnet commented 2 years ago

We have a version of a BLIP loss that we plan on adding to an upcoming release. So far in our testing, the BLIP-guided loss works but doesn't "outperform" CLIP on most subjects. However, there can be good effects when combining it with a CLIP loss. For example, start with a baseline CLIP + diffusion result for "a blueprint of a steampunk submarine" that looks like this:

[image]

The result with a pure BLIP model (in this case flickr base) isn't generally better:

[image]

But when we combine the CLIP result with an additional BLIP loss, we often do get enhancements over the CLIP-only version:

[image]

So this is a bit hand-wavy, but those are our first impressions of generating imagery with ITM similarity. I think a lot of this also depends on the dataset, both the subject matter and the formatting of the captions, so perhaps if we reviewed the training sets for these models we could find specific prompts that better match the training distribution.

Thanks again for including this ITM demo in your notebook as a basis for these experiments! Feel free to close the issue if you'd like, and I'll follow up when this is released in case anyone wants to try out their own prompts.

LiJunnan1992 commented 2 years ago

Thanks for the detailed update! I have a few questions:

  1. Have you tried using ITC similarity from BLIP, and combining BLIP's ITC and ITM losses? The pre-trained BLIP checkpoint w/ ITC should function in the same way as CLIP, since CLIP only uses ITC for pre-training.

  2. The advantage of ITM is that it can capture finer-grained details in longer captions. Therefore, it may be interesting to compare BLIP w/ ITM to CLIP using longer texts. Also, since the fine-tuned BLIP checkpoints (e.g. coco_retrieval_base) use natural-scene images for fine-tuning, I wonder how BLIP performs for long texts that describe natural scenes (e.g. a little girl in yellow shirt is playing with a dog in the backyard).

  3. May I know how you use the ITM to guide image generation? Do you compute the cross-entropy loss after the softmax, or do you take the logit before the softmax and guide the image to increase the positive logit's score? I guess these two approaches will yield different results. Furthermore, since ITM uses all of the image patch features (rather than only a global image feature, as CLIP does), I wonder whether this difference will affect the generation results?
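(For concreteness, the two options could look roughly like this, where itm_logits is the [negative, positive] output of the ITM head; an illustrative sketch, not necessarily what the generation code does:)

    import torch
    import torch.nn.functional as F

    def itm_loss_softmax(itm_logits: torch.Tensor) -> torch.Tensor:
        # Cross-entropy style: maximize log P(positive) after the softmax.
        # The optimizer can also reduce this loss by pushing the negative logit down.
        return -F.log_softmax(itm_logits, dim=1)[:, 1].mean()

    def itm_loss_raw_logit(itm_logits: torch.Tensor) -> torch.Tensor:
        # Raw-logit style: directly increase the (unbounded) positive logit.
        return -itm_logits[:, 1].mean()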

Looking forward to the release of the code!

dribnet commented 2 years ago

Thanks for the feedback! Glad to work on this a bit more with you if you are interested. In response to your questions:

1) The current version uses ITM only; we did not yet try ITC because (IIUC) this would involve modifying the pre-trained models and then re-training to essentially fine-tune these new layers. But I would still be up for trying this, as it would also be much easier to plug into our system, which is currently built around the semantics of CLIP (and SLIP).

2) Yes - I agree that better matching the text queries to the training distribution would help show BLIP's potential. Here's a quick test using your test sentence:

Fine-tuned checkpoints, prompt: "a little girl in yellow shirt is playing with a dog in the backyard"

CLIP ViT-B/16: [image]
COCO BLIP ViT-B/16: [image]
CLIP ViT-B/16 + COCO BLIP ViT-B/16: [image]

It's hard to read the tea leaves on just one run, but it seems CLIP is doing better with BLIP's input. I can put this online in a web interface so you can try out your own text sentences if you are interested.

3) Yes, feel free to check out our current implementation here; @samedii did a number of tests with different loss options. But it's certainly possible that there are issues with the current working version of the loss that are negatively impacting the results.

One other thing I forgot to mention is that BLIP's larger input size (384x384) is another nice feature relative to CLIP's 224x224 (note that some CLIP models do go up to 448, but they are very memory intensive). So generating larger images might also show off some of BLIP's capabilities relative to CLIP. Here's a BLIP+CLIP version of "a blueprint of a steampunk submarine" that's closer to this resolution, and my hunch is that more fine details emerge thanks to BLIP's input.

[image]

LiJunnan1992 commented 2 years ago

Super interesting!

After checking your code, it seems that both BLIP's ITC head and ITM head are used, computing the spherical_distance_itc and itm_loss losses respectively.

  1. My hunch is that using softmax for itm_loss may not be as effective as directly using the raw logit. Instead of increasing the positive logit, the model can also minimize the softmax loss by reducing the negative logit, which may produce a weaker signal to the image generator.

  2. Is there any reason for using spherical_distance_itc instead of the cosine distance?

  3. I notice that there seem to be multiple text prompts; I'm curious what they are.

samedii commented 2 years ago

I tested this a bit while I was implementing it. I got the best results with spherical_distance_itc (but the cosine-similarity ITC was very similar).
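(For context, the spherical distance typically used in CLIP-guidance code is the squared geodesic distance between L2-normalized embeddings, so it ranks pairs the same way as cosine distance but scales the gradients differently; a sketch of a common definition, not necessarily the exact perceptor implementation:)

    import torch
    import torch.nn.functional as F

    def spherical_distance(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # Squared great-circle (angular) distance between unit vectors.
        x = F.normalize(x, dim=-1)
        y = F.normalize(y, dim=-1)
        return (x - y).norm(dim=-1).div(2).arcsin().pow(2).mul(2)

    def cosine_distance(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # 1 - cos(theta): same ordering of pairs, different gradient scaling.
        return 1 - F.cosine_similarity(x, y, dim=-1)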

I kept the softmax ITM loss in my final loss too. I decided to optimize the softmax, though, since the raw logits don't behave that well and it felt better to "cap" its effect. I mostly kept it in because I had spent the time implementing it... :)

https://github.com/pixray/pixray/blob/5a640eb56446072e41d3b885c3a082cd71588882/Losses/BLIPLoss.py

return (spherical_distance_itc + itm_loss) / 2  # equal-weight average of the ITC and ITM losses

https://github.com/samedii/perceptor/blob/master/perceptor/losses/blip.py

        image_embeddings, image_encodings_itc = self.model.encode_images(images)
        return (
            # ITC-style branch: spherical distance on the projected encodings (weight 0.9)
            self.model.image_text_captioning_spherical_distance(
                image_encodings_itc, self.encodings
            ).mean()
            * 0.9
            # ITM branch: match probabilities from the cross-attention head (weight 0.1)
            + self.model.image_text_retrieval_probabilities(
                self.tokenized_texts, image_embeddings
            ).mean()
            * 0.1
        )

I'm getting good results with only BLIP models like this (without any CLIP models involved).

LiJunnan1992 commented 2 years ago

Interesting!

I guess the itm_head without softmax can indeed make optimization difficult because the logit is unbounded. The itc_loss could provide some complementary signal that regularizes the itm_loss.

btw, I wrote 'itc' to represent the 'image-text contrastive' loss during pre-training :)

samedii commented 2 years ago

Very interesting indeed, thanks for letting us try out your work! :) Yes, I'm also hoping it will sometimes improve things, but it's of course very hard to evaluate whether it's working that way.

I see, then I was fooled! :D Will change that