openai / CLIP

CLIP (Contrastive Language-Image Pretraining): predict the most relevant text snippet given an image

Why are the text encoder's (i.e. the transformer's) parameters set according to the image encoder used? #384

Open abhishek-topwal opened 1 year ago

abhishek-topwal commented 1 year ago

I was going through the code to understand the architecture. I am using the ViT-B/16 model.

The downloaded model's parameters are first loaded here as a state dict:

    if not jit:
        # build_model reconstructs the full CLIP model from the checkpoint's state dict
        model = build_model(state_dict or model.state_dict()).to(device)
        if str(device) == "cpu":
            model.float()
        return model, _transform(model.visual.input_resolution)
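
For reference, this is how I inspected what the downloaded checkpoint actually contains (just a quick snippet on my side, assuming the package from this repo is installed; the shapes noted in the comments are what I see for ViT-B/16):

    import torch
    import clip

    # Load ViT-B/16 and inspect the keys of its state dict.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/16", device=device)
    state_dict = model.state_dict()

    # Image-encoder tensors are prefixed with "visual."; the text encoder's tensors
    # (token_embedding, positional_embedding, transformer.*, ln_final, text_projection)
    # live at the top level of the same state dict.
    print(sum(k.startswith("visual.") for k in state_dict), "visual.* tensors")
    print(state_dict["visual.conv1.weight"].shape)  # patch embedding of the image encoder
    print(state_dict["text_projection"].shape)      # text-side projection, e.g. [512, 512]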

Then, in model.py, the transformer's (i.e. the text encoder's) parameters are set according to the values of the loaded image encoder:

    embed_dim = state_dict["text_projection"].shape[1]            # joint image/text embedding dimension
    context_length = state_dict["positional_embedding"].shape[0]  # maximum text length (77)
    vocab_size = state_dict["token_embedding.weight"].shape[0]    # BPE vocabulary size
    transformer_width = state_dict["ln_final.weight"].shape[0]    # text transformer hidden size
    transformer_heads = transformer_width // 64                   # 64 dimensions per attention head
    # number of distinct residual blocks under "transformer.resblocks"
    transformer_layers = len(set(k.split(".")[2] for k in state_dict if k.startswith("transformer.resblocks")))
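
For comparison (my paraphrase of the ViT branch in the same build_model, not a verbatim quote), the image encoder's hyperparameters are inferred from the `visual.*` tensors of the same state dict:

    # Sketch of the ViT branch of build_model (paraphrased): vision-side
    # hyperparameters are read from the "visual.*" tensors of the checkpoint.
    vision_width = state_dict["visual.conv1.weight"].shape[0]        # ViT hidden size
    vision_patch_size = state_dict["visual.conv1.weight"].shape[-1]  # 16 for ViT-B/16
    vision_layers = len([k for k in state_dict
                         if k.startswith("visual.") and k.endswith(".attn.in_proj_weight")])
    grid_size = round((state_dict["visual.positional_embedding"].shape[0] - 1) ** 0.5)
    image_resolution = vision_patch_size * grid_size                 # e.g. 224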

Is there an underlying reason for setting the transformer's parameters this way?

Thanks in advance