openai / CLIP

CLIP (Contrastive Language-Image Pretraining): predict the most relevant text snippet given an image

Why are the text encoder's (i.e. the transformer's) parameters set according to the image encoder used? #384

Open abhishek-topwal opened 1 year ago

abhishek-topwal commented 1 year ago

I was going through the code to understand the architecture. I am using the ViT-B/16 model.

The downloaded model's parameters are first loaded here as a state dict:

    if not jit:
        # build_model reconstructs the full CLIP model from the checkpoint's state dict
        model = build_model(state_dict or model.state_dict()).to(device)
        if str(device) == "cpu":
            model.float()
        return model, _transform(model.visual.input_resolution)
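
For reference, this is how I inspected what the downloaded checkpoint actually contains (just a quick snippet on my side, assuming the package from this repo is installed; the shapes noted in the comments are what I see for ViT-B/16):

    import torch
    import clip

    # Load ViT-B/16 and inspect the keys of its state dict.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/16", device=device)
    state_dict = model.state_dict()

    # Image-encoder tensors are prefixed with "visual."; the text encoder's tensors
    # (token_embedding, positional_embedding, transformer.*, ln_final, text_projection)
    # live at the top level of the same state dict.
    print(sum(k.startswith("visual.") for k in state_dict), "visual.* tensors")
    print(state_dict["visual.conv1.weight"].shape)  # patch embedding of the image encoder
    print(state_dict["text_projection"].shape)      # text-side projection, e.g. [512, 512]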

Then, in model.py, the transformer's (i.e. the text encoder's) parameters are set according to the values of the loaded image encoder:

    embed_dim = state_dict["text_projection"].shape[1]            # joint image/text embedding dimension
    context_length = state_dict["positional_embedding"].shape[0]  # maximum text length (77)
    vocab_size = state_dict["token_embedding.weight"].shape[0]    # BPE vocabulary size
    transformer_width = state_dict["ln_final.weight"].shape[0]    # text transformer hidden size
    transformer_heads = transformer_width // 64                   # 64 dimensions per attention head
    # number of distinct residual blocks under "transformer.resblocks"
    transformer_layers = len(set(k.split(".")[2] for k in state_dict if k.startswith("transformer.resblocks")))
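
For comparison (my paraphrase of the ViT branch in the same build_model, not a verbatim quote), the image encoder's hyperparameters are inferred from the `visual.*` tensors of the same state dict:

    # Sketch of the ViT branch of build_model (paraphrased): vision-side
    # hyperparameters are read from the "visual.*" tensors of the checkpoint.
    vision_width = state_dict["visual.conv1.weight"].shape[0]        # ViT hidden size
    vision_patch_size = state_dict["visual.conv1.weight"].shape[-1]  # 16 for ViT-B/16
    vision_layers = len([k for k in state_dict
                         if k.startswith("visual.") and k.endswith(".attn.in_proj_weight")])
    grid_size = round((state_dict["visual.positional_embedding"].shape[0] - 1) ** 0.5)
    image_resolution = vision_patch_size * grid_size                 # e.g. 224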

Is there an underlying reason for setting the transformer's parameters this way?

Thanks in advance