I was going through the code to understand the architecture. I am using a ViT-B/16 model.
The downloaded model's parameters are first loaded as a dictionary here:
```python
if not jit:
    model = build_model(state_dict or model.state_dict()).to(device)
    if str(device) == "cpu":
        model.float()
    return model, _transform(model.visual.input_resolution)
```
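For context, this is a rough sketch of how I'm loading the checkpoint and looking at it as a plain dictionary (assuming the openai/CLIP package; the JIT code path is omitted):

```python
import clip

# Download and build the non-JIT model (this goes through the snippet above).
model, preprocess = clip.load("ViT-B/16", device="cpu", jit=False)

# The checkpoint's parameters are available as an ordinary state dict,
# keyed by names like "text_projection", "ln_final.weight", etc.
state_dict = model.state_dict()
print(state_dict["text_projection"].shape)  # torch.Size([512, 512]) for ViT-B/16
```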
Then in model.py, the text transformer's hyperparameters are set from the shapes of the weights in this loaded state_dict:
```python
embed_dim = state_dict["text_projection"].shape[1]
context_length = state_dict["positional_embedding"].shape[0]
vocab_size = state_dict["token_embedding.weight"].shape[0]
transformer_width = state_dict["ln_final.weight"].shape[0]
transformer_heads = transformer_width // 64
transformer_layers = len(set(k.split(".")[2] for k in state_dict if k.startswith("transformer.resblocks")))
```
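To check my understanding, here's a small sanity-check sketch (my own, not from the repo) confirming that these shape-derived values match the text transformer that build_model actually constructs for ViT-B/16:

```python
# Assumes `model` and `state_dict` from the loading sketch above.
transformer_width = state_dict["ln_final.weight"].shape[0]
transformer_layers = len(set(
    k.split(".")[2] for k in state_dict
    if k.startswith("transformer.resblocks")
))

# The constructed text transformer exposes these as attributes.
assert model.transformer.width == transformer_width    # 512 for ViT-B/16
assert model.transformer.layers == transformer_layers  # 12 for ViT-B/16
assert model.context_length == state_dict["positional_embedding"].shape[0]  # 77
print(transformer_width, transformer_layers, model.context_length)
```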
Is there an underlying reason for inferring the transformer's parameters from the checkpoint this way, rather than hard-coding them?
Thanks in advance