I think I've been able to get a text embedding from Meta-Transformer (assuming I'm doing it correctly). The code below is adapted from the README. Note that you may need to install some dependencies, and the README has the download link for the `.pth` checkpoint. The `zero_padding` function comes from `Data2Seq/Text.py`; I recreated it locally but did not include it below, so you can pull it from the repo. I have not been able to get any other modality working yet, but I will keep at it.
```python
import torch
import torch.nn as nn
from timm.models.vision_transformer import Block

# For the base-scale encoder (checkpoint link is in the README):
ckpt = torch.load("Meta-Transformer_base_patch16_encoder.pth")
encoder = nn.Sequential(*[
    Block(
        dim=768,
        num_heads=12,
        mlp_ratio=4.,
        qkv_bias=True,
        norm_layer=nn.LayerNorm,
        act_layer=nn.GELU
    )
    for i in range(12)])
encoder.load_state_dict(ckpt, strict=True)

# Tokenize text with CLIP; ViT-B/32 text features are 512-dim,
# so zero-pad them to the encoder's 768-dim input.
import clip

model, preprocess = clip.load('ViT-B/32', "cpu")
text_tensor = clip.tokenize("is it working?")
encoding = model.encode_text(text_tensor)                   # (1, 512)
encoding = zero_padding(encoding, 768, "cpu")               # (1, 768)
encoded_features_text = encoder(encoding.unsqueeze(dim=1))  # (1, 1, 768)
```
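For completeness, here is a minimal sketch of what `zero_padding` needs to do, judging from how it is called above: right-pad the last dimension with zeros up to the target width. This is my reconstruction, not the repo's exact code; the authoritative version is in `Data2Seq/Text.py`.

```python
def zero_padding(x, target_dim, device):
    # Hypothetical reconstruction: right-pad the last dimension of x
    # with zeros until it has target_dim entries.
    pad = torch.zeros(*x.shape[:-1], target_dim - x.shape[-1], device=device)
    return torch.cat([x, pad], dim=-1)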
@jawhster Hello! We are continuing to study the Meta-Transformer project, and perhaps you have some ideas about my new question; I would be really grateful. When we combine tokens of two modalities, such as text (1 x 768) and an image (1 x 768), we concatenate them with `torch.concat` into a 2 x 768 sequence:

```python
features = torch.concat([text_tokenizer(text), image_tokenizer(image)], dim=1)
```

After the encoder we then get a 2 x 768 embedding. But since the text and the image belong to the same data row, we would like a single 1 x 768 embedding. How can we do that?
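Not an authoritative answer, but a standard trick in transformer pipelines is to pool over the token dimension of the encoder output. A minimal sketch, assuming the encoder output keeps the text token and image token as two rows along a token dimension:

```python
# encoded_features: encoder output, shape (2, 768) or (1, 2, 768)
# depending on whether a batch dimension is kept.
encoded_features = encoder(features)

# Mean-pool over the token dimension: one fused vector per sample.
fused = encoded_features.mean(dim=-2)  # -> (768,) or (1, 768)
```

Taking the mean treats both modalities equally; alternatives include max-pooling or prepending a learnable CLS-style token and keeping only its output.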
Hello! I've been trying to figure out Meta-Transformer for two weeks now and I still can't get the embeddings I need. Please share code for the following example: how to get text and image embeddings for the words "dog", "car", "bird" and their matching pictures. Thanks!
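Not a definitive answer, but building on @jawhster's snippet above, here is one possible sketch. It reuses CLIP as the tokenizer for both modalities and feeds the padded features through the shared `encoder` defined earlier; the image file names are placeholders, and `zero_padding` is the helper discussed above.

```python
import torch
import clip
from PIL import Image

model, preprocess = clip.load('ViT-B/32', "cpu")
words = ["dog", "car", "bird"]

# Text: CLIP ViT-B/32 text features are 512-dim; pad to 768 as above.
with torch.no_grad():
    text_feats = model.encode_text(clip.tokenize(words))      # (3, 512)
text_feats = zero_padding(text_feats, 768, "cpu")             # (3, 768)
text_emb = encoder(text_feats.unsqueeze(1))                   # (3, 1, 768)

# Images: same idea with CLIP's image encoder (also 512-dim here).
# "dog.jpg" etc. are placeholder file names.
imgs = torch.stack([preprocess(Image.open(f"{w}.jpg")) for w in words])
with torch.no_grad():
    img_feats = model.encode_image(imgs)                      # (3, 512)
img_feats = zero_padding(img_feats, 768, "cpu")               # (3, 768)
img_emb = encoder(img_feats.unsqueeze(1))                     # (3, 1, 768)
```

I can't say whether this matches what the authors intended (the repo's `Data2Seq` tokenizers may differ from using CLIP directly), so treat it as a starting point rather than reference usage.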