invictus717 / MetaTransformer

Meta-Transformer for Unified Multimodal Learning
https://arxiv.org/abs/2307.10802
Apache License 2.0
1.52k stars · 114 forks

Explain please #52

Closed vzapylikhin closed 1 year ago

vzapylikhin commented 1 year ago

Hello! I've been trying to figure out Meta-Transformer for two weeks now and I can't get the embeddings I need. Could you please share code for the following example: how to get text and image embeddings for the words "dog", "car", "bird" and their corresponding pictures? Thanks!

jawhster commented 1 year ago

I think I've been able to get a text embedding from Meta-Transformer (assuming I'm doing it correctly). The code below is adapted from the README. Note that you may need some additional installs, and see the README for the download link to the .pth checkpoint. Also, the zero_padding function comes from Data2Seq > Text.py; I recreated it locally but did not include it below. I have not been able to get any other modality working yet, but will keep at it.

import torch
import torch.nn as nn
from timm.models.vision_transformer import Block
import clip

# For the base-scale encoder: 12 Transformer blocks, width 768
ckpt = torch.load("Meta-Transformer_base_patch16_encoder.pth")
encoder = nn.Sequential(*[
    Block(
        dim=768,
        num_heads=12,
        mlp_ratio=4.,
        qkv_bias=True,
        norm_layer=nn.LayerNorm,
        act_layer=nn.GELU,
    )
    for _ in range(12)
])
encoder.load_state_dict(ckpt, strict=True)

# Tokenize the text with CLIP, then pad the 512-d CLIP embedding to 768
model, preprocess = clip.load('ViT-B/32', "cpu")
text_tensor = clip.tokenize("is it working?")
encoding = model.encode_text(text_tensor)                   # shape (1, 512)
encoding = zero_padding(encoding, 768, "cpu")               # shape (1, 768)
encoded_features_text = encoder(encoding.unsqueeze(dim=1))  # shape (1, 1, 768)

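Since zero_padding is referenced above but not included, here is a minimal recreation matching the call signature used above (`zero_padding(tensor, 768, "cpu")`); this is a sketch of what the helper in Data2Seq > Text.py does (pad the last dimension with zeros up to the target width), not the repo's exact implementation:

```python
import torch

def zero_padding(tensor, target_dim, device):
    # Pad the last dimension with zeros up to target_dim,
    # e.g. a (1, 512) CLIP text embedding becomes (1, 768).
    pad_width = target_dim - tensor.shape[-1]
    if pad_width <= 0:
        return tensor.to(device)
    padding = torch.zeros(*tensor.shape[:-1], pad_width, device=device)
    return torch.cat([tensor.to(device), padding], dim=-1)
```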
vzapylikhin commented 11 months ago

@jawhster Hello! We are continuing to study the Meta-Transformer project; perhaps you have some ideas about my new question. I would be really grateful. When we combine tokens of two modalities, such as text (1 x 768) and an image (1 x 768), we get a combined token of shape 2 x 768 using torch.concat:

features = torch.concat([text_tokenizer(text), image_tokenizer(image)],dim=1)

Then, after the encoder, we get a 2 x 768 embedding. But since both the text and the picture describe the same data item, we would like to end up with a single 1 x 768 embedding. How can this be done?
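One common way to collapse a set of token embeddings into a single vector is to average-pool over the token dimension after the encoder. This is a generic sketch, not something prescribed by the Meta-Transformer code; the `fused` tensor below stands in for the 2 x 768 encoder output:

```python
import torch

# Stand-in for the 2 x 768 encoder output:
# one text token and one image token for the same data item.
fused = torch.randn(2, 768)

# Mean pooling over the token dimension yields one 1 x 768 embedding.
pooled = fused.mean(dim=0, keepdim=True)
print(pooled.shape)  # torch.Size([1, 768])
```

Alternatives include taking only the first token (CLS-style) or a learned weighted sum; which works best depends on the downstream task.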