invictus717 / MetaTransformer

Meta-Transformer for Unified Multimodal Learning
https://arxiv.org/abs/2307.10802
Apache License 2.0

About Modality-Agnostic Models #72

Closed regainOWO closed 2 months ago

regainOWO commented 2 months ago

Hi! Thanks for your great contributions! I'd like to know how the Meta-Transformer-B16 and Meta-Transformer-L14 model files were trained. I found that they contain only the Transformer block weights.

invictus717 commented 2 months ago

They are CLIP models pretrained on the LAION-2B dataset. We keep only the transformer blocks and combine them with the modality-specific tokenizers proposed in our paper for each modality.
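
In practice this means the checkpoint holds nothing but a stack of plain transformer blocks, which any modality-specific tokenizer can feed into. Here is a minimal PyTorch sketch; the checkpoint filename and the ViT-Base dimensions (12 blocks, width 768, 12 heads) are assumptions based on the B16 naming, and `tokenizer` in the final comment is a hypothetical stand-in for one of the paper's modality-specific tokenizers:

```python
import torch
import torch.nn as nn
from timm.models.vision_transformer import Block

# Checkpoint filename is an assumption based on the release naming.
ckpt = torch.load("Meta-Transformer_base_patch16_encoder.pth", map_location="cpu")

# The shared encoder: nothing but transformer blocks, matching the
# observation above that the file contains only Transformer block weights.
encoder = nn.Sequential(*[
    Block(dim=768, num_heads=12, mlp_ratio=4.0, qkv_bias=True,
          norm_layer=nn.LayerNorm, act_layer=nn.GELU)
    for _ in range(12)
])
encoder.load_state_dict(ckpt, strict=True)

# Each modality goes through its own tokenizer first, then the shared blocks:
# features = encoder(tokenizer(x))  # tokenizer: hypothetical, per modality
```

Different modalities share these blocks; only the tokenizer in front (and a task head behind) changes per modality.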

regainOWO commented 2 months ago

Thanks for your reply! What about Image_Meta-Transformer-B16? Is it obtained by training Meta-Transformer-B16 as a ViT on the ImageNet-1K dataset?

invictus717 commented 2 months ago

It's a Meta-Transformer weight finetuned on image datasets.
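
For concreteness, such finetuning amounts to wrapping the shared blocks with an image tokenizer (a patch embedding) and a classification head, then training on labeled images. The sketch below is hypothetical, not the repo's training code: the class count assumes the ImageNet-1K setting from the question, and `encoder` is the block stack from the snippet above:

```python
import torch.nn as nn
from timm.layers import PatchEmbed  # patch embedding serves as the image tokenizer

class ImageMetaTransformer(nn.Module):
    def __init__(self, encoder, num_classes=1000):  # 1000 = ImageNet-1K classes
        super().__init__()
        self.tokenizer = PatchEmbed(img_size=224, patch_size=16, embed_dim=768)
        self.encoder = encoder           # shared transformer blocks from above
        self.norm = nn.LayerNorm(768)
        self.head = nn.Linear(768, num_classes)

    def forward(self, images):
        tokens = self.tokenizer(images)              # (B, N, 768) patch tokens
        feats = self.encoder(tokens)                 # shared blocks
        return self.head(self.norm(feats).mean(1))  # mean-pool, then classify
```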