invictus717 / MetaTransformer

Meta-Transformer for Unified Multimodal Learning
https://arxiv.org/abs/2307.10802
Apache License 2.0
1.52k stars 114 forks

how to use it? #41

Closed vzapylikhin closed 1 year ago

vzapylikhin commented 1 year ago

Hello! Your transformer is amazing, but I'm a beginner in data science. I'm doing research for a university project: we want to predict how negotiations will end, and we have several modalities, including video, audio, and time-series EEG. Do you have a demo showing how to use Meta-Transformer for such tasks? If so, please share it. Thanks!

invictus717 commented 1 year ago

Thank you for your interest in Meta-Transformer. You can refer to the demo code as follows:


import torch
import torch.nn as nn
from timm.models.vision_transformer import Block

from Data2Seq import Data2Seq

# One tokenizer per modality; each maps raw inputs to a sequence of 768-dim tokens.
video_tokenizer = Data2Seq(modality='video', dim=768)
audio_tokenizer = Data2Seq(modality='audio', dim=768)
time_series_tokenizer = Data2Seq(modality='time-series', dim=768)

# Concatenate the per-modality token sequences along the token dimension.
features = torch.concat([video_tokenizer(video),
                         audio_tokenizer(audio),
                         time_series_tokenizer(time_data)], dim=1)

# The pretrained Meta-Transformer encoder is a stack of 12 standard ViT blocks.
ckpt = torch.load("Meta-Transformer_base_patch16_encoder.pth")
encoder = nn.Sequential(*[
            Block(
                dim=768,
                num_heads=12,
                mlp_ratio=4.,
                qkv_bias=True,
                norm_layer=nn.LayerNorm,
                act_layer=nn.GELU
            )
            for _ in range(12)])
encoder.load_state_dict(ckpt, strict=True)

# Modality-agnostic fused representation, ready for a task-specific head.
fused_features = encoder(features)
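The snippet assumes video, audio, and time_data already exist as batched tensors. A minimal end-to-end sketch with dummy inputs (the shapes below are guesses for illustration, not documented in this thread; adjust them to whatever Data2Seq actually expects):

# Dummy stand-ins for real data; all shapes are assumptions.
video = torch.randn(1, 3, 8, 224, 224)   # assumed (batch, channels, frames, height, width)
audio = torch.randn(1, 1, 128, 1024)     # assumed (batch, channels, mel bins, time steps)
time_data = torch.randn(1, 96, 1)        # assumed (batch, sequence length, features)

with torch.no_grad():                    # inference only
    features = torch.concat([video_tokenizer(video),
                             audio_tokenizer(audio),
                             time_series_tokenizer(time_data)], dim=1)
    fused_features = encoder(features)   # (batch, total tokens, 768)
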
vzapylikhin commented 1 year ago

Thanks for the answer! I think I found a small bug in Data2Seq.py. The code currently reads:

self.embed = Time_Series.DataEmbedding(cin = 1, d_model = self.embed_dim)

However, it should be:

self.embed = Time_Series.DataEmbedding(c_in = 1, d_model = self.embed_dim)
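If DataEmbedding's signature does indeed use c_in, the original keyword would raise TypeError: __init__() got an unexpected keyword argument 'cin'. A quick sanity check after the fix might look like:

# Hypothetical check that the time-series tokenizer now builds and runs.
ts_tokenizer = Data2Seq(modality='time-series', dim=768)
x = torch.randn(1, 96, 1)       # assumed (batch, seq_len, 1), since c_in=1
print(ts_tokenizer(x).shape)    # expected roughly (1, num_tokens, 768)
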
jawhster commented 1 year ago

Thank you for the repo! One minor note: video_tokenier and audio_tokenier in the originally posted snippet should be video_tokenizer and audio_tokenizer, respectively. Also, what do video and audio represent in the demo code? I'm doing something like the code below, but I get AttributeError: 'tuple' object has no attribute 'shape'. In other words, I don't know how to set video and audio to use the demo code. Thank you!

video = torchvision.io.read_video(video_path)
features = torch.concat([video_tokenizer(video)], dim=1)
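
Note that torchvision.io.read_video returns a (frames, audio, metadata) tuple rather than a single tensor, which would explain the AttributeError. A sketch that unpacks the tuple first, assuming the video tokenizer expects (batch, channels, frames, height, width):

# read_video yields frames of shape (T, H, W, C) as uint8, plus audio samples and metadata.
frames, waveform, info = torchvision.io.read_video(video_path, pts_unit='sec')
video = frames.permute(3, 0, 1, 2).unsqueeze(0).float()  # assumed target layout: (1, C, T, H, W)
features = torch.concat([video_tokenizer(video)], dim=1)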
vzapylikhin commented 1 year ago


Hello! I've been trying to figure out Meta-Transformer for two weeks now and I can't get the text embedding I need. Did you succeed? If so, please share your code. Thanks!