vzapylikhin closed this issue 1 year ago.
Thank you for your interest in Meta-Transformer. You can refer to the demo code as follows:
from Data2Seq import Data2Seq
import torch
import torch.nn as nn
from timm.models.vision_transformer import Block

# One tokenizer per modality; each maps raw input to a sequence of 768-dim tokens.
video_tokenizer = Data2Seq(modality='video', dim=768)
audio_tokenizer = Data2Seq(modality='audio', dim=768)
time_series_tokenizer = Data2Seq(modality='time-series', dim=768)

# Concatenate the token sequences of all modalities along the sequence dimension.
features = torch.concat([video_tokenizer(video),
                         audio_tokenizer(audio),
                         time_series_tokenizer(time_data)], dim=1)

# The shared encoder: a stack of 12 standard ViT blocks loaded with pretrained weights.
ckpt = torch.load("Meta-Transformer_base_patch16_encoder.pth")
encoder = nn.Sequential(*[
    Block(
        dim=768,
        num_heads=12,
        mlp_ratio=4.,
        qkv_bias=True,
        norm_layer=nn.LayerNorm,
        act_layer=nn.GELU,
    )
    for _ in range(12)
])
encoder.load_state_dict(ckpt, strict=True)

# Modality-agnostic fused representation.
fused_features = encoder(features)
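If it helps, here is a minimal sketch of one way to consume fused_features for a downstream task. The mean-pooling, the linear head, and num_classes are illustrative assumptions, not part of the released code:

# Hypothetical downstream head -- not part of the Meta-Transformer release.
num_classes = 10  # assumption: task-dependent
head = nn.Linear(768, num_classes)
pooled = fused_features.mean(dim=1)  # mean-pool over the token dimension -> (batch, 768)
logits = head(pooled)                # (batch, num_classes)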
Thanks for the answer! I think I found a small bug in Data2Seq.py. The code currently reads:
self.embed = Time_Series.DataEmbedding(cin = 1, d_model = self.embed_dim)
However, it should be:
self.embed = Time_Series.DataEmbedding(c_in = 1, d_model = self.embed_dim)
Thank you for the repo! Minor misspelling: video_tokenier and audio_tokenier should be video_tokenizer and audio_tokenizer, respectively. Also, what do video and audio represent in the demo code? I'm doing something like the code below, but I get AttributeError: 'tuple' object has no attribute 'shape'. In other words, I don't know how to set video and audio to use the demo code. Thank you!
video = torchvision.io.read_video(video_path)
features = torch.concat([video_tokenizer(video)], dim=1)
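In case it helps: torchvision.io.read_video returns a tuple (video_frames, audio_frames, info), not a tensor, so passing the whole tuple into the tokenizer raises AttributeError: 'tuple' object has no attribute 'shape'. A minimal sketch of unpacking it first; the permute/scaling step is an assumption about the layout Data2Seq expects:

import torchvision

# read_video returns three values:
#   video: uint8 tensor of shape (T, H, W, C)
#   audio: float tensor of shape (channels, num_samples)
#   info:  dict with metadata such as the frame rate
video, audio, info = torchvision.io.read_video(video_path)

# Assumption: Data2Seq expects float frames in (T, C, H, W) layout, scaled to [0, 1].
video = video.permute(0, 3, 1, 2).float() / 255.0
features = torch.concat([video_tokenizer(video)], dim=1)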
Hello! I've been trying to figure out Meta-Transformer for two weeks now and I can't get the text embedding I need. Did you succeed? If so, please share your code. Thanks!
Hello! Your transformer is amazing! But I'm a beginner in data science, and I have to do research for a university task: we want to predict how negotiations will end. We have various modalities, including video, audio, and time-series EEG. Do you have a demo showing how to use the transformer for such tasks? If so, please share it. Thanks!