invictus717 / MetaTransformer

Meta-Transformer for Unified Multimodal Learning
https://arxiv.org/abs/2307.10802
Apache License 2.0
1.51k stars 113 forks

Questions about inference #5

Closed Kelsey2018 closed 1 year ago

Kelsey2018 commented 1 year ago

Hello, how do you determine which modality is being input during inference? Is a classification network used before the unimodal expert transformer?

invictus717 commented 1 year ago

Thank you for your interest in Meta-Transformer. We propose a very simple design: specifically, we design different patch embedding layers, such as an S×S Conv-Flatten layer, for the pre-trained and frozen encoder. Therefore, there is no need for an additional classification network. We will release the implementation of the Data-to-Sequence module soon.
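
A minimal sketch of what such a modality-specific patch-embedding layer could look like in PyTorch. The class name and hyper-parameters are illustrative assumptions, not the repository's actual API:

```python
import torch.nn as nn

class ImageTokenizer(nn.Module):
    """S x S Conv + flatten: turns an image into a token sequence
    for a shared, frozen transformer encoder.
    (Illustrative sketch; names and defaults are assumptions.)"""

    def __init__(self, in_chans=3, embed_dim=768, patch_size=16):
        super().__init__()
        # S x S convolution with stride S acts as a non-overlapping patch projection
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, C, H, W)
        x = self.proj(x)                     # (B, D, H/S, W/S)
        return x.flatten(2).transpose(1, 2)  # (B, N, D) token sequence
```

Because each modality gets its own lightweight tokenizer while the encoder stays frozen and shared, the caller simply picks the tokenizer that matches the data at hand, so no classifier is needed to detect the modality.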

Kelsey2018 commented 1 year ago

Wow, it's amazing work! I am also very interested in the number of parameters of the Data-to-Sequence module, and in the specific model used for the unified multimodal encoder: is it LLaMA or some other similar LLM?

invictus717 commented 1 year ago

The checkpoints of the unified multimodal encoder have been released. We also provide a quick-start demo. The unified multimodal encoder consists of plain transformer blocks, with 85M and 302M parameters for the base and large scales, respectively. More information can be found in this

Kelsey2018 commented 1 year ago

Thanks for telling me the parameter count of the Data-to-Sequence module (I just noticed you already wrote it in the README)! Another question: are the flattened output dimensions of the meta scheme the same across modalities? How do you deal with the problem of alignment after embedding unpaired inputs?

invictus717 commented 1 year ago

Apologies for any confusion earlier. To clarify, the Data-to-Sequence module has a relatively small parameter count, approximately 10K-20K. In contrast, the Meta-Transformer encoders have 85M parameters at the base scale and 302M at the large scale. Even so, Meta-Transformer is small compared with LLMs.
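
For reference, a quick way to check such parameter counts on any `nn.Module` (e.g. a tokenizer or the encoder). This is a generic PyTorch snippet, not code from the repository:

```python
def count_params(module) -> int:
    """Total number of parameters in a module."""
    return sum(p.numel() for p in module.parameters())

# e.g. count_params(tokenizer) for a Data-to-Sequence layer,
#      count_params(encoder)   for the base (~85M) or large (~302M) encoder
```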

Kelsey2018 commented 1 year ago

Thanks for your patience in answering my questions; I am looking forward to the complete open-source code : )

Kelsey2018 commented 1 year ago

Hi, sorry to ask again: I wonder exactly which model the LLM is, and how many parameters it has?

invictus717 commented 1 year ago

Usually, "LLMs" refers to Large Language Models, and they typically contain 7B, 13B, or 70B parameters, where "B" denotes billion.

Kelsey2018 commented 1 year ago

Okay! Which LLM is used in your experiments, LLaMA or Vicuna?

Kelsey2018 commented 1 year ago

Sorry for my misunderstanding earlier: you did not use an LLM in Meta-Transformer, but rather the structure called the unified encoder, and you freeze its parameters during training.

Kelsey2018 commented 1 year ago

Hi, sorry to bother you again! In the experiments (Section 4), what does "the number of trainable parameters" mean? And why does this number vary across modalities?

invictus717 commented 1 year ago

We typically employ gradient descent to update model parameters. However, if we want to freeze certain parameters and prevent them from being updated, we can set the requires_grad attribute of those tensors to False. As for different modalities, our approach leverages tokenizers of varying dimensions (high or low) and unique task-specific heads; both of these factors contribute to the differences in the number of trainable parameters.
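
A hedged sketch of that setup in PyTorch: freeze the shared encoder via `requires_grad = False`, then count only what remains trainable (the modality tokenizer and the task head). The module names and sizes here are placeholders, not the repository's code:

```python
import torch.nn as nn

def freeze(module: nn.Module):
    """Exclude a module's parameters from gradient updates."""
    for p in module.parameters():
        p.requires_grad = False

# Placeholder components (sizes are illustrative only)
tokenizer = nn.Conv2d(3, 768, kernel_size=16, stride=16)  # Data-to-Sequence layer
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=12,
)
head = nn.Linear(768, 1000)                                # task-specific head

freeze(encoder)  # the shared encoder stays fixed

model = nn.ModuleDict({"tokenizer": tokenizer, "encoder": encoder, "head": head})
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable / 1e6:.2f}M")  # tokenizer + head only
```

Swapping in a different tokenizer (e.g. for point clouds) or a head with a different output size changes this count, which is why it varies across modalities.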

Kelsey2018 commented 1 year ago

Wow! So how do you choose different task-specific heads for different downstream tasks?