invictus717 / MetaTransformer

Meta-Transformer for Unified Multimodal Learning
https://arxiv.org/abs/2307.10802
Apache License 2.0

I have a question in the learning process. #60

Closed YooSungHyun closed 11 months ago

YooSungHyun commented 11 months ago

First of all, thank you for the good research. I was struggling with ImageBind not being available for commercial use and found this to be a good alternative.

I have a couple of questions.

  1. Am I correct in understanding that the core idea of the paper is to patchify all of the data, thus extending the idea of ViT?

  2. Am I correct in my understanding of the learning pipeline? As an example, assume I am doing image classification and audio classification (Method A and Method B below).

Method A

1) Section 3.3 Pretraining
    Train a ViT on LAION-2B. (Trained with CLIP?)
2) Section 3.3 Modality-Agnostic Learning
    Freeze the ViT, and learn the Data2Seq image and data embeddings.
3) Section 3.4 Task-specific Heads
    In the classic way, we freeze the Data2Seq part and train the Head and the ViT for each task respectively.
4) Done!
    You have one image classification head and one audio classification head!
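
To make my question concrete, here is roughly how I picture Method A as toy code. Every module, name, and shape here is a placeholder of my own, not your actual implementation:

```python
import torch
import torch.nn as nn

embed_dim, num_image_classes, num_audio_classes = 768, 10, 5

# 1) Pretraining: stand-in for a ViT encoder pretrained on LAION-2B
#    (with CLIP-style contrastive learning?).
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=12, batch_first=True),
    num_layers=2,
)

# 2) Modality-agnostic learning: freeze the ViT and train the Data2Seq tokenizers
#    that map raw images / audio spectrograms to patch-token sequences.
for p in encoder.parameters():
    p.requires_grad = False
image_data2seq = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)  # toy image Data2Seq
audio_data2seq = nn.Conv2d(1, embed_dim, kernel_size=16, stride=16)  # toy audio Data2Seq

# 3) Task-specific heads: freeze Data2Seq, then train the Head (and the ViT?)
#    for each task. Whether the encoder is unfrozen here is part of my question.
for m in (image_data2seq, audio_data2seq):
    for p in m.parameters():
        p.requires_grad = False
image_head = nn.Linear(embed_dim, num_image_classes)
audio_head = nn.Linear(embed_dim, num_audio_classes)

def classify_image(x):                                     # x: (B, 3, 224, 224)
    tokens = image_data2seq(x).flatten(2).transpose(1, 2)  # (B, 196, embed_dim)
    feats = encoder(tokens).mean(dim=1)                    # shared encoder, mean-pooled
    return image_head(feats)                               # image-class logits

print(classify_image(torch.randn(2, 3, 224, 224)).shape)   # torch.Size([2, 10])
```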

Method B

1) Section 3.3 Pretraining
    Attach the Data2Seq image and data embeddings, and train on enough image and audio data through patch-masking self-supervised learning of the ViT.
2) Section 3.4 Task-specific Heads
    In the classic way, we freeze the Data2Seq part and train the Head and the ViT for each task respectively.
3) Done!
    We have one image classification head and one audio classification head!
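
And this is roughly how I picture the pretraining step of Method B, where Data2Seq and the encoder are trained together with masked-patch self-supervision. Again, everything here is a toy placeholder of mine (the real objective presumably reconstructs raw patches rather than tokenizer outputs):

```python
import torch
import torch.nn as nn

embed_dim = 768
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=12, batch_first=True),
    num_layers=2,
)
image_data2seq = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)  # toy image Data2Seq
decoder = nn.Linear(embed_dim, embed_dim)           # toy reconstruction head
mask_token = nn.Parameter(torch.zeros(embed_dim))   # learned token for masked positions

def masked_patch_loss(tokens, mask_ratio=0.5):
    """tokens: (B, N, D) patch tokens from any Data2Seq tokenizer."""
    B, N, D = tokens.shape
    mask = torch.rand(B, N) < mask_ratio            # True = position to mask
    corrupted = torch.where(mask.unsqueeze(-1), mask_token.expand(B, N, D), tokens)
    pred = decoder(encoder(corrupted))              # encoder sees the masked sequence
    return nn.functional.mse_loss(pred[mask], tokens[mask].detach())  # loss on masked positions only

images = torch.randn(2, 3, 224, 224)
tokens = image_data2seq(images).flatten(2).transpose(1, 2)  # (2, 196, embed_dim)
loss = masked_patch_loss(tokens)
loss.backward()  # gradients reach both the encoder and the Data2Seq tokenizer
```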

When I think of the classical approach, Method B looks more familiar to me. As for Method A, you only showed an example for images in the paper, right? My question is: if I feed audio patch data, etc. into a model trained only on LAION, won't it behave like an outlier, since there is a good chance the model has never been trained on such data? Also, since the Transformer Encoder has a limited number of parameters, my hypothesis is that if Data2Seq is not trained during the pretraining stage as in Method B, the Encoder's parameters will not converge properly on the image and audio classification data. (With very small probability, there may be cases where the image and audio patch data are almost similar? Or maybe it is fine because we train the Heads separately anyway?)

  3. If I'm completely misunderstanding (including my sketches above), could you please provide pseudocode for the training process?

  4. Looking at the other issues, the similarity for each modality seems to be pretty good, but if the example in question 2 is a model that classifies dog breeds by image and dog breeds by sound, will the two Encoder outputs be similar?
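
Concretely, this is how I would check question 4 myself, reusing the toy placeholder modules from my sketches above (with random weights the number is meaningless, of course; I am asking about the behaviour of the actual pretrained encoder):

```python
import torch
import torch.nn as nn

embed_dim = 768
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=12, batch_first=True),
    num_layers=2,
)
image_data2seq = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)  # toy image Data2Seq
audio_data2seq = nn.Conv2d(1, embed_dim, kernel_size=16, stride=16)  # toy audio Data2Seq

def encode(tokenizer, x):
    tokens = tokenizer(x).flatten(2).transpose(1, 2)  # (B, N, embed_dim) patch tokens
    return encoder(tokens).mean(dim=1)                # pooled shared-encoder output

img_feat = encode(image_data2seq, torch.randn(1, 3, 224, 224))  # e.g. a dog photo
aud_feat = encode(audio_data2seq, torch.randn(1, 1, 128, 224))  # e.g. a bark spectrogram
print(torch.cosine_similarity(img_feat, aud_feat))  # how aligned are the two modalities?
```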

This is a really interesting paper, and it goes well beyond what I had imagined was possible. Thanks again for a great contribution.

invictus717 commented 11 months ago

You can find the answer in our new project OneLLM.