Closed trouble-maker007 closed 7 months ago
@xukunxkxk Thank you for your reply. In point three, you mentioned that sequences are packed up to the full length of 2048. So if the last packed sentence or the image token ids exceed 2048, will the overflowing ids simply be discarded?
No, we use the overflowing ids at the start of the next sample. This packing strategy is very common in LLM pretraining.
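The packing strategy described above can be sketched roughly as follows. This is a minimal, hypothetical illustration assuming a fixed context length of 2048; the function and variable names are invented for clarity and are not taken from the LaVIT codebase.

```python
# Illustrative sketch of sequence packing with overflow carry-over:
# tokens that do not fit in one 2048-length block are not discarded,
# they become the beginning of the next block.

MAX_LEN = 2048

def pack_sequences(samples):
    """Concatenate per-sample token-id lists into fixed-length blocks."""
    packed, buffer = [], []
    for ids in samples:
        buffer.extend(ids)
        # emit full blocks; leftover tokens stay in the buffer
        while len(buffer) >= MAX_LEN:
            packed.append(buffer[:MAX_LEN])
            buffer = buffer[MAX_LEN:]
    # a final partial block could be padded or dropped; dropped here
    return packed

blocks = pack_sequences([[1] * 1500, [2] * 1500, [3] * 1100])
# 4100 tokens total -> 2 full blocks; the 4 leftover tokens are dropped
```

Note how the second block starts with the tail of a sample that overflowed the first block, rather than starting fresh from a new sample.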
1. The paper mentions that the training data formats include [text, image] and [image, text]. Is each line of the training data in one of the uniform formats [text], [text, image], or [image, text]?
2. Is the multimodal data used for training open-source?
3. What is the maximum training length of LaVIT? Is each sequence packed up to the maximum length from multiple samples like [text, image]?