jy0205 / LaVIT

LaVIT: Empower the Large Language Model to Understand and Generate Visual Content

Questions about the format of the training data #5

Closed trouble-maker007 closed 7 months ago

trouble-maker007 commented 7 months ago

1. The paper mentions that the training data formats include [text, image] and [image, text]. Is each line of the training data in one of the uniform formats [text], [text, image], [image, text]?
2. Is the multimodal data you used open-source?
3. What is the maximum training length of LaVIT? Is a sequence filled up to the maximum length with multiple samples such as [text, image]?

xukunxkxk commented 7 months ago
  1. For a single training sample, we choose only one of the formats [text], [text, image], [image, text].
  2. Yes, all of the multimodal/text data we used is open-source. You can find the description in our paper.
  3. The max length is 2048. For efficiency, we pack samples up to the maximum sequence length, so there is no padding in the pretraining samples.
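A minimal sketch of how a single sample in one of the three formats could be laid out as a token-id sequence. This is not LaVIT's actual preprocessing code; the special-token ids (`BOS_ID`, `EOS_ID`, `IMG_START_ID`, `IMG_END_ID`) and the function `build_sample` are illustrative placeholders:

```python
from typing import List

BOS_ID, EOS_ID = 1, 2            # hypothetical special-token ids
IMG_START_ID, IMG_END_ID = 3, 4  # hypothetical image-boundary token ids

def build_sample(text_ids: List[int], image_ids: List[int], fmt: str) -> List[int]:
    """Lay out one pretraining sample as [text], [text, image], or [image, text]."""
    image_span = [IMG_START_ID] + image_ids + [IMG_END_ID]
    if fmt == "text":
        body = text_ids
    elif fmt == "text_image":
        body = text_ids + image_span
    elif fmt == "image_text":
        body = image_span + text_ids
    else:
        raise ValueError(f"unknown format: {fmt}")
    return [BOS_ID] + body + [EOS_ID]
```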
trouble-maker007 commented 7 months ago

@xukunxkxk Thank you for your reply. Regarding point three, you mentioned that the sequence is filled directly up to a length of 2048. If the last packed sentence or the image token ids exceed 2048, are the exceeding ids simply discarded?

xukunxkxk commented 7 months ago

No, we use the exceeding ids in the next sample. This strategy is very common in LLM pretraining.
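A minimal sketch of this packing strategy, assuming samples are streams of token ids and the overflow is carried into the next packed sequence rather than dropped. The function name `pack_sequences` is illustrative, not from the LaVIT codebase:

```python
from typing import Iterable, Iterator, List

def pack_sequences(samples: Iterable[List[int]], max_len: int = 2048) -> Iterator[List[int]]:
    """Concatenate token-id samples into fixed-length sequences without padding.

    When a sample overflows the current sequence, the exceeding ids stay in the
    buffer and start the next packed sequence instead of being discarded.
    """
    buffer: List[int] = []
    for ids in samples:
        buffer.extend(ids)
        while len(buffer) >= max_len:
            yield buffer[:max_len]       # emit one full-length training sequence
            buffer = buffer[max_len:]    # carry the remainder forward
    if buffer:
        yield buffer                     # trailing partial sequence, if any
```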