PeterDykas opened this issue 2 years ago
Hi @PeterDykas,
Thanks for the question. We did three forward passes, one each for images, texts, and image-text pairs, since the different modalities have different maximum sequence lengths.
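For anyone reading later, here is a minimal sketch of what one iteration with three separate forward passes could look like. The `model`/`optimizer` interface below is hypothetical and simplified, not the actual BEiT-3 API; see the official code at aka.ms/beit3 for the real implementation.

```python
def train_step(model, optimizer, image_batch, text_batch, pair_batch):
    """One training iteration with three forward passes, one per modality.

    Each batch is padded only to its own modality's max length; the three
    losses are summed before a single optimizer step. The `model(...)` call
    returning a scalar loss is an assumed, simplified interface.
    """
    optimizer.zero_grad()

    loss_image = model(images=image_batch)                # image-only pass
    loss_text = model(text_ids=text_batch)                # text-only pass
    loss_pair = model(images=pair_batch["images"],
                      text_ids=pair_batch["text_ids"])    # image-text pass

    total_loss = loss_image + loss_text + loss_pair
    total_loss.backward()
    optimizer.step()
    return total_loss.detach()
```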
Dear @wenhui0924,
I wonder how you mixed the three kinds of data when they are not of equal length.
Thanks!
The code and pre-trained models of BEiT-3 can be found at aka.ms/beit3.
When training BEiT-3 on batches of different modalities, I was wondering whether you did three forward passes per iteration, one for each type of data (image, text, image-text), or batched them all together into a single forward pass.
From my understanding, three separate forward passes followed by computing the loss has the advantage that you can reduce the padding needed, which may help both accuracy and speed. However, batching everything together may also be faster, since you only launch one forward pass per iteration instead of three.
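For concreteness, this is roughly the padding cost I have in mind for the single-pass alternative: every text sequence, regardless of modality, would need to be padded to one shared maximum length. The helper below is only a hypothetical sketch to illustrate that, not code from the BEiT-3 repository.

```python
import torch
import torch.nn.functional as F

def pad_to_shared_length(input_ids: torch.Tensor, shared_max_len: int, pad_id: int = 0):
    """Right-pad a (batch, seq_len) tensor of token ids to one shared max length.

    In a single mixed-modality forward pass, short text-only sequences would be
    padded up to the longest max length used by any modality, wasting compute;
    separate per-modality passes avoid this by padding only to each modality's
    own max length.
    """
    pad_amount = shared_max_len - input_ids.size(1)
    padded = F.pad(input_ids, (0, pad_amount), value=pad_id)
    attention_mask = F.pad(torch.ones_like(input_ids), (0, pad_amount), value=0)
    return padded, attention_mask
```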