Closed — BlueCat7 closed this issue 2 years ago.
Hi, thanks for your awesome work. I have a question about pre-training with the LAION-115M dataset. I found that you add more LAION data as the number of epochs increases (https://github.com/salesforce/BLIP/blob/main/data/pretrain_dataset.py#L39). I guess you want to speed up training, am I right? And how many LAION files do you split into for 20 epochs? I think that training with the full 115M LAION dataset from the beginning might give better results, but it would take more days. Looking forward to your reply, thanks.

Hi, thanks for your question. As mentioned in our paper (footnote in Section 4.1), we split LAION-115M into 5 splits and load one split per epoch. The purpose is indeed to speed up training. I agree that using more data per epoch could further improve results.
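For anyone landing here, a minimal sketch of the one-split-per-epoch idea described above (this is not the repo's actual loading code; the `laion_dir` argument, the `laion_part_{i}.json` file naming, and the `num_splits=5` default are illustrative assumptions):

```python
import json
import os

def load_laion_split(laion_dir, epoch, num_splits=5):
    # Cycle through the splits: epoch 0 -> split 0, epoch 1 -> split 1, ...,
    # epoch num_splits -> split 0 again.
    split_id = epoch % num_splits
    # Hypothetical file layout: one annotation json per split.
    split_path = os.path.join(laion_dir, f'laion_part_{split_id}.json')
    with open(split_path, 'r') as f:
        annotations = json.load(f)
    return annotations
```

With 5 splits over 20 epochs, each split would be loaded 4 times, so the model still sees all 115M pairs across the full run while each individual epoch only iterates over roughly a fifth of them.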
Got it. Thanks a lot.