Training time and grid pseudo label extracting time

sail-sg / ptp

[CVPR2023] The code for 《Position-guided Text Prompt for Vision-Language Pre-training》

https://arxiv.org/abs/2212.09737

Apache License 2.0


Closed by 9115jin 1 year ago

9115jin commented 1 year ago

Hello, I saw the results of your paper and they were truly outstanding. I have a few questions.

1. Could you tell me how long pre-training and fine-tuning take for COCO image-to-text retrieval?
2. From what I read in your paper, obtaining the grid pseudo labels using CLIP takes around 8 hours. Am I right that the grid pseudo labels are a corpus extracted to provide positional information through prompts?

Thank you😁!

FingerRec commented 1 year ago

Hi 9115jin:

1. The training time is included in the training logs. For example, on 8 NVIDIA A100 GPUs, the pre-training log reports "Training time 1 day, 2:32:57" and the fine-tuning log for COCO retrieval reports "Training time 6:53:15".
2. Exactly. Extraction uses only CLIP forward passes to find the most similar keywords/phrases, so it is very fast.
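
To make the second point concrete, here is a minimal sketch of forward-only matching, assuming each grid cell and each keyword already has an embedding. It is not the repository's actual extraction script: the function name `grid_pseudo_labels`, the toy vocabulary, and the hand-written vectors are all hypothetical stand-ins for features that would come from CLIP's image and text encoders.

```python
import numpy as np

def grid_pseudo_labels(cell_embs, text_embs, vocab, grid_h, grid_w):
    """Pick, for every grid cell, the vocabulary entry with the highest
    cosine similarity. Inference only: no gradients, no training."""
    # L2-normalise so that the dot product equals cosine similarity.
    cells = cell_embs / np.linalg.norm(cell_embs, axis=-1, keepdims=True)
    texts = text_embs / np.linalg.norm(text_embs, axis=-1, keepdims=True)
    sims = cells @ texts.T            # shape: (grid_h * grid_w, len(vocab))
    best = sims.argmax(axis=-1)       # index of the best keyword per cell
    labels = [vocab[i] for i in best]
    # Arrange the flat list of labels back into the grid layout.
    return [labels[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]

# Toy data (assumed): 3 keywords with orthogonal embeddings, 2x2 grid.
vocab = ["dog", "sky", "grass"]
text_embs = np.eye(3)
cell_embs = np.array([
    [0.1, 2.0, 0.0],   # closest to "sky"
    [3.0, 0.0, 1.0],   # closest to "dog"
    [0.0, 0.5, 0.6],   # closest to "grass"
    [0.2, 0.1, 0.9],   # closest to "grass"
])
labels = grid_pseudo_labels(cell_embs, text_embs, vocab, grid_h=2, grid_w=2)
print(labels)  # [['sky', 'dog'], ['grass', 'grass']]
```

Because the whole step is one normalisation and one matrix product per image, it needs only a single pass through the encoders, which is consistent with it being fast relative to pre-training.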

9115jin commented 1 year ago

Thank you for your prompt and accurate response!

I'm planning to start researching image-to-text retrieval (TR), and I believe your PTP-BLIP project will be very helpful.