sail-sg / ptp

[CVPR2023] The code for 《Position-guided Text Prompt for Vision-Language Pre-training》
https://arxiv.org/abs/2212.09737
Apache License 2.0

Training time and grid pseudo label extracting time #5

Closed: 9115jin closed this issue 1 year ago

9115jin commented 1 year ago

Hello, I saw the results of your paper and they were truly outstanding. I have a few questions.

  1. Could you tell me how long pre-training and fine-tuning take for COCO image-to-text retrieval?
  2. Also, from what I read in your paper, obtaining the grid pseudo labels with CLIP takes around 8 hours. Am I right in understanding that the grid pseudo labels are a corpus extracted to provide positional information through prompts?

Thank you😁!

FingerRec commented 1 year ago

Hi 9115jin:

  1. The training time is included in the training logs. For example, on 8 NVIDIA A100 GPUs, the pre-training log reports "Training time 1 day, 2:32:57" and the fine-tuning log for COCO retrieval reports "Training time 6:53:15".

  2. Exactly. Extraction needs only a CLIP forward pass to pick the most similar keywords/phrases, so it is very fast.
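
For anyone budgeting compute from the figures in item 1, here is a minimal sketch that converts the two logged durations into wall-clock hours and GPU-hours. It assumes the log strings follow the `str(datetime.timedelta)` format, which is what they look like. `parse_duration` and `gpu_hours` are hypothetical helper names made up for this example; they are not part of the ptp repository.

```python
import re
from datetime import timedelta

# Durations copied verbatim from the reply above.
PRETRAIN = "1 day, 2:32:57"
FINETUNE = "6:53:15"
NUM_GPUS = 8  # "8 NVIDIA A100 GPUs" in the reply

# Assumption: the log prints durations like str(datetime.timedelta),
# i.e. an optional "N day(s), " prefix followed by H:MM:SS.
_DURATION = re.compile(r"(?:(\d+) days?, )?(\d+):(\d{2}):(\d{2})")


def parse_duration(text: str) -> timedelta:
    """Parse a 'D day(s), H:MM:SS' or 'H:MM:SS' string into a timedelta."""
    match = _DURATION.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    # The day group is None when the prefix is absent, so default it to 0.
    days, hours, minutes, seconds = (int(g or 0) for g in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def gpu_hours(duration: timedelta, num_gpus: int) -> float:
    """Wall-clock hours multiplied by the number of GPUs used."""
    return duration.total_seconds() / 3600 * num_gpus


if __name__ == "__main__":
    for name, text in (("pre-training", PRETRAIN), ("fine-tuning", FINETUNE)):
        duration = parse_duration(text)
        wall_hours = duration.total_seconds() / 3600
        print(f"{name}: {wall_hours:.2f} h wall-clock, "
              f"{gpu_hours(duration, NUM_GPUS):.1f} GPU-hours")
```

With the numbers quoted above, this comes to roughly 26.55 wall-clock hours (about 212.4 A100 GPU-hours) for pre-training and 6.89 hours (about 55.1 GPU-hours) for COCO retrieval fine-tuning. Timings on other hardware or batch sizes will differ.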

9115jin commented 1 year ago

Thank you for your prompt and accurate response!

I'm planning to start researching image-to-text retrieval (TR), and I believe your PTP-BLIP project will be very helpful.