microsoft / GLIP

Grounded Language-Image Pre-training

How to use prompt tuning (language_prompt_v2) on custom datasets? #7

Closed bushou-yhh closed 2 years ago

bushou-yhh commented 2 years ago

Hi, thanks for sharing this nice work! I would like to confirm the following questions:

Finally, thanks for this great work!

liunian-harold-li commented 2 years ago

Hi, thank you for raising this great question!

  1. Difference between the 4 prompt tuning variants

v2 is the most "literal" prompt tuning: it freezes both the language model and the rest of the model, adds a bias embedding vector to the prompt embeddings, and tunes only that bias vector (https://github.com/microsoft/GLIP/blob/e496d64141086fbd5f064ee6f4a9b7427f0be101/maskrcnn_benchmark/modeling/rpn/vldyhead.py#L914). Since we are only optimizing this bias vector, v2 usually requires some hyper-parameter search, typically with a larger learning rate and weight decay. Please consider tuning the learning rate with respect to your batch size.
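For illustration, here is a minimal, generic PyTorch sketch of that setup: freeze everything, add a learnable bias to the prompt embeddings, and optimize only the bias. The class name, arguments, and hyper-parameter values below are placeholders, not GLIP's actual modules; see the linked `vldyhead.py` line for the real implementation.

```python
import torch
import torch.nn as nn

class PromptBiasTuner(nn.Module):
    """Sketch of v2-style prompt tuning: backbones frozen, only a learned
    bias added to the prompt embeddings is trained."""

    def __init__(self, frozen_text_encoder: nn.Module, prompt_len: int, hidden_dim: int):
        super().__init__()
        self.text_encoder = frozen_text_encoder
        for p in self.text_encoder.parameters():  # freeze the language model
            p.requires_grad = False
        # the only trainable parameter: one bias vector per prompt token
        self.prompt_bias = nn.Parameter(torch.zeros(prompt_len, hidden_dim))

    def forward(self, prompt_embeddings: torch.Tensor) -> torch.Tensor:
        # prompt_embeddings: (batch, prompt_len, hidden_dim) from the frozen encoder
        return prompt_embeddings + self.prompt_bias

# Only the bias goes to the optimizer; v2 typically wants a larger learning rate
# and weight decay than full fine-tuning (placeholder values below).
# tuner = PromptBiasTuner(bert_encoder, prompt_len=256, hidden_dim=768)
# optimizer = torch.optim.AdamW([tuner.prompt_bias], lr=5e-2, weight_decay=0.25)
```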

v1 is a more relaxed prompt tuning: the whole language model (BERT) is tuned during prompt tuning. We used v1 mainly as a way to test our idea, since it is easier to optimize. You can mostly reuse the fine-tuning hyper-parameters for v1; try setting SOLVER.LANG_LR between 1e-5 and 5e-5.
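If it helps, here is a small sketch of setting that option through the yacs config, assuming GLIP keeps maskrcnn_benchmark's usual config entry point; the config path is a placeholder, and SOLVER.BASE_LR is shown only for contrast.

```python
# Sketch: set a separate learning rate for the BERT text encoder (v1 prompt tuning).
from maskrcnn_benchmark.config import cfg

cfg.merge_from_file("configs/your_finetune_config.yaml")  # placeholder path
cfg.merge_from_list([
    "SOLVER.LANG_LR", 2e-5,   # learning rate for the language model (try 1e-5 to 5e-5)
    "SOLVER.BASE_LR", 1e-5,   # learning rate for the rest of the model
])
```

If GLIP's training scripts follow the usual maskrcnn_benchmark pattern, the same keys can also be passed as command-line overrides instead of being set in code.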

If v2 does not work for you out of the box, we recommend trying v1 first and then tuning v2; conceptually, v1 and v2 should reach similar performance once the hyper-parameters of v2 are tuned well.

v3 is v1 + unfreezing a few more linear layers as in linear probing (e.g., the centerness branch); v4 is v2 + unfreezing a few more linear layers. Practically, we find them to be similar to v1 & v2.
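A rough sketch of the v3/v4 idea, i.e. additionally unfreezing a few task-specific linear layers on top of v1/v2; the keyword strings below are illustrative, so check the actual GLIP head for the real module names.

```python
import torch.nn as nn

def unfreeze_extra_linear_layers(model: nn.Module, keywords=("centerness", "bbox_pred")):
    """Linear-probe a few extra layers on top of prompt tuning (v3/v4 style).

    `keywords` are hypothetical substrings of parameter names, not GLIP's
    actual module names.
    """
    for name, param in model.named_parameters():
        if any(k in name for k in keywords):
            param.requires_grad = True
```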

  2. Prompt tuning performance also depends on your custom dataset. For some datasets that are very different from the pre-training data, we do find that prompt tuning can underperform full fine-tuning (see the detailed per-dataset stats in the paper's Appendix, Tables 14 & 15).

  3. Manual prompts matter most for zero-shot evaluation. Could you share the statistics of your custom dataset? My experience is that, as long as the class names make sense and there is a moderate amount of data, the model should be able to learn well without good manual prompts.

In fact, prompt tuning is, from my understanding, the "automated way to set prompt embeddings". It is similar to the soft-prompt approach in Lester et al.

bushou-yhh commented 2 years ago

Hi author, thanks for your answers and suggestions!! I will give it a try!! I have only just started learning about multimodal and vision-language models, so many parts of GLIP confuse me. My custom dataset is VisDrone, which defines ten object categories of interest: pedestrian, person, car, van, bus, truck, motor, bicycle, awning-tricycle, and tricycle. VisDrone contains 10,209 static images (6,471 for training, 548 for validation, and 3,190 for testing) captured by drone platforms in different places at different heights.
The attached table has more information. When I change the category name motor --> motorcycle, the mAP goes from 0.044842252 to 0.1143073. It's amazing!!! (zero-shot, pre-trained model: glip_tiny_model_o365_goldg.pth). Adding the suffix "who maintains standing pose or walking" to pedestrian also gives a large improvement (0.019298928 -> 0.084613924). With full tuning, a better prompt also gives a better result. What I still cannot understand is the "automated way to set prompt" embeddings; I need to learn more about prompts. I have also noticed some prompting methods for the CLIP model, like CoOp and CoCoOp. Does prompt tuning in GLIP generate some vectors during training and then find the vectors that give the best result?
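As a rough illustration of that effect (this is not GLIP's actual API, just a sketch of how the renamed and augmented class phrases end up in the text prompt):

```python
# Renamed / augmented VisDrone category phrases used to build the zero-shot prompt.
VISDRONE_PROMPTS = {
    "pedestrian": "pedestrian who maintains standing pose or walking",
    "person": "person",
    "car": "car",
    "van": "van",
    "bus": "bus",
    "truck": "truck",
    "motor": "motorcycle",  # renaming alone improved zero-shot mAP noticeably
    "bicycle": "bicycle",
    "awning-tricycle": "awning-tricycle",
    "tricycle": "tricycle",
}

# GLIP consumes the categories as one caption string; one common format is to
# join the phrases with " . " so each phrase can be grounded separately
# (check GLIP's demo/config for the exact format it expects).
caption = " . ".join(VISDRONE_PROMPTS.values())
```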

One more thing: there appear to be two identical rows in tools/finetune.py (see the attached screenshot).

DietDietDiet commented 1 year ago

@bushou-yhh Hi, I'm just wondering whether your result from prompt tuning a VL model on the VisDrone dataset is comparable to a normal fine-tuning process in the pure visual modality. Do you have any clues?

bushou-yhh commented 1 year ago

> Hi, I'm just wondering whether your result from prompt tuning a VL model on the VisDrone dataset is comparable to a normal fine-tuning process in the pure visual modality. Do you have any clues?

In my experience, VLMs work much better than pure vision models under the same settings (no patching, etc.). GLIP's pre-training learns good concepts that transfer well to VisDrone, and at the same time GLIP can do open-vocabulary object detection, which is worth exploring. (But I don't work on aerial imagery anymore, so my knowledge may be outdated.)

DietDietDiet commented 1 year ago

@bushou-yhh Thanks for the information! So are you using the VLM as a pre-trained model to fine-tune on visual tasks, or fine-tuning it as a grounding task?