Tips for fine-tune - Githubissues

wanghao9610 / OV-DINO

Official implementation of OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion

https://wanghao9610.github.io/OV-DINO

Apache License 2.0

234 stars 11 forks source link

Tips for fine-tune #26

Closed Fiorentinar closed 2 months ago

Fiorentinar commented 2 months ago

Thank you very much for your work!

I currently have a dataset with the following characteristics: high resolution(1920*1080), very limited data amount(~500 frames), and specific category names( just like 'bottom of trunk' or 'top of the traffic sign'). Specifically, these are all small objects for detection. It can be said that the dataset differs significantly from all the pre-trained datasets.

My question is that if I want to use OV-DINO for fine-tuning, are there any techniques or tips to help me bridge the gaps in resolution, data amount, object size, and category names (or just bridge some of these gaps)? Thanks a lot!

wanghao9610 commented 2 months ago

Hello @Fiorentinar , following are some advices for fine-tuning:

More and larger input resolutions, OV-DINO default is [800, 1333], change them in custom_ovd.py.
Fixe the text encoder, as your fine-tuning data scale is small, refer to issue 21.
Keep the category name as simple as possible, as the pre-training category names are nouns or phrase of a single class, complicated text description may not matchd in the pre-training stage. For certain perfomance, you need to run experiments to test it.

Other settings no need to change.

Fiorentinar commented 2 months ago

Thank you very much for the tips you've provided. I'm particularly curious about the following two points:

In the scenario I described, would fixing the text encoder affect the fine-tuning of category labels(for those labels that are not commonly present in pre-trained dataset)?
For small object detection, do I need to modify the corresponding network structure, just similar to the changes from YOLOv8 to YOLOv8-p2 (where lower-dimensional backbone features are also fed into the neck and head)?

wanghao9610 commented 2 months ago

@Fiorentinar Regarding your question:

You can run experiments with fixed and unfixed text encoders, and better performance is expected.
Modifying the framework will not load the pre-trained model, which may lead to worse results.

Fiorentinar commented 2 months ago

Thank you for your patient response; all my questions have been answered.