microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
MIT License

Prompt Preparation of Kosmos-2 Object Detection Fine-tuning #1576

Open KevinHooah opened 2 weeks ago

KevinHooah commented 2 weeks ago

Describe Model I am using: Kosmos-2

Hi! I am fine-tuning the Kosmos-2 model for my own application. In short, the target may appear multiple times in an image (e.g., cars in a parking lot), but some images contain only one target.

Right now, I am preparing the dataset as follows:

    # More than one annotated box: use the plural phrase; otherwise the singular one.
    if len(bboxes) > 1:
        text = "<grounding>" + "<phrase> several {target}s</phrase>"
    else:
        text = "<grounding>" + "<phrase> a {target}</phrase>"
    data_list.append({'bbox': [bboxes], 'image': image, 'text': text})

In this code, bboxes holds the human-annotated bounding boxes as a list of lists of tuples, and {target} is a placeholder for my target, which is a noun.
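To make the placeholder substitution concrete, here is a minimal, self-contained sketch of the prompt construction described above. The helper name build_prompt and its arguments are hypothetical and not part of the Kosmos-2 codebase; it only assumes that the plural form is the target noun plus "s", as in the snippet.

```python
def build_prompt(target: str, num_boxes: int) -> str:
    """Build the grounding prompt for one image (hypothetical helper)."""
    if num_boxes < 1:
        raise ValueError("expected at least one bounding box")
    # Plural phrase when the image has several annotated boxes, singular otherwise.
    if num_boxes > 1:
        phrase = f"several {target}s"
    else:
        phrase = f"a {target}"
    # The leading space after <phrase> mirrors the format used in the snippet above.
    return f"<grounding><phrase> {phrase}</phrase>"


print(build_prompt("car", 3))  # <grounding><phrase> several cars</phrase>
print(build_prompt("car", 1))  # <grounding><phrase> a car</phrase>
```

Calling it with len(bboxes) as num_boxes would yield the same text values as the if/else branch above, with the target name already filled in.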

When I train the model with these prompts, it still outputs exactly one bounding box for the target, even when there are multiple targets in the image.

For example, if the target is "car", the model outputs a bounding box for only one of the multiple cars in the image.

How can I solve this issue?

Note: "car" is a random example; the actual target is something we believe is rare in the Kosmos-2 pre-training data.

pengzhiliang commented 1 week ago

Hello, as you know, we haven't fine-tuned this model on any specific object detection dataset, so we cannot control how many bboxes the model will generate; it could be one or multiple. Perhaps you can try some different prompts:

Describe this image in detail. a {target} / several {targets}