microsoft / GLIP

Grounded Language-Image Pre-training

Custom Dataset - Some guidance #24

Closed · fernandorovai closed this issue 2 years ago

fernandorovai commented 2 years ago

Hi there, I'm confused by the terms tokens_positive / tokens_negative and how they relate to the image caption itself.

What should the image caption be if I have multiple objects with different attributes in the same image? For instance: a pink elephant, a blue elephant, and a normal elephant in the same image. Should the caption in the annotation file be "blue elephant,normal elephant,pink elephant"? And for the boxes, should each elephant's tokens_positive point to the corresponding span of the caption? For example:

- for blue elephant => { "category_id": 1, "bbox": [xMin, yMin, width, height], "tokens_positive": [0, 13] }
- for normal elephant => { "category_id": 1, "bbox": [xMin, yMin, width, height], "tokens_positive": [14, 29] }
- for pink elephant => { "category_id": 1, "bbox": [xMin, yMin, width, height], "tokens_positive": [30, 43] }

with "categories": [{ "supercategory": "animal", "id": 1, "name": "elephant" }]

Do you know any guide for creating the dataset? Thanks!

Haotian-Zhang commented 2 years ago

@fernandorovai Thank you for the question. To create a dataset, you can start by taking a look at the Flickr30k-entities ground-truth file, which can be downloaded from the MDETR annotations, and follow its format. We only use tokens_positive here. If you want to try your own data consisting of image-text pairs only, I recommend using NER to extract the positive tokens; here are some code pieces in our repo that can be referred to: https://github.com/microsoft/GLIP/blob/fd52c6361f013e70ae7682d90b3ab3ca2bd5e6bc/maskrcnn_benchmark/engine/predictor_glip.py#L107-L127.
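Roughly, the idea looks like this (a minimal sketch, not the exact code from predictor_glip.py; it uses NLTK noun-phrase chunking and assumes the relevant nltk tokenizer/tagger data has been downloaded):

```python
# Sketch of noun-phrase extraction for tokens_positive.
# Assumes: pip install nltk, plus nltk.download("punkt") and
# nltk.download("averaged_perceptron_tagger") have been run.
import nltk

def extract_positive_spans(caption):
    """Return [start, end) character spans of noun phrases found in `caption`."""
    tokens = nltk.word_tokenize(caption)
    tagged = nltk.pos_tag(tokens)
    # Toy grammar: zero or more adjectives followed by one or more nouns.
    chunker = nltk.RegexpParser("NP: {<JJ>*<NN.*>+}")
    tree = chunker.parse(tagged)

    spans, cursor = [], 0
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
        phrase = " ".join(tok for tok, _ in subtree.leaves())
        start = caption.find(phrase, cursor)
        if start != -1:  # skip phrases whose tokenization changed the surface text
            spans.append([start, start + len(phrase)])
            cursor = start + len(phrase)
    return spans

print(extract_positive_spans("Pink elephant, blue elephant and normal elephant"))
# e.g. [[0, 13], [15, 28], [33, 48]]
```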

fernandorovai commented 2 years ago

@Haotian-Zhang thanks so much for the instructions and for this awesome project. If I have multiple objects of the same class but with different attributes (the elephant example) in the same image, should I add the same image multiple times, with a different caption for each object?

Haotian-Zhang commented 2 years ago

@fernandorovai Thank you for your question and support. In fact, for the contextualized detection (grounding) problem, there is no concept of a "class". You only need one caption per image. For example, if an image ends up with the caption "Pink elephant, blue elephant and normal elephant", the input label is just the positive spans = [[start char position of "pink elephant", end char position of "pink elephant"], [start char position of "blue elephant", end char position of "blue elephant"], [start char position of "normal elephant", end char position of "normal elephant"]] (assuming the noun phrases are "pink elephant", "blue elephant" and "normal elephant").
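Concretely, for that exact caption the spans work out as follows (a tiny illustrative snippet; offsets are 0-based with exclusive ends, matching the [0, 13] span for "blue elephant" in your example above):

```python
caption = "Pink elephant, blue elephant and normal elephant"
phrases = ["Pink elephant", "blue elephant", "normal elephant"]
# [start, end) character offsets of each noun phrase within the caption.
positive_spans = [[caption.index(p), caption.index(p) + len(p)] for p in phrases]
print(positive_spans)  # [[0, 13], [15, 28], [33, 48]]
```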

Hope that helps! Thanks!

fernandorovai commented 2 years ago

@Haotian-Zhang Sorry for my ignorance and thanks for all the guidance!

Haotian-Zhang commented 2 years ago

@fernandorovai No worries at all! I'll close the issue now, please feel free to reopen it if you have further questions.