microsoft / GLIP

Grounded Language-Image Pre-training

Can I use a bigger batch size in inference.py? #97

Open CrapbagMo opened 1 year ago

CrapbagMo commented 1 year ago

Hi, thanks for your great work. In inference.py,

 if task == "detection":
                        captions = [all_queries[query_i] for ii in range(len(targets))]
                        positive_map_label_to_token = all_positive_map_label_to_token[query_i]
                    elif task == "grounding":
                        captions = [t.get_field("caption") for t in targets]
                        positive_map_eval = [t.get_field("positive_map_eval") for t in targets]
                        if cfg.MODEL.RPN_ARCHITECTURE == "VLDYHEAD":
                            plus = 1
                        else:
                            plus = 0
                        assert(len(positive_map_eval) == 1) # Let's just use one image per batch
                        positive_map_eval = positive_map_eval[0]
                        positive_map_label_to_token = create_positive_map_label_to_token_from_positive_map(positive_map_eval, plus=plus)
                    output = model(images, captions=captions, positive_map=positive_map_label_to_token)
                    output = [o.to(cpu_device) for o in output]

Can you please explain the comment # Let's just use one image per batch? Can I use a bigger batch size?

liunian-harold-li commented 1 year ago

Hi, this is because for inference we need to provide this positive_map_label_to_token field, which basically specifies from which token positions we want to predict boxes. This field is different for different prompts, and dealing with multiple positive_map_label_to_token fields was a bit cumbersome, so we only used batch_size = 1.
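For intuition, the field is roughly a dictionary mapping each label id to the token positions its phrase occupies in the tokenized prompt. A minimal illustrative sketch (the prompt and token offsets below are made up, not GLIP's actual output; the real map comes from create_positive_map_label_to_token_from_positive_map):

 # Illustrative only: a possible positive_map_label_to_token for the detection
 # prompt "person. bicycle. car.". Each label id maps to the token positions of
 # its phrase after tokenization; the offsets here are invented for the example.
 positive_map_label_to_token = {
     1: [1],      # token(s) of "person"
     2: [3, 4],   # token(s) of "bicycle" (a phrase can split into sub-words)
     3: [6],      # token(s) of "car"
 }
 # Two images with different captions tokenize differently, so each would need
 # its own map; that is why inference.py asserts batch_size == 1.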

JJ-xiaomao commented 1 year ago

> Hi, this is because for inference we need to provide this positive_map_label_to_token field, which basically specifies from which token positions we want to predict boxes. This field is different for different prompts, and dealing with multiple positive_map_label_to_token fields was a bit cumbersome, so we only used batch_size = 1.

I want to use two GPUs to run inference. Can I set the images per batch to 2 and the world_size to 2, so that the images per batch on each GPU is still 1?

I have tried it and it works. It would be even better if the batch size could be increased during grounding.
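For what it's worth, that trick is consistent with the maskrcnn-benchmark convention GLIP builds on, where TEST.IMS_PER_BATCH is a global batch size divided across ranks, so 2 images over a world_size of 2 keeps each GPU at the single image the assert requires (inferred from the framework convention, not verified against every GLIP config). If a true per-GPU batch > 1 for grounding is wanted, below is a rough sketch of how the assert could be lifted by giving each image its own positive map and running one-image sub-batches. It reuses the names from the snippet above and folds the two trailing output lines into the loop; the use of to_image_list to rebuild a one-image input is an assumption, not code from the repo:

 # Hypothetical drop-in for the grounding branch above; not repo code.
 # Builds one positive_map_label_to_token per image, then runs the model on
 # one-image sub-batches so each prompt keeps its own map.
 elif task == "grounding":
     plus = 1 if cfg.MODEL.RPN_ARCHITECTURE == "VLDYHEAD" else 0
     output = []
     for i, t in enumerate(targets):
         caption = t.get_field("caption")
         positive_map_eval = t.get_field("positive_map_eval")
         positive_map_label_to_token = create_positive_map_label_to_token_from_positive_map(
             positive_map_eval, plus=plus)
         # Assumed helper: maskrcnn_benchmark.structures.image_list.to_image_list,
         # used here to wrap padded image i back into a one-image batch.
         single_image = to_image_list(images.tensors[i])
         out = model(single_image, captions=[caption], positive_map=positive_map_label_to_token)
         output.extend(o.to(cpu_device) for o in out)

This keeps the per-image maps separate at the cost of one forward pass per image, so it trades away the batching speed-up; batching the maps inside a single forward pass would need changes to how positive_map is consumed by the model.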