jefferyZhan / Griffon

【ECCV2024】The official repo of Griffon series
Apache License 2.0

Demo not working #17

Open VIXIXIVIIIX opened 2 weeks ago

VIXIXIVIIIX commented 2 weeks ago

When I run these three demos:

Localize Single Referent

bash demo/demo.sh demo/1v1.jpg "Is there a motorcycle on the far left of the photo?"

Multi Categories with Multi Objects

bash demo/demo.sh demo/nvn.jpg "Examine the image for any objects from the category set. Report the coordinates of each detected object. The category set includes person, bicycle, car, motorcycle, airplane, bus, train, truck, boat, traffic light, fire hydrant, stop sign, parking meter, bench, bird, cat, dog, horse, sheep, cow, elephant, bear, zebra, giraffe, backpack, umbrella, handbag, tie, suitcase, frisbee, skis, snowboard, sports ball, kite, baseball bat, baseball glove, skateboard, surfboard, tennis racket, bottle, wine glass, cup, fork, knife, spoon, bowl, banana, apple, sandwich, orange, broccoli, carrot, hot dog, pizza, donut, cake, chair, couch, potted plant, bed, dining table, toilet, tv, laptop, mouse, remote, keyboard, cell phone, microwave, oven, toaster, sink, refrigerator, book, clock, vase, scissors, teddy bear, hair drier, toothbrush. The output format for each detected object is class-name-[top-left coordinate, bottom-right coordinate] e.g. person-[0.001, 0.345, 0.111, 0.678]. Concatenate them with &."

One Category with Multi Objects

bash demo/demo.sh demo/1vn.jpg "In this picture, identify and locate all the people in the front."

Only the second command (Multi Categories with Multi Objects) produces the right output format, but the result itself is wrong:

[screenshot]

When I run the first and third commands, the output looks like this:

[screenshot]

JHYsama commented 2 weeks ago

@oyzh-oyzh Your ViT has not been modified according to the author's instructions yet.

VIXIXIVIIIX commented 2 weeks ago

Hi @JHYsama, thanks for replying. I am facing the same problem as issue #18. I did run the resize script, but when I copy the config file, something goes wrong, as in issue #18. Could you tell me how to modify the ViT correctly?

jefferyZhan commented 2 weeks ago

Hi, could you please check the position embedding size in your modified ViT model checkpoint? If you hit the same error as issue #18, it indicates you are still using the 224-resolution ViT. We'll update the inference script, together with the G version model, so that manual modification is no longer needed.
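For anyone who wants to run that check, here is a minimal sketch rather than anything from the repo. It assumes a Hugging Face style CLIP vision checkpoint, where the learned position table is stored under `vision_model.embeddings.position_embedding.weight`, and a patch size of 14 (ViT-L/14). The helper names are made up for illustration. The number of rows in that table is (image_size / patch_size)^2 + 1, so the resolution the checkpoint was built for can be read back from its shape.

```python
import math

# Key used by Hugging Face CLIP vision checkpoints for the learned
# position embedding table (shape: [num_positions, hidden_size]).
# Assumption: your checkpoint follows this naming.
POSITION_KEY = "vision_model.embeddings.position_embedding.weight"


def expected_num_positions(image_size: int, patch_size: int) -> int:
    """Patches per side squared, plus one slot for the class token."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    return (image_size // patch_size) ** 2 + 1


def infer_image_size(num_positions: int, patch_size: int) -> int:
    """Invert the formula above to recover the input resolution."""
    side = math.isqrt(num_positions - 1)
    if side * side != num_positions - 1:
        raise ValueError(f"{num_positions} is not grid*grid + 1")
    return side * patch_size


def checkpoint_image_size(shapes: dict, patch_size: int = 14) -> int:
    """`shapes` maps parameter names to shape tuples, e.g. built with
    {k: tuple(v.shape) for k, v in state_dict.items()}."""
    num_positions = shapes[POSITION_KEY][0]
    return infer_image_size(num_positions, patch_size)


if __name__ == "__main__":
    # A stock ViT-L/14 at 224 px has 16*16 + 1 = 257 positions.
    print(expected_num_positions(224, 14))
    stock = {POSITION_KEY: (257, 1024)}
    print(checkpoint_image_size(stock))
```

If `checkpoint_image_size` reports 224 for the checkpoint you believe you resized, the resize step did not take effect or the wrong checkpoint path is being loaded.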

VIXIXIVIIIX commented 2 weeks ago

Thanks for your reply. I just solved the problem by setting ignore_mismatched_sizes=True in llava.model.multimodal_encoder.clip_encoder.
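For readers looking for the shape of that change, a rough sketch follows. It assumes the vision tower in that module is loaded through `CLIPVisionModel.from_pretrained` from `transformers`, which is how upstream LLaVA does it. The wrapper name `load_vision_tower` is hypothetical and not part of the repo. `ignore_mismatched_sizes` is a standard `from_pretrained` argument.

```python
def load_vision_tower(model_name_or_path: str, ignore_mismatched_sizes: bool = True):
    """Load a CLIP vision tower without failing on shape mismatches.

    Hypothetical helper for illustration; in the repo the equivalent change
    would be made at the existing from_pretrained call.
    """
    # Imported lazily so this file can be imported without transformers installed.
    from transformers import CLIPVisionModel

    return CLIPVisionModel.from_pretrained(
        model_name_or_path,
        # Skip checkpoint tensors whose shape differs from the config
        # instead of raising a size-mismatch error.
        ignore_mismatched_sizes=ignore_mismatched_sizes,
    )
```

One caveat: with this flag, any tensor that is skipped is left newly initialized rather than loaded, and `transformers` logs a warning listing those tensors. It is worth reading that warning to confirm which weights were affected.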