I followed the script from https://github.com/microsoft/Oscar/blob/master/VinVL_MODEL_ZOO.md to run inference on the COCO test set for image captioning, using the model "coco_captioning_large_xe.zip".
But the generated captions always seem to describe a different image.

For example, for image id 391895, the "od_label" is: "man helmet scooter sky ground ground mountain wheel man shirt mountain road mountain shoe trees rocks bike wheel post bridge grass fence jeans mirror bush grass mountains road bridge people trees tree man post bag motorcycle bridge", but the resulting caption is: "a group of men standing next to each other".

For image id 60623, the "od_label" is: "woman bowl woman people hair man fork wrist woman face hand watch finger spoon eye shirt woman fork person table hand flame person hair ring spoon candle glass", but the resulting caption is: "a group of men sitting on top of a couch".
Did I miss something or do something wrong?
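One thing worth ruling out: captions that consistently describe the wrong image can happen when the region-feature TSV and the label/caption TSV are not in the same row order, so row i of one file gets paired with row i of the other for a different image id. Below is a minimal, self-contained sketch of such an alignment check, assuming an Oscar-style TSV layout where the image id is the first tab-separated column; the file names (`features.tsv`, `labels.tsv`) and the tiny demo rows are made up for illustration, not taken from the actual dataset.

```python
import csv
import tempfile
from pathlib import Path

def tsv_ids(path):
    # Collect the first tab-separated column of each row,
    # assumed here to be the image id.
    with open(path, newline="") as f:
        return [row[0] for row in csv.reader(f, delimiter="\t")]

def alignment_mismatches(a_path, b_path):
    # Return (row_index, id_in_a, id_in_b) for every row whose
    # image ids disagree between the two files.
    ids_a, ids_b = tsv_ids(a_path), tsv_ids(b_path)
    return [(i, x, y) for i, (x, y) in enumerate(zip(ids_a, ids_b)) if x != y]

# Tiny demo: the two hypothetical files list the same ids in different order,
# so a row-by-row pairing would mix up the images.
tmp = Path(tempfile.mkdtemp())
(tmp / "features.tsv").write_text("391895\tfeat_a\n60623\tfeat_b\n")
(tmp / "labels.tsv").write_text("60623\tlabel_b\n391895\tlabel_a\n")

print(alignment_mismatches(tmp / "features.tsv", tmp / "labels.tsv"))
```

If this prints a non-empty list for your feature and label files, the inference script is likely pairing features with the wrong labels, which would explain captions that match a different image.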