This is because each element in the widget captioning dataset may correspond to multiple captions. For example, a delete icon on an app screen can be described as "delete", "dust pin and water drink reminder", and "move to trash". It is common for image captioning datasets to provide multiple captions for one image.
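To make the one-element, many-captions layout concrete, here is a minimal sketch that groups entries by image and bounding box. The field names (`img_filename`, `instruction`, `bbox`) are taken from the entries quoted in this issue; `group_captions` is a hypothetical helper written for illustration, not part of the repository.

```python
from collections import defaultdict

# Entries copied from the example quoted in this issue.
BBOX = [0.8395833333333333, 0.79296875, 0.9854166666666667, 0.828515625]
entries = [
    {"img_filename": "57800.jpg", "instruction": "delete", "bbox": BBOX},
    {"img_filename": "57800.jpg", "instruction": "dust pin and water drink reminder", "bbox": BBOX},
    {"img_filename": "57800.jpg", "instruction": "move to trash", "bbox": BBOX},
]


def group_captions(entries):
    """Hypothetical helper: collect all captions that share one (image, bbox) pair."""
    grouped = defaultdict(list)
    for entry in entries:
        # Lists are not hashable, so the bbox is converted to a tuple for the key.
        key = (entry["img_filename"], tuple(entry["bbox"]))
        grouped[key].append(entry["instruction"])
    return dict(grouped)


grouped = group_captions(entries)
for (img, bbox), captions in grouped.items():
    print(img, bbox, captions)
```

If the three entries really describe one widget, they collapse into a single key with three captions. Entries for genuinely different widgets on the same screen would be expected to produce separate keys with different boxes.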
Got it, thanks for your response!
The bbox annotations in your fine-tuning dataset (widget caption) may be wrong. Every element on the same screen gets the same bbox annotation, which should be impossible. For example:
{ "img_filename": "57800.jpg", "instruction": "delete", "bbox": [ 0.8395833333333333, 0.79296875, 0.9854166666666667, 0.828515625 ] },
{ "img_filename": "57800.jpg", "instruction": "dust pin and water drink reminder", "bbox": [ 0.8395833333333333, 0.79296875, 0.9854166666666667, 0.828515625 ] },
{ "img_filename": "57800.jpg", "instruction": "move to trash", "bbox": [ 0.8395833333333333, 0.79296875, 0.9854166666666667, 0.828515625 ] },