Open geek-APTX4869 opened 3 weeks ago

Dear author: How can I reproduce the results in Table 1?

Hi @geek-APTX4869! Thank you for your interest, and apologies for the delay in responding; I wasn't receiving notifications about activity in the repository.
To reproduce the results in Table 1, we perform the following steps (rough code sketches for the main steps follow the list):

1. **Dataset preparation:** Create a synthetic dataset using Stable Diffusion or a similar image generation model (see the generation sketch after this list). In our case:
   - **Ground truth:** We manually annotated the ground-truth masks with the CVAT tool to avoid the bias introduced by other segmentation models. While some other works generate ground truth via model-based segmentation, we opted for human annotation for higher fidelity.
   - **Mask extraction:** We generated masks with each evaluated method. The process generally involved:
     - Extracting the attention maps associated with the class-related word in the prompt.
     - Applying each method's preprocessing and thresholding to convert the attention maps into binary masks (see the thresholding sketch after this list).
     - For token optimization, instead of using the prompt-word embedding, we used a token optimized on a separate image (not included in the evaluation set). This lets us extract attention maps for words that are not part of the prompt used to generate the image.
2. **Evaluation:** We computed the mIoU between the ground-truth masks and the binary masks produced by each method (see the mIoU sketch after this list).
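Below are a few rough sketches of these steps; they are not the exact scripts we used. First, generating an evaluation image with Stable Diffusion via the Hugging Face `diffusers` library (the checkpoint, prompt, seed, and sampling settings here are placeholders, not necessarily what we used for the paper):

```python
# Minimal sketch: generate one synthetic evaluation image with Stable Diffusion.
# The checkpoint, prompt, and sampling settings are illustrative only.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a photo of a dog on a beach"          # class-related word: "dog"
generator = torch.Generator("cuda").manual_seed(0)  # fix the seed for reproducibility
image = pipe(prompt, num_inference_steps=50, generator=generator).images[0]
image.save("dog_0.png")
```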
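Second, a minimal sketch of converting an extracted cross-attention map for the class word into a binary mask. It assumes `attn` is already an aggregated 2D attention map (e.g., averaged over heads, layers, and timesteps); the smoothing, normalization, and Otsu threshold are illustrative, since each method in Table 1 applies its own preprocessing and thresholding recipe:

```python
# Minimal sketch: turn a cross-attention map for one token into a binary mask.
# `attn` is assumed to be an aggregated 2D attention map (e.g., 16x16 or 64x64).
import numpy as np
from PIL import Image
from scipy.ndimage import gaussian_filter
from skimage.filters import threshold_otsu

def attention_to_mask(attn: np.ndarray, out_size: int = 512) -> np.ndarray:
    attn = gaussian_filter(attn, sigma=1.0)                         # light smoothing
    attn = (attn - attn.min()) / (attn.max() - attn.min() + 1e-8)   # normalize to [0, 1]
    attn = np.array(
        Image.fromarray((attn * 255).astype(np.uint8)).resize(
            (out_size, out_size), Image.BILINEAR                    # upsample to image resolution
        ),
        dtype=np.float32,
    ) / 255.0
    return (attn > threshold_otsu(attn)).astype(np.uint8)           # binarize
```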
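Finally, a minimal sketch of the mIoU computation over binary masks (both ground truth and prediction as 0/1 arrays of the same shape); the exact aggregation used for Table 1 may differ in detail:

```python
# Minimal sketch: mean IoU between ground-truth and predicted binary masks.
import numpy as np

def iou(gt: np.ndarray, pred: np.ndarray) -> float:
    inter = np.logical_and(gt, pred).sum()
    union = np.logical_or(gt, pred).sum()
    return inter / union if union > 0 else 1.0   # empty-vs-empty counts as a perfect match

def mean_iou(gt_masks, pred_masks) -> float:
    return float(np.mean([iou(g, p) for g, p in zip(gt_masks, pred_masks)]))
```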
You can find a link to the dataset with generated images and annotations for evaluation in the README. Please feel free to reach out if you have further questions!