Performance Issues with Transferability of Generated Adversarial Images

Hello, congratulations on your outstanding work! I have been exploring the transferability of adversarial images generated with your method, but in my experiments they transfer poorly to other models.

Setting

Clean Image: ImageNet
Target Image: COCO image
Encoder: ViT-B/32
Decoder: checkpoints/coco_cos.pt

I followed Steps 1, 2, and 4 to generate adversarial images. My goal is to verify the transferability of these generated adversarial images in the image captioning task using the InstructBLIP and LLaVA1.5 models.

Result

Although the adversarial images were generated, the captions both models produce for them still closely resemble the captions of the clean images rather than that of the target image. For instance, the target image's caption is:

"A man with a red helmet on a small moped on a dirt road."

However, the captions generated by InstructBLIP and LLaVA1.5 are:

# LLaVA1.5
The image features a large fish lying on a table, with a fishing rod and reel nearby.
The fishing rod is positioned on the left side of the fish, while the reel is located on the right side. 
The fish appears to be a brown and white color, and it is placed on a surface that resembles a table or a countertop. 
The scene captures the essence of a fishing experience, with the fish being caught and prepared for further use or consumption.

# InstructBLIP
This image is a digital manipulation of a fish and a fishing pole, with the fish appearing to be a large, brown fish. 
The fish is lying on a piece of cloth, possibly a tarp or a blanket, and the fishing pole is positioned next to it. 
The image is a digital manipulation, with the fish and the pole appearing to be in a distorted or blurred state.
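
To put a rough number on the mismatch, I also compared the captions lexically. The snippet below is only a sketch of that check: it uses a plain word-overlap (Jaccard) score as a crude proxy, not any metric from the AnyAttack evaluation, and the helpers `tokens` and `jaccard` as well as the small stop-word list are my own assumptions. Only the first sentence of each model output above is used.

```python
import re

# Assumption: a tiny hand-picked stop-word list, just enough to ignore filler words.
STOP_WORDS = {"a", "an", "the", "on", "of", "with", "and", "is", "to", "be", "it", "or"}


def tokens(text):
    """Return the set of lowercase content words in `text`."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS}


def jaccard(a, b):
    """Word-level Jaccard similarity between two captions (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


# Caption of the target image, copied from this issue.
target = "A man with a red helmet on a small moped on a dirt road."

# First sentence of each model's caption for the adversarial image.
captions = {
    "LLaVA1.5": "The image features a large fish lying on a table, "
                "with a fishing rod and reel nearby.",
    "InstructBLIP": "This image is a digital manipulation of a fish and a fishing pole, "
                    "with the fish appearing to be a large, brown fish.",
}

scores = {name: jaccard(target, cap) for name, cap in captions.items()}
for name, score in scores.items():
    print(f"{name}: overlap with target caption = {score:.3f}")
```

Under these assumptions both scores come out as 0.000: neither caption shares a single content word with the target caption. A semantic measure such as an embedding-based similarity would be a more faithful check, but even this crude one shows no sign of the target content.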

Question

Is there another configuration or method I should use to enhance the transferability of the adversarial images? Alternatively, could you provide some adversarial images generated in your experiments that demonstrate successful transferability?

Thank you for your time and assistance.

jiamingzhang94 / AnyAttack