r00dY / ai-design-benchmark

The repo provides a simple experiment to test whether GPT-4V understands design rules.

Better prompting might give better zero-shot results #1

Open tohrnii opened 8 months ago

tohrnii commented 8 months ago

I ran the experiment with a slight tweak to the prompt: ask for the reasoning first and then the answer. Got 8/12 correct. Here are the results:

  1. Correct
  2. Correct
  3. Correct
  4. Incorrect (given B should be A)
  5. Correct
  6. Correct
  7. Correct
  8. Incorrect (given D should be B)
  9. Incorrect (given C should be A)
  10. Correct
  11. Correct
  12. Incorrect (given D should be B)
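
For reference, if someone wants to run this through the API rather than the chat interface, a rough sketch (untested; I actually used the ChatGPT interface, and the model name, paths, and token limit here are placeholders):

import base64
from openai import OpenAI  # assumes the v1 openai Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The tweaked reasoning-first prompt (full text shared in a comment below)
PROMPT = "<reasoning-first prompt goes here>"

def ask_about_image(image_path):
    # Send one combined screenshot plus the prompt to a vision model
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # placeholder; any vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        max_tokens=500,
    )
    return response.choices[0].message.content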
tohrnii commented 8 months ago

Code to generate the images (courtesy of GPT-4):

import os
import random
from PIL import Image, ImageDraw, ImageFont

def draw_text_with_outline(draw, text, position, font, text_color, outline_color):
    # Draw outline
    x, y = position
    for adj in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
        draw.text((x + adj[0], y + adj[1]), text, font=font, fill=outline_color)
    # Draw text
    draw.text(position, text, font=font, fill=text_color)

def label_and_combine_images(input_dir, output_dir):
    # Create the output directory if it doesn't exist
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    # Loop through each subdirectory in the input directory
    for subdir in os.listdir(input_dir):
        subdir_path = os.path.join(input_dir, subdir)
        if os.path.isdir(subdir_path):
            image_paths = [
                os.path.join(subdir_path, filename)
                for filename in sorted(os.listdir(subdir_path))
                if filename.lower().endswith(('.png', '.jpg', '.jpeg'))
            ]
            # Randomize the order of the images
            random.shuffle(image_paths)
            images = []
            # Loop through each image file in the randomized list
            for i, img_path in enumerate(image_paths[:4]):
                img = Image.open(img_path).convert("RGB")  # normalize mode so labels and paste behave consistently

                # Draw label on image
                draw = ImageDraw.Draw(img)
                # Use a TrueType font with a larger size, falling back to
                # Pillow's default bitmap font if neither file is present
                font_size = 60
                try:
                    font = ImageFont.truetype("arial.ttf", font_size)
                except OSError:
                    try:
                        font = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf", font_size)
                    except OSError:
                        font = ImageFont.load_default()
                label = str(i + 1)
                text_color = "white"
                outline_color = "black"

                # Draw the text with outline on the image
                draw_text_with_outline(draw, label, (10, 10), font, text_color, outline_color)

                images.append(img)

            if not images:
                # Skip subdirectories that contained no readable images
                continue

            # Combine images vertically
            widths, heights = zip(*(i.size for i in images))
            total_height = sum(heights)
            max_width = max(widths)
            combined_img = Image.new('RGB', (max_width, total_height))

            y_offset = 0
            for img in images:
                combined_img.paste(img, (0, y_offset))
                y_offset += img.height

            # Save the combined image
            combined_img.save(os.path.join(output_dir, f'{subdir}_combined.jpg'))

# Example usage
input_directory = './data/screenshots'
output_directory = './results/combined_images'
label_and_combine_images(input_directory, output_directory)
tohrnii commented 8 months ago

With some more tweaks to the prompt and workflow, I think it should be possible to get 12/12 correct.

r00dY commented 8 months ago

@tohrnii would you mind sharing your better prompt? Very curious!

tohrnii commented 8 months ago

It wasn't really much better. I just wanted to test the hypothesis that the problem with the original prompt was that it asked for the answer first and the reasoning second, so I tweaked your prompt as little as possible. Here's the prompt:

Hey, here a few screenshots of a section from a webpage. Each section has the same content but is designed a bit differently. Sections are labelled "1", "2", "3" and "4".
Only one of the sections is correctly and cleanly designed, the rest have some obvious design flaws. Tell me which one is correct. Think step by step and finally based on your reasoning give the answer in the format #1, #2, #3 or #4.
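
Since the prompt pins the final answer to a fixed #N format after the reasoning, scoring can be automated with a small helper (a sketch; it just takes the last #1-#4 tag in the response):

import re

def extract_answer(response_text):
    # The reasoning comes first, so take the last "#1".."#4" tag as the answer
    matches = re.findall(r"#([1-4])", response_text)
    return int(matches[-1]) if matches else None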

The prompt I tried is actually very bad, so it seems reasonable to assume that better prompting should get to 100% on this test. I'm much more interested in testing open-source vision models, so I did some more testing with better prompts on llava-1.5-13b. Surprisingly, it performed much better than random, which is not what I was expecting, and its reasoning was quite coherent in a few cases. For this I basically included some basic design principles in the prompt (and iterated on that a bit).

I think a basic improvement would be to either include some basic design principles in your prompt, or ask GPT-4 to write some design principles itself and then critique the options against them. It's important to give these models as much time/as many tokens as possible to reach a better answer. A rough sketch of that second idea follows below.
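
The workflow could be split into two calls, one to write the principles and one to critique against them (a sketch, not what I actually ran; ask_model is a stand-in for whatever chat call you use):

def judge_with_principles(ask_model, combined_image_path):
    # Step 1: have the model write its own checklist of design principles
    principles = ask_model(
        "List five concise principles of clean web section design "
        "(alignment, spacing, hierarchy, contrast, consistency)."
    )
    # Step 2: critique each option against the checklist, answer last
    critique_prompt = (
        f"Here are some design principles:\n{principles}\n\n"
        "The image shows four versions of the same section, labelled 1-4. "
        "Critique each version against every principle, then give your "
        "final answer in the format #1, #2, #3 or #4."
    )
    return ask_model(critique_prompt, image_path=combined_image_path)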

tohrnii commented 8 months ago

Also, I used the ChatGPT interface instead of the API, so it's possible there are some differences. The model behind the interface could be better tuned for instruction following, so you can more easily get away with bad prompts.